**1. Introduction**

The practical importance of the string matching problem is well established, and an immense amount of work has been done on it for typical word-processing applications. With developments in bioinformatics (Cole et al., 2005), information retrieval (Califf et al., 2003), and pattern mining (Xie et al., 2010; Ji et al., 2007; He et al., 2007), sequential Pattern Matching with Wildcards and Length constraints (PMWL) has attracted increasing attention. Realistic cases in which PMWL plays an important role are easy to find. Gusfield (1997) uses *transcription factors* to illustrate the concept of a wildcard. A *transcription factor* is a protein that binds to specific locations in DNA and regulates the transcription of the DNA into RNA. In this way, production of the protein that the DNA codes for is regulated. Many transcription factors are known, and they can be separated into families characterized by specific substrings containing wildcards. Gusfield takes the *Zinc Finger*, a common transcription factor, as an example. It has the following signature:

CYS¢¢CYS¢¢¢¢¢¢¢¢¢¢¢¢¢HIS¢¢HIS

where CYS is the amino acid cysteine, HIS is the amino acid histidine, and ¢ denotes a wildcard. Gusfield also concludes that if the number of wildcards is bounded by a fixed constant, the problem can be solved in linear time.
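To make the fixed-wildcard case concrete, the following minimal sketch matches the Zinc Finger signature against a sequence of residues. It assumes that a protein is given as a list of three-letter residue names; the names `WILDCARD`, `ZINC_FINGER`, `find_signature` and the sample `protein` are hypothetical and introduced only for illustration. It is a naive scan, not the linear-time method mentioned above.

```python
WILDCARD = "¢"

# The Zinc Finger signature from the text, one list item per residue position.
ZINC_FINGER = (["CYS"] + [WILDCARD] * 2 + ["CYS"] + [WILDCARD] * 13
               + ["HIS"] + [WILDCARD] * 2 + ["HIS"])


def find_signature(pattern, residues):
    """Return every 0-based start index at which pattern matches residues.

    A wildcard position matches any residue; every other position must be equal.
    This is the naive O(n * m) scan, not a linear-time algorithm.
    """
    m = len(pattern)
    hits = []
    for start in range(len(residues) - m + 1):
        window = residues[start:start + m]
        if all(p == WILDCARD or p == r for p, r in zip(pattern, window)):
            hits.append(start)
    return hits


# Made-up residue sequence (illustration only) with one instance at index 3.
protein = (["ALA"] * 3
           + ["CYS", "GLY", "SER", "CYS"] + ["LEU"] * 13
           + ["HIS", "THR", "PRO", "HIS"]
           + ["GLY"] * 2)
print(find_signature(ZINC_FINGER, protein))  # [3]
```

Because every wildcard here stands for exactly one residue, the pattern has a fixed length of 21 positions and each candidate start can be checked independently.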

© 2012 Wang et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Another representative example concerns the *promoter*. In bioinformatics, a *promoter* helps researchers quickly locate the starting position of the intron within sequences of hundreds of millions of *ACGT* letters. Among promoters, the *TATA* box is a common one (Manber & Baeza-Yates, 1991). It has very loose sequence specificity, so many *TATA* sequences are not *TATA* boxes. As a result, indirect positioning by pairs of sites is needed. A commonly used partner site is the *CAATCT* sequence: the DNA sequence *TATA* is a common promoter that often occurs after the sequence *CAATCT* within 30-50 wildcards. Matching patterns with wildcards is therefore especially crucial for extracting valuable information from DNA sequences.
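The pairing of *CAATCT* and *TATA* can be sketched as a search with a bounded gap. The code below assumes that "within 30-50 wildcards" means that between 30 and 50 arbitrary letters separate the end of *CAATCT* from the start of *TATA*; the function name `find_with_gap` and the sample sequence are hypothetical.

```python
def find_with_gap(text, first, second, min_gap, max_gap):
    """Return (start_of_first, start_of_second) pairs such that `second` begins
    between min_gap and max_gap letters after the end of `first`."""
    hits = []
    start = text.find(first)
    while start != -1:
        end_first = start + len(first)
        for gap in range(min_gap, max_gap + 1):
            pos = end_first + gap
            if text.startswith(second, pos):
                hits.append((start, pos))
        # Continue with the next (possibly overlapping) occurrence of `first`.
        start = text.find(first, start + 1)
    return hits


# Made-up sequence: CAATCT, then 35 filler letters, then TATA.
dna = "GG" + "CAATCT" + "G" * 35 + "TATA" + "CC"
print(find_with_gap(dna, "CAATCT", "TATA", 30, 50))  # [(2, 43)]
```

A *TATA* that lies closer than 30 or farther than 50 letters from *CAATCT* is ignored, which is how the pair of sites filters out *TATA* sequences that are unlikely to be *TATA* boxes.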

What is more, given the theoretical and practical importance of PMWL, we need to study the nature of the problem. To the best of our knowledge, there is still no efficient method for it, and, as for its completeness, it is still unknown whether it can be solved in polynomial time. In this chapter we study the completeness of PMWL under certain conditions. In the traditional matching problem, the description of pattern and text information is the key to algorithm design. However, the flexibility and complexity of the PMWL problem depend entirely on the pattern features, so this chapter focuses on pattern information, especially pattern features such as the size of the alphabet, the length of the pattern, and the *gap* of wildcards in the pattern. We also investigate the relationship between pattern features and completeness, using the approximation ratio as the criterion. Furthermore, since the definition itself arises from a realistic background, we need to consider the situation in real biological settings and improve the solution of the problem. For these reasons, we choose this topic as the research object of this chapter.

This chapter is organized as follows. Section 2 gives the development, definition, and applications of the PMWL problem. Section 3 presents the representative algorithms and introduces their structure, strategy, complexity, and completeness. Section 4 analyzes the completeness of the PMWL problem based on pattern features. Section 5 gives our conclusions.

**2. Pattern matching with wildcards and length constraints**

The sequential pattern matching problem takes a text *T* and a pattern *P* as input and outputs all the occurrences of *P* in *T*. Since Fischer and Paterson's work, a variety of non-standard definitions of the pattern matching problem have appeared: approximate matching (He et al., 2007), swapped matching (Amir et al., 2000), parameterized matching (Amir et al., 2009), etc. They all belong to *Non-standard Stringology* (Muthukrishnan, 1994). Many of them are still open problems.

Many applications involve pattern matching with wildcards, and researchers have proposed solutions to different forms of the problem. Fischer and Paterson were the first to generalize pattern matching to wildcards (Fischer & Paterson, 1974): given a pattern *P* and a text *T*, either of which may contain wildcards, denoted by ¢, the goal is to locate all occurrences of *P* in *T*. A wildcard ¢ can match any letter in a given alphabet, as in the pattern a¢¢c¢t. Unlike previous work, Chen et al. proposed the PMWL problem, which integrates two kinds of constraints (Chen et al., 2006). The first is complex local constraints: the user can specify a different range for the number of wildcards between each two consecutive letters of *P*, for example a¢[0,3]c¢[1,3]t. The second is global length constraints: the user can constrain the length of each matching substring of *T* in which *P* occurs. Flexible wildcard constraints therefore allow flexible jumps between matching positions. The PMWL definition extends Fischer and Paterson's definition, and the introduction of complex local constraints increases its flexibility. On the one hand, this definition of a pattern is better suited to areas such as bioinformatics; on the other hand, the number of candidate matching positions grows exponentially, which greatly increases the complexity of solving the problem.

**Figure 1.** The flexibility and complexity of PMWL problem
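The local and global constraints described above can be illustrated with a small brute-force enumerator. This is a sketch under stated assumptions rather than any of the published algorithms: a pattern such as a¢[0,3]c¢[1,3]t is represented by its letters `"act"` and a list of gap ranges `[(0, 3), (1, 3)]`, and the global length of an occurrence is taken to be the span from its first to its last matched position. The names `pmwl_occurrences`, `letters`, `gaps`, `min_len` and `max_len` are hypothetical.

```python
def pmwl_occurrences(text, letters, gaps, min_len=None, max_len=None):
    """Enumerate every occurrence as a tuple of 0-based text positions.

    letters: the non-wildcard letters of the pattern, in order.
    gaps:    gaps[k] = (lo, hi), the allowed number of wildcards between
             letters[k] and letters[k + 1].
    min_len, max_len: optional bounds on the span of an occurrence.
    """
    results = []

    def extend(positions):
        k = len(positions)
        if k == len(letters):
            span = positions[-1] - positions[0] + 1
            if ((min_len is None or span >= min_len)
                    and (max_len is None or span <= max_len)):
                results.append(tuple(positions))
            return
        lo, hi = gaps[k - 1]
        # Try every allowed number of wildcards before the next letter.
        for gap in range(lo, hi + 1):
            nxt = positions[-1] + gap + 1
            if nxt < len(text) and text[nxt] == letters[k]:
                extend(positions + [nxt])

    for start, ch in enumerate(text):
        if ch == letters[0]:
            extend([start])
    return results


# Pattern a¢[0,3]c¢[1,3]t on a short made-up text.
print(pmwl_occurrences("aacct", "act", [(0, 3), (1, 3)]))             # [(0, 2, 4), (1, 2, 4)]
print(pmwl_occurrences("aacct", "act", [(0, 3), (1, 3)], max_len=4))  # [(1, 2, 4)]
```

Each pattern letter can branch over every gap value in its range, so the number of candidate position tuples can grow exponentially with the pattern length, which is the source of complexity noted above.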

From a practical point of view, they also proposed two versions of the problem: with and without the *one-off* condition (Chen et al., 2006; Min et al., 2009). In their problem definition, users have more flexibility in searching sequences, and the *one-off* condition has both theoretical and practical significance. The *one-off* condition means that every letter in *T* can be used at most once. Whether or not to impose it depends on the application area. For example, in sequential pattern analysis in data mining, *P* can be treated as a candidate shopping pattern and the user is interested in how frequently *P* occurs in one document; it then makes sense to count each occurrence only once. Moreover, the *one-off* condition also makes solving the problem feasible. However, under the *one-off* condition, allocating the limited text positions among the matching occurrences so as to obtain the maximum number of occurrences is an optimization problem. In this allocation, the matches of different letters in the pattern are strongly correlated, which causes a combinatorial explosion in the selection of matching positions. Because it is difficult to develop a complete matching strategy for this problem, almost all existing algorithms for PMWL use greedy matching strategies, which is the root reason why these matching algorithms are not complete. This chapter focuses on the SAIL algorithm (Chen et al., 2006), a representative algorithm for the PMWL problem, and also describes the RSAIL (Wang et al., 2010), SBO (Wu et al., 2011), and BPBM (Guo et al., 2011) algorithms, which are designed to solve the PMWL problem under different conditions. Each of these algorithms has its own characteristics in its data structures and matching strategies, which are analyzed in this chapter.
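A small experiment can show why a greedy strategy may miss occurrences under the *one-off* condition. The sketch below is not SAIL or any of the other cited algorithms; it assumes a simple leftmost-first greedy rule and compares it with an exhaustive search on a tiny input. The pattern a¢[1,2]a and the text `aaaaaa` are made up for illustration, and the names `occurrences`, `greedy_one_off` and `optimal_one_off` are hypothetical.

```python
def occurrences(text, letters, gaps):
    """Enumerate all occurrences (tuples of 0-based positions), leftmost first."""
    found = []

    def extend(positions):
        k = len(positions)
        if k == len(letters):
            found.append(tuple(positions))
            return
        lo, hi = gaps[k - 1]
        for gap in range(lo, hi + 1):
            nxt = positions[-1] + gap + 1
            if nxt < len(text) and text[nxt] == letters[k]:
                extend(positions + [nxt])

    for start, ch in enumerate(text):
        if ch == letters[0]:
            extend([start])
    return found


def greedy_one_off(text, letters, gaps):
    """Keep each occurrence, in leftmost-first order, whose positions are unused."""
    used, chosen = set(), []
    for occ in occurrences(text, letters, gaps):
        if used.isdisjoint(occ):
            chosen.append(occ)
            used.update(occ)
    return chosen


def optimal_one_off(text, letters, gaps):
    """Exhaustive search for a largest set of position-disjoint occurrences.

    Exponential in the number of occurrences; usable only on tiny inputs.
    """
    occs = occurrences(text, letters, gaps)

    def best(i, used):
        if i == len(occs):
            return []
        skip = best(i + 1, used)
        if used.isdisjoint(occs[i]):
            take = [occs[i]] + best(i + 1, used | set(occs[i]))
            if len(take) > len(skip):
                return take
        return skip

    return best(0, frozenset())


# Pattern a¢[1,2]a on the text "aaaaaa".
print(len(greedy_one_off("aaaaaa", "aa", [(1, 2)])))   # 2
print(len(optimal_one_off("aaaaaa", "aa", [(1, 2)])))  # 3
```

Here the greedy rule selects positions (0, 2) and (1, 3), after which positions 4 and 5 cannot form an occurrence, whereas (0, 3), (1, 4), (2, 5) uses every letter exactly once. The early choices constrain the later ones, which is the correlation between matching positions described above.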

