**2. Pattern matching with wildcards and length constraints**

The sequential pattern matching problem takes a text *T* and a pattern *P* as input and outputs all occurrences of *P* in *T*. Since Fischer and Paterson's work, a variety of non-standard definitions of the pattern matching problem have appeared, including approximate matching (He et al., 2007), swapped matching (Amir et al., 2000) and parameterized matching (Amir et al., 2009). They all belong to the class of *Non-standard Stringology* problems (Muthukrishnan, 1994), and many of them are still open.

## **2.1. The development of the PMWL problem**

After years of development, work on these *Non-standard Stringology* problems has centered on one question: how to make the traditional definition of pattern matching flexible enough to keep up with the needs of applications. The *don't cares* problem in particular concerns how wildcards are combined with the pattern. After Fischer and Paterson's work, Cole et al. considered a slightly different problem (Cole et al., 2004): instead of fixing the number of ¢s between two consecutive letters in *P* and *T*, they fixed the total number of ¢s in *P*. The disadvantage of these problem definitions is that the number of ¢s is a constant rather than a range, which limits the flexibility of user queries. To relax the fixed number of ¢s, Kucherov et al. (Kucherov et al., 1995) proposed a solution that allows an unbounded number of ¢s between two consecutive letters in a given pattern. Given a set of such patterns, their objective is to decide whether any of these patterns matches some substring of a text that does not contain any ¢. Allowing an unbounded number of ¢s, however, still does not give users enough control over their queries.

Manber et al. (Manber & Baeza-Yates, 1991) proposed an algorithm for string matching with a sequence of wildcards. They considered the following problem: given two pattern strings *P* and *Q*, each consisting of letters, and an integer *g*, return all occurrences of the form *P*¢<sup>0−*g*</sup>*Q* in the text. The number of ¢s between *P* and *Q* lies in the range [0, *g*], and the text does not contain any ¢. This problem is known as exact string matching with variable-length *don't cares*. Chen et al. summed up these definitions in three conditions (Chen et al., 2006): first, there is a wildcard gap between two consecutive letters in *P*, for example A¢[0,1]T¢[0,2]G¢[1,3]C; second, every letter in *T* can be used only once for matching; third, a global constraint limits the length of a matching occurrence. We call the problem satisfying these conditions the PMWL problem. It has been used in approximate matching, pattern mining, information retrieval, etc.

## **2.2. The potential applications of the PMWL problem**

1. Text Indexing: There is a large amount of hypertext information on the Internet, and obtaining information that meets users' needs effectively is becoming more and more urgent. Text indexing is one method for solving this problem. Determining the positions of a user-specified pattern (which may contain wildcards) is a challenging task.

2. Data streams are becoming more and more crucial in many new database applications such as data warehouses and sensor networks. Mining dependencies or associations in large data flows has practical value, and sequential pattern matching with wildcards is the first and most important step in doing so. In addition, sequential pattern mining in data mining also searches for frequent patterns in transaction sequences. A typical instance is a consumption pattern shared by many consumers, for example buying a desktop, a laser printer, a digital camera and an LCD monitor in turn, with a certain time interval between each purchase. Mining such typical user behavior, which is clearly a case of pattern matching with wildcards, will have a great influence on the market.

3. Network Security: Pattern matching methods in network security and intrusion detection require high performance. A complete IDS (Intrusion Detection System) based on Snort rules needs to optimize hundreds of rules, and many of them must perform pattern matching efficiently over the entire data portion of a packet. Efficient pattern matching and mining with wildcard constraints gives the system administrator a more flexible and accurate way to locate suspicious users.

## **2.3. Problem statement for PMWL**

**Definition 1** Let Σ be an alphabet. *T* = *t*<sub>0</sub>*t*<sub>1</sub>…*t*<sub>*n*-1</sub> ∈ Σ\* is called a **text** over Σ, where *n* = |*T*| is the length of *T*. A **pattern** is a tuple *P* = (*p*, *g*), where *p* = *p*<sub>0</sub>*p*<sub>1</sub>…*p*<sub>*m*-1</sub> is a sequence of characters from Σ and *g* = *g*<sub>0</sub>*g*<sub>1</sub>…*g*<sub>*m*-2</sub> is a sequence of wildcard gaps; *m* = |*P*| is the length of *P*. The interval of wildcards between *p*<sub>*i*</sub> and *p*<sub>*i*+1</sub> is denoted by *g*<sub>*i*</sub> = *g*(*N*<sub>*i*</sub>, *M*<sub>*i*</sub>), where 0 ≤ *i* ≤ *m* - 2, and is called a **local constraint**. *N*<sub>*i*</sub> and *M*<sub>*i*</sub> are the lower and upper limits of the number of wildcards. For example, in *P* = *a*¢[1,3]*g*, 1 and 3 are the lower and upper limits of the local constraint, and ¢[1,3] means that the wildcards between *a* and *g* stand for a string of length 1 to 3. Given an interval [*minLen*, *maxLen*] and matching positions *a*<sub>0</sub>, …, *a*<sub>*m*-1</sub> in *T* (see Definition 4), let *globalLength* = *a*<sub>*m*-1</sub> - *a*<sub>0</sub> + 1, the distance from the first matched position to the last. The requirement *globalLength* ∈ [*minLen*, *maxLen*] is called the **global constraint** (Chen et al., 2006).

**Definition 2** Given a pattern *P* = (*p*, *g*) with *p* = *p*<sub>0</sub>*p*<sub>1</sub>…*p*<sub>*m*-1</sub> and *g*<sub>*i*</sub> = *g*(*N*<sub>*i*</sub>, *M*<sub>*i*</sub>), the value *max*{*M*<sub>*i*</sub> - *N*<sub>*i*</sub>}, where 0 ≤ *i* ≤ *m* - 2, is called the *gap* of the local constraints, or **Gap** for short. For example, if *P* = *a*¢[0,2]*g*¢[1,4]*g*, then *gap* = *max*{2 - 0, 4 - 1} = 3.

The PMWL problem can be defined using the definitions above:

**Definition 3 PMWL problem** (Pattern Matching with Wildcards and Length constraints)

Pattern Matching with Wildcards and Length constraints meets the following conditions:

1. ¢s can occur between any two consecutive letters in the pattern, and the gaps are independent of each other;

2. the ¢s between two consecutive letters can match a string whose length is limited by the *local constraints*, and the total length of the matched pattern is limited by the *global constraint*;

3. the *one-off* condition is taken into consideration: every letter in *T* can be used only once for matching *p*<sub>*j*</sub> (0 ≤ *j* ≤ *m* - 1), and as soon as an occurrence of *P* is found while *T* is being scanned from left to right, it is returned.
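To make Definitions 1 and 2 concrete, the following Python sketch parses a pattern written in the ¢[*N*,*M*] notation and computes its gap. The function names `parse_pattern`, `gap` and `span_bounds` are hypothetical and do not come from the cited works. The sketch assumes single-character letters and a wildcard gap between every pair of consecutive letters. `span_bounds` relies on the observation that an occurrence covers its *m* letters plus the wildcards in each gap, so its length lies between *m* plus the sum of the lower limits and *m* plus the sum of the upper limits.

```python
import re


def parse_pattern(spec):
    """Split a pattern such as 'a¢[0,2]g¢[1,4]g' into letters and gaps.

    Returns (letters, gaps), where gaps[i] = (N_i, M_i) is the local
    constraint between letters[i] and letters[i + 1].
    Assumption: every letter is a single character.
    """
    # re.split keeps the captured bounds, giving [p0, N0, M0, p1, N1, M1, ...].
    parts = re.split(r"¢\[(\d+),(\d+)\]", spec)
    letters = parts[0::3]
    gaps = list(zip(map(int, parts[1::3]), map(int, parts[2::3])))
    if any(len(letter) != 1 for letter in letters):
        raise ValueError("expected exactly one letter between gaps")
    return letters, gaps


def gap(gaps):
    """Gap of the local constraints: max over i of (M_i - N_i)."""
    return max(upper - lower for lower, upper in gaps)


def span_bounds(letters, gaps):
    """Shortest and longest occurrence length allowed by the local constraints."""
    m = len(letters)
    shortest = m + sum(lower for lower, _ in gaps)
    longest = m + sum(upper for _, upper in gaps)
    return shortest, longest


letters, gaps = parse_pattern("a¢[0,2]g¢[1,4]g")
print(letters, gaps)                # ['a', 'g', 'g'] [(0, 2), (1, 4)]
print(gap(gaps))                    # 3, as in the example of Definition 2
print(span_bounds(letters, gaps))   # (4, 9)
```

A global constraint [*minLen*, *maxLen*] only has an effect where it is narrower than the range returned by `span_bounds`.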


**Definition 4** Given a text *T* and a pattern *P*, if there is a sequence of matching positions *A* = (*a*<sub>0</sub>, *a*<sub>1</sub>, …, *a*<sub>*m*-1</sub>) such that *t*[*a*<sub>*i*</sub>] = *p*[*i*] for every 0 ≤ *i* ≤ *m* - 1, we say *A* is a matching **occurrence** of *P*. A set of occurrences *A*<sub>1</sub>, *A*<sub>2</sub>, ..., *A*<sub>*t*</sub> constitutes an occurrence set *U*, where *t* is the number of occurrences, also called the **matching number** in this paper.
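As an illustration, the sketch below checks whether a candidate sequence of positions is an occurrence. Definition 4 as stated only requires the letters to agree; the sketch additionally checks the local and global constraints of Definition 1, on the assumption that an occurrence in the PMWL sense must satisfy them. The name `is_occurrence` and its parameters are hypothetical, and positions are 0-based.

```python
def is_occurrence(text, letters, gaps, positions, min_len, max_len):
    """Check a candidate occurrence A = (a_0, ..., a_{m-1}) of a pattern.

    letters[i] is p_i and gaps[i] = (N_i, M_i) is the local constraint
    between p_i and p_{i+1}.
    """
    if len(positions) != len(letters):
        return False
    # Positions must lie inside the text.
    if any(a < 0 or a >= len(text) for a in positions):
        return False
    # Letter condition of Definition 4: t[a_i] = p[i].
    if any(text[a] != p for a, p in zip(positions, letters)):
        return False
    # Local constraints: the number of wildcards between consecutive matched
    # letters must lie in [N_i, M_i].
    for (lower, upper), a, b in zip(gaps, positions, positions[1:]):
        if not lower <= b - a - 1 <= upper:
            return False
    # Global constraint: a_{m-1} - a_0 + 1 must lie in [min_len, max_len].
    global_length = positions[-1] - positions[0] + 1
    return min_len <= global_length <= max_len


letters, gaps = ["a", "g", "g"], [(0, 2), (1, 4)]   # a¢[0,2]g¢[1,4]g
print(is_occurrence("aggcg", letters, gaps, (0, 1, 4), 4, 9))  # True
print(is_occurrence("aggcg", letters, gaps, (0, 1, 2), 4, 9))  # False
```

The second call fails because positions 1 and 2 are adjacent, while the second gap requires at least one wildcard between the two *g*s.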

**Definition 5** Let *t* be the matching number of an occurrence set *A*. If there is no occurrence set *A'* with matching number *t'* > *t*, then *A* is called a **complete occurrence set**. If another occurrence set *U* has matching number *t*<sub>*u*</sub> = *t*, then *U* is equivalent to *A*; in particular, if *A* is complete, so is *U*. The complete occurrence set is therefore not always unique.
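The one-off condition and the matching number can be illustrated with a simple left-to-right scan. The sketch below is a greedy heuristic written for this illustration only. It is not the algorithm of any cited work, and it is not guaranteed to return a complete occurrence set in the sense of Definition 5. The names `find_from` and `one_off_occurrences` are hypothetical. The sketch assumes that only positions matched to pattern letters are consumed, so wildcards may still span positions that an earlier occurrence has used.

```python
def find_from(text, letters, gaps, used, positions, min_len, max_len):
    """Depth-first search that extends `positions` to a full occurrence.

    Returns the list of positions, or None if no extension exists.
    """
    i = len(positions)
    if i == len(letters):
        length = positions[-1] - positions[0] + 1
        return positions if length >= min_len else None
    prev = positions[-1]
    lower, upper = gaps[i - 1]
    # The next position is bounded by the local constraint, the global
    # constraint and the end of the text.
    last = min(prev + upper + 1, positions[0] + max_len - 1, len(text) - 1)
    for pos in range(prev + lower + 1, last + 1):
        if text[pos] == letters[i] and pos not in used:
            found = find_from(text, letters, gaps, used,
                              positions + [pos], min_len, max_len)
            if found is not None:
                return found
    return None


def one_off_occurrences(text, letters, gaps, min_len, max_len):
    """Scan the text left to right and collect occurrences greedily.

    Each text position is matched to a pattern letter at most once.
    """
    used = set()
    occurrences = []
    for start in range(len(text)):
        if text[start] != letters[0] or start in used:
            continue
        occ = find_from(text, letters, gaps, used, [start], min_len, max_len)
        if occ is not None:
            used.update(occ)
            occurrences.append(tuple(occ))
    return occurrences


# Pattern a¢[0,1]g with global constraint [2, 3].
print(one_off_occurrences("agg", ["a", "g"], [(0, 1)], 2, 3))   # [(0, 1)]
print(one_off_occurrences("aagg", ["a", "g"], [(0, 1)], 2, 3))  # [(0, 2), (1, 3)]
```

In the text *agg*, both (0, 1) and (0, 2) satisfy the letter and length conditions, but they share position 0, so under the one-off condition the matching number is 1. In *aagg* the two occurrences use disjoint positions and the matching number is 2.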
