### **4.1. The impact of the alphabet, the length of pattern and the gap on completeness**

In traditional pattern matching research, the pattern length and the alphabet size are the key factors that determine time complexity. Given the definition of the PMWL problem, the upper and lower limits of the length constraints are also likely to affect problem solving. Rather than the limits themselves, we consider the distance between them, called *gap*. The parameters related to the completeness of an algorithm may therefore be the alphabet size, the pattern length, and the distance between the upper and lower limits, denoted Σ, *m*, and *gap* respectively. In this article, the degree of completeness of an algorithm is measured by the approximation ratio ε. Consequently, we try to build the following model:

$$\varepsilon = \mathcal{F} \left( \Sigma, \, m, \, gap \right) \tag{1}$$

Because Σ is fixed within a given application area (in bioinformatics, for example, DNA sequences are defined on Σ = {a, c, g, t}), the above formula can be simplified to ε = F (*m*, *gap*). In our experiments the input text is a DNA sequence, so Σ = {a, c, g, t}. The remaining parameters take the values *gap* ∈ [1, 30] and *m* ∈ [3, 9], which gives 30 × 7 = 210 groups of experiments. The aim is to find the approximation ratio ε for each group.

First, a pattern *P* is generated randomly by a pattern generator according to Σ, *m*, and *gap*. For example, when *m* = 5, Σ = {a, c, g, t}, and *gap* = 2, a¢[0,2]c¢[0,2]c¢[0,2]t¢[0,2]g is a qualified pattern. For simplicity, every pair of consecutive characters in a generated pattern has the same length constraint, determined by *gap*. The approximation ratio ε is then calculated for each pattern. Since **ε** = N(*UALG*) / N(*Uopt*), we need to know N(*Uopt*). It cannot be obtained directly from an arbitrary text *T*, however, because no known algorithm yields the complete solution, and a simple brute-force search requires exponential time. We have therefore developed a text generator, which generates a text *T* according to *P* and N(*Uopt*). SAIL is currently regarded as the most representative algorithm for the PMWL problem, since it was the first to adopt the *left-most* strategy, which has since been applied in other settings such as the BPBM algorithm (Guo et al., 2011) and the mining algorithm MAIL (Xie et al., 2010). Based on the above analysis, we take SAIL as the research object, that is, N(*UALG*) = N(*USAIL*).
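The pattern generator itself is not listed in the text. As a minimal sketch of what such a generator might look like, the snippet below draws *m* characters uniformly from Σ and joins them with one shared wildcard constraint. The function name `generate_pattern`, the `min_len` parameter (lower limit, assumed to be 0 as in the example above) and the uniform sampling are all assumptions made for illustration.

```python
import random


def generate_pattern(alphabet, m, gap, min_len=0, rng=None):
    """Return (chars, pattern_string) for a random pattern of length m.

    Every pair of consecutive characters shares the same wildcard
    constraint [min_len, min_len + gap]; min_len = 0 is an assumption.
    """
    if rng is None:
        rng = random.Random()
    # Draw m characters uniformly at random from the alphabet.
    chars = [rng.choice(alphabet) for _ in range(m)]
    # Build the wildcard separator, e.g. "¢[0,2]" for gap = 2.
    wildcard = f"¢[{min_len},{min_len + gap}]"
    return chars, wildcard.join(chars)


if __name__ == "__main__":
    chars, pattern = generate_pattern("acgt", m=5, gap=2, rng=random.Random(0))
    print(pattern)
```

Passing a seeded `random.Random` makes a batch of patterns reproducible, which is convenient when the same 100 patterns have to be reused across runs.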

Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness 315


In summary, the concrete steps of the experiment are as follows:

1. For given Σ, *m* and *gap*, 100 patterns *p*<sub>i</sub> are generated randomly, where *i* = 1, 2, ..., 100;

2. For pattern *p*<sub>i</sub>, given N(*Uopt*) = 100 and text length *n* = 2000, generate text *T*<sub>i</sub>;

3. For *T*<sub>i</sub>, call the SAIL algorithm to get N(*USAIL*);

4. Calculate **ε<sub>i</sub>** = N(*UALG*) / N(*Uopt*);

$$\textbf{5.}\quad \text{Calculate } \varepsilon = \sum_{i=1}^{100} \varepsilon_i \,/\, 100.$$

| | Experiment 1 | Experiment 2 |
|---|---|---|
| Σ | 4 | 7 |
| *m* | 3~9 | 3~9 |
| *gap* | 1~29 | 1~29 |

**Table 7.** Parameters in experiments for ε = F (*gap*)

The experimental results:

**Figure 4.** Curves of ε = F (*gap*) in experiment 1

Figure 4 shows that ε gradually decreases as *m* increases. As *gap* increases, ε first decreases and then increases. In particular, ε = 1 when *gap* = 1, since the *left-most* strategy then obtains a complete occurrence set. As *gap* grows, ε begins to decline because matching occurrences are more likely to overlap, so the algorithm loses occurrences more easily. Once *gap* is sufficiently large, occurrences still overlap, but the larger *gap* leaves enough room for matching, so the occurrences that have not yet been matched still have enough positions available. It is also worth noting that these curves reach their minimum when *gap* is about 7, regardless of the pattern length.

**Figure 5.** Curves of ε = F (*gap*) in experiment 2

In Figure 5, the curves follow the same trend as in Figure 4. The difference is that the curves in Figure 5 reach their minimum when *gap* is about 9~11. The impact of Σ, *m*, and *gap* on the curves can be summarized as follows: *gap* determines the trend of the curve, *m* affects the magnitude of the change, and Σ translates the curve.


After a series of experiments, we speculate that

$$\varepsilon = A \cdot gap^{4} + B \cdot gap^{3} + C \cdot gap^{2} + D \cdot gap + E$$

where A, B, C, D and E are parameters that take different values for different *m*. We use this model to describe the relation between *gap* and the approximation ratio ε.

[Table 8: parameters A, B, C, D and E for each *m*]

**Table 8.** Parameters in mathematic model
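To make the quartic model concrete, the sketch below evaluates it for a given *gap* with Horner's rule and scans a range of *gap* values for the predicted minimum. The coefficients in `example_coeffs` are placeholders chosen for illustration only; they are not the values of Table 8, and the helper names `predict_epsilon` and `argmin_gap` are hypothetical.

```python
def predict_epsilon(gap, coeffs):
    """Evaluate eps = A*gap^4 + B*gap^3 + C*gap^2 + D*gap + E (Horner's rule)."""
    a, b, c, d, e = coeffs
    return (((a * gap + b) * gap + c) * gap + d) * gap + e


def argmin_gap(coeffs, gaps):
    """Return the gap value in `gaps` with the smallest predicted epsilon."""
    return min(gaps, key=lambda g: predict_epsilon(g, coeffs))


# Placeholder coefficients (A, B, C, D, E); NOT the fitted values of Table 8.
example_coeffs = (0.0, 0.0, 0.001, -0.02, 1.0)

if __name__ == "__main__":
    for gap in (1, 5, 10, 20):
        print(gap, predict_epsilon(gap, example_coeffs))
    print("predicted minimum at gap =", argmin_gap(example_coeffs, range(1, 30)))
```

With real coefficients for a given *m*, the same two calls would reproduce one fitted curve and locate its minimum.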

316 Bioinformatics

Using this parameter table, some fitted curves for *m* = 3, 4, ..., 14 are shown below, where the horizontal axis is *gap* and the vertical axis is ε.

**Figure 6.** Model fitting

We believe this model can be used to predict the completeness of solutions for a given pattern. For example, given *m* = 10, Σ = {a, c, g, t} and *gap* = 5, the model predicts that the approximation ratio ε of the SAIL algorithm is about 0.878. The model can therefore be used in pattern mining, as shown below.

PMWL pattern mining evaluation mechanism

*Input***:** Given *T*, Σ, *m*, *gap*, support *sup*

*Output***:** pattern *P*

Mining algorithms borrow their strategy from matching algorithms, so the PMWL pattern mining problem is naturally based on the PMWL matching problem. For example, the mining algorithm MAIL (Xie et al., 2010) uses a graph structure, which distinguishes it from SAIL, but it is still based on the *left-most* strategy. As a result, the two have the same degree of completeness, and our model can provide an evaluation mechanism for mining.

### **4.2. The impact of pattern** *rep* **on completeness**

In this part, we introduce another important concept, named *rep*, and analyze its impact on completeness. We first give an example to illustrate why this concept is needed. Given *m* = 4, Σ = {a, c, g, t} and *gap* = 2, the corresponding pattern may be *P*1 = a¢[0,2]c¢[0,2]g¢[0,2]t or *P*2 = a¢[0,2]c¢[0,2]c¢[0,2]t. Both have the same Σ, *m* and *gap*. However, when SAIL or BPBM is applied, the completeness of the solutions differs: for *P*1 the algorithms can obtain complete solutions, while for *P*2 they cannot.

Consider the two examples below:

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| *T* | b | c | b | b | b | c | c |

*P* = b¢[1,2]b¢[1,2]c

**Table 9.** Example 1 for *rep* concept

A complete occurrence set for this example is {{0, 3, 5}, {2, 4, 6}}, so the number of matching occurrences is 2. It is not difficult to see that, in the SAIL algorithm, for *position* 5, selecting position 2 as the occurrence of *p*[1] under the *left-most* strategy consumes a position needed by the next matching occurrence. We can guess that the recurring 'b' character in this pattern affects the quality of the matching occurrences.

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| *T* | a | a | c | c | c | c |

*P* = a¢[0,1]c¢[0,1]c

**Table 10.** Example 2 for *rep* concept

In this example, a complete occurrence set is {{0, 2, 4}, {1, 3, 5}}, so the number of matching occurrences is 2. If the SAIL algorithm first obtains {0, 2, 3}, it finds only this occurrence and loses {0, 2, 4} and {1, 3, 5}. Here too, the recurring character ('c') in the pattern affects completeness.

From the above examples, the way recurring characters in the pattern are matched may determine the completeness of the algorithm. We therefore consider this repeatability as a factor that influences completeness.
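The two examples can be replayed with a small script. The sketch below is not the SAIL algorithm itself: `greedy_left_most` is a simplified left-most greedy matcher without backtracking, written only to imitate the behaviour described above, and `max_one_off` finds a largest set of position-disjoint occurrences by exhaustive search, which is feasible only for tiny inputs. All function names are hypothetical, and a single constraint [`lo`, `hi`] is assumed between every pair of consecutive pattern characters.

```python
from itertools import combinations


def occurrences(text, chars, lo, hi):
    """All occurrences of chars[0]¢[lo,hi]chars[1]... as position tuples."""
    results = []

    def extend(prefix):
        k = len(prefix)
        if k == len(chars):
            results.append(tuple(prefix))
            return
        if k == 0:
            candidates = range(len(text))
        else:
            # Between two matched positions there are lo..hi wildcards.
            first = prefix[-1] + 1 + lo
            last = min(prefix[-1] + 1 + hi, len(text) - 1)
            candidates = range(first, last + 1)
        for pos in candidates:
            if text[pos] == chars[k]:
                extend(prefix + [pos])

    extend([])
    return results


def max_one_off(occs):
    """Largest set of pairwise position-disjoint occurrences (exhaustive)."""
    for size in range(len(occs), 0, -1):
        for subset in combinations(occs, size):
            used = [p for occ in subset for p in occ]
            if len(used) == len(set(used)):
                return list(subset)
    return []


def greedy_left_most(text, chars, lo, hi):
    """Simplified left-most matching under the one-off condition."""
    used, found = set(), []
    for start in range(len(text)):
        if start in used or text[start] != chars[0]:
            continue
        occ = [start]
        for ch in chars[1:]:
            first = occ[-1] + 1 + lo
            last = min(occ[-1] + 1 + hi, len(text) - 1)
            # Take the left-most unused position that matches ch.
            nxt = next((q for q in range(first, last + 1)
                        if q not in used and text[q] == ch), None)
            if nxt is None:
                break
            occ.append(nxt)
        else:
            used.update(occ)
            found.append(tuple(occ))
    return found


if __name__ == "__main__":
    for text, chars, lo, hi in [("bcbbbcc", "bbc", 1, 2), ("aacccc", "acc", 0, 1)]:
        print(text, greedy_left_most(text, chars, lo, hi),
              max_one_off(occurrences(text, chars, lo, hi)))
```

On both texts the greedy matcher returns a single occurrence while the exhaustive search returns two, which is the loss of completeness that the examples illustrate.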

In order to quantify this repeatability, the concept *rep* is proposed in this paper.

**Definition 9** Given a pattern *P* = *p*<sub>0</sub>*p*<sub>1</sub>…*p*<sub>m-1</sub>, let *f*<sub>ij</sub> = (*p*<sub>i</sub>, *p*<sub>j</sub>) range over all binary combinations of characters in pattern *P*. Let

$$f_{ij} = \begin{cases} 0, & p_i \neq p_j \\ 1, & p_i = p_j \end{cases}, \quad \text{and} \quad rep = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} f_{ij}, \quad \text{where } 0 \le i,\, j \le m - 1 \text{ and } i \neq j.$$

Then *rep* is the repeatability of characters in the pattern. It shows the number of pairs of identical characters in the pattern.

**Definition 10** Given occurrences *A* and *S*, if *a*[*i*] = *s*[*k*] for some 0 ≤ *i* ≤ *m*-1 and 0 ≤ *k* ≤ *m*-1, we say that *A* conflicts with *S*. For example, with *T* = aacccc and *P* = a¢[0,1]c¢[0,1]c, {0,2,3} conflicts with {0,2,4} and {1,3,5}, and "c" is the conflict letter.

For simplicity, the global length constraint is deliberately ignored in our proofs; this does not affect the conclusions.

**LEMMA 1** Given two occurrences *A* and *S*, if *A* and *S* come from the same occurrence set, then *a*[*i*] ≠ *s*[*k*] for all 0 ≤ *i* ≤ *m*-1, 0 ≤ *k* ≤ *m*-1.

*Proof*: Assume *a*[*i*] = *s*[*k*]. Then *A* conflicts with *S*, so they cannot belong to the same set, which is a contradiction. Lemma 1 is proved.

**LEMMA 2** Given two occurrences *A* and *S* where *S*∈*USAIL*, suppose there is a conflict between *A* and *S*, and let *a*[*t*] and *s*[*i*] be the conflict positions. According to Definition 10, under the *one-off* condition, *A* should be discarded. Moreover, if *i* = *t*, then *s*[*i*] = *a*[*i*]; if *i* ≠ *t*, then *s*[*i*] < *a*[*i*], where 0 ≤ *i* ≤ *m*-1. For instance, let *S* = {0, 2, 3} and *A* = {1, 2, 4}. Since *s*[1] = *a*[1], the conflict position is 1, and the other positions satisfy *s*[0] < *a*[0] and *s*[2] < *a*[2].

*Proof*: Assume *s*[*i*] > *a*[*i*]. Then *a*[*i*] lies to the left of *s*[*i*] in *T*. Under the *left-most* strategy of SAIL, the left-most candidate is selected first, which is *a*[*i*]. But *S*∈*USAIL* by hypothesis, so *s*[*i*] was selected, which is a contradiction. Thus *s*[*i*] ≤ *a*[*i*]. If *i* = *t*, then *s*[*i*] = *a*[*t*] = *a*[*i*]; if *i* ≠ *t*, then *s*[*i*] = *a*[*t*] ≠ *a*[*i*], and we conclude that *s*[*i*] < *a*[*i*].

**LEMMA 3** Given a text *T*, a pattern *P* and an occurrence *S*, let *USAIL* be the occurrence set of SAIL. If *S* ∉ *USAIL*, then *S* conflicts with at least one occurrence in *USAIL*.

*Proof*: Assume *S* does not conflict with any occurrence in *USAIL*. Then the only reason SAIL could lose *S* is the length constraint. But according to Definition 4, all occurrences satisfy the length constraint, which is a contradiction. So the lemma is proved.

**LEMMA 4** Let *USAIL* be the occurrence set of SAIL, and *Uopt* be the optimal one. Let *NSAIL* (*Nopt*) be the number of matches in *USAIL* (*Uopt*).

(1) If *USAIL* is the complete set, then *NSAIL* = *Nopt*.

(2) Otherwise, *NSAIL* < *Nopt* and there is a conflict between *USAIL* and *Uopt*.

(3) If *Nopt* = 1, then *NSAIL* = *Nopt*.

(4) If there is no conflict, then *NSAIL* = *Nopt*.



*Proof*: (1) is obviously true. According to Definition 5, if *NSAIL* < *Nopt*, there is an occurrence *S* satisfying *S*∈*Uopt* and *S* ∉ *USAIL*. By LEMMA 3, *S* conflicts with at least one occurrence in *USAIL*, that is, *Uopt* conflicts with *USAIL*. So (2) is proved. For (3), let *S* be the unique occurrence in *Uopt*. Assume *NSAIL* < *Nopt*; then *NSAIL* = 0, that is, SAIL finds no occurrence. By LEMMA 3, *S* conflicts with at least one occurrence of SAIL. But *USAIL* is empty, so there is no conflict, which is a contradiction. So (3) is proved. For (4), it is obvious that *NSAIL* ≤ *Nopt*. Assume *NSAIL* < *Nopt*; then there is an occurrence *S* satisfying *S*∈*Uopt* and *S* ∉ *USAIL*. By LEMMA 3, *S* conflicts with at least one occurrence in *USAIL*, that is, *Uopt* and *USAIL* have a conflict, which contradicts the premise. So (4) is proved.
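The two definitions that these lemmas rest on, *rep* (Definition 9) and the conflict relation (Definition 10), can be sketched in a few lines. Two points are assumptions of this sketch rather than statements of the text: `rep` counts each unordered pair of equal characters once, following the prose "number of pairs" (a literal double sum over all *i* ≠ *j* would count every pair twice), and `sets_conflict` treats two occurrence sets as conflicting when two distinct occurrences, one from each set, share a text position.

```python
from itertools import combinations


def rep(pattern_chars):
    """Number of index pairs i < j whose pattern characters are equal."""
    return sum(1 for a, b in combinations(pattern_chars, 2) if a == b)


def conflicts(occ_a, occ_s):
    """Definition 10: two occurrences conflict if they share a text position."""
    return not set(occ_a).isdisjoint(occ_s)


def sets_conflict(u1, u2):
    """Assumed set-level reading: some pair of distinct occurrences conflicts."""
    return any(a != s and conflicts(a, s) for a in u1 for s in u2)


if __name__ == "__main__":
    print(rep("acgt"), rep("acct"))          # characters of P1 and P2 above
    print(conflicts((0, 2, 3), (1, 3, 5)))   # share text position 3
    print(sets_conflict([(0, 2, 3)], [(0, 2, 4), (1, 3, 5)]))
```

Under this reading, a pattern has a repeated character exactly when `rep` is greater than 0.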

**LEMMA 5** Given two occurrence sets *U*1, *U*2, if *U*1 conflicts with *U*2, there are two subsets *u*1, *u*2 with a conflict, where *u*1 ⊆ *U*1 and *u*2 ⊆ *U*2.

*Proof*: Assume that no such pair of conflicting subsets exists. Then none of the matching positions of *U*1 and *U*2 conflict. According to Definition 10, *U*1 and *U*2 have no conflict and satisfy the *one-off* condition, which is a contradiction. Lemma 5 is proved.

**LEMMA 6** Given two occurrence sets *U*1 and *U*2, where *U*2 is *Uopt*. If there is a conflict between *U*1 and *U*2, and *N*(*U*1) < *N*(*U*2), then there are two subsets *u*1, *u*2, where *u*1 ⊆ *U*1 and *u*2 ⊆ *U*2, such that *u*1 conflicts with *u*2 and *N*(*u*1) < *N*(*u*2).

*Proof*: By LEMMA 5, there are subsets *u*1, *u*2 with a conflict, where *u*1 ⊆ *U*1 and *u*2 ⊆ *U*2. Let *U*1 = *u*<sub>11</sub>∪*u*<sub>12</sub>∪…∪*u*<sub>1n</sub> and *U*2 = *u*<sub>21</sub>∪*u*<sub>22</sub>∪…∪*u*<sub>2m</sub>, and let *u*<sub>1i</sub>, *u*<sub>2j</sub> be arbitrary subsets with *u*<sub>1i</sub> ⊆ *U*1, *u*<sub>2j</sub> ⊆ *U*2, 1 ≤ *i* ≤ *n*, 1 ≤ *j* ≤ *m*. We consider three cases: ① *u*<sub>1i</sub>, *u*<sub>2j</sub> have no conflict and do not satisfy *N*(*u*<sub>1i</sub>) < *N*(*u*<sub>2j</sub>); then *U*1, *U*2 have no conflict and *N*(*U*1) = *N*(*U*2), which is a contradiction. ② *u*<sub>1i</sub>, *u*<sub>2j</sub> have a conflict and do not satisfy *N*(*u*<sub>1i</sub>) < *N*(*u*<sub>2j</sub>); then *U*1, *U*2 have no conflict, which is a contradiction. ③ *u*<sub>1i</sub>, *u*<sub>2j</sub> have no conflict and satisfy *N*(*u*<sub>1i</sub>) < *N*(*u*<sub>2j</sub>); then *N*(*U*1) = *N*(*U*2), which is a contradiction. So *u*<sub>1i</sub>, *u*<sub>2j</sub> have a conflict and satisfy *N*(*u*<sub>1i</sub>) < *N*(*u*<sub>2j</sub>). Lemma 6 is proved.

**THEOREM 1** Given a text *T* and a pattern *P*, if SAIL is incomplete, *P* must be an R pattern.

*Proof*: Let *USAIL* be the occurrence set of SAIL, *Uopt* the complete set, *NSAIL* the number of matches found by SAIL, and *Nopt* the number of matches in the complete set. Suppose SAIL is incomplete. Then, by LEMMA 4, *NSAIL* < *Nopt* and *USAIL* conflicts with *Uopt*. By LEMMA 6, we get two conflicting subsets *u*1, *u*2 satisfying *N*(*u*1) < *N*(*u*2), where *u*1 ⊆ *USAIL* and *u*2 ⊆ *Uopt*. Without loss of generality, let *N*(*u*1) = 1 and *N*(*u*2) = 2. Set *u*1 = {*S*}, *u*2 = {*A*, *B*}, that is, *S*∈*USAIL*, *A*∈*Uopt*, *B*∈*Uopt*. Let:

*T* = *t*[0], *t*[1] … *t*[*i*] … *t*[*n*-1], where *t*[*i*] stands for the *i*th letter in *T*, *i* = 0, 1, 2, …, *n*-1

*P* = *p*[0], *p*[1] … *p*[*i*] … *p*[*m*-1], where *p*[*i*] stands for the *i*th letter in *P*, *i* = 0, 1, 2, …, *m*-1

*A* = *a*[0], *a*[1] … *a*[*u*] … *a*[*m*-1], where *a*[*i*] stands for the matching position of the *i*th character in occurrence *A*, *i* = 0, 1, 2, …, *m*-1

Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness 321

*b*[*t*+1] - *b*[*t*] < *a*[*t*+1] - *b*[*t*] < *a*[*t*+1] - *a*[*t*] (4)

 *b*[*t*+1] < *a*[*t*+1] and *a*[*t*] < b[*t*] where *u* ≤ *t* ≤ *w-*1 (3)

Assume there is no *t* satisfying the condition, consider *b*[*t*] ≠ *a*[*t*] where *u* ≤ *t* ≤ *w.*Then due to any *t* there is *a*[*t*+1] < *b*[*t*+1] or *a*[*t*] > *b*[*t*] where *u* ≤ *t* ≤ *w*-1.Consider *a*[*u*] < *b*[*u*], there is *a*[*u*+1] < *b*[*u*+1].Due to *a*[*u*+*k*] < *b*[*u*+*k*], we can obtain *a*[*u*+*k+*1] < *b*[*u*+*k+*1] where 0 ≤ *k* ≤ *w*-*u*-1.Then we can induce *a*[*i*] < *b*[*i*] where *u* ≤ *i* ≤ *w.*It contradicts *b*[*w*] < *a*[*w*], so the assume is incorrect

That is *a*[*t*] and b[*t*-1] satisfy the local constraints. In this way, { *bu, bu+*1*…, aw-*1*, aw* }can be considered as {{ *bu, bu+*1*…,bt-*1},{*bt, at+*1},{ *at+*2 *…, aw-*1*, aw* }}. In accordance with definition 4, { *bu, bu+*1*…,bt-*1},{ *at+*2 *…, aw-*1*, aw* } satisfy the local constraints. ∴{ *bu, bu+*1*…, aw-*1*, aw* } satisfy the local constraints. ∴From the above analysis, {*b*0*, b*1*,…,bu,…,aw,…, am-*1} satisfy the local constraints.

However, according to the theorem, the other positions in A,B do not conflict with *USAIL* except for *a*[*u*], *b*[*w*]. That is, {*b*0*, b*1*,…,bu,,aw,…,am-*1} satisfies the *one-off* condition. ∴{*b*0*, b*1*,…,bu, aw,…, am-*1} is another occurrence, and does not conflict with any occurrences in *USAIL*. But *USAIL* does not include this occurrence. It contradicts with LEMMA 3. Thus, condition ② is impossible. And from the analysis of ①, under the condition of the theorem, P must be R

*Proof*: If *gap* = 0, the wildcard is a constant. For example a¢[1,1]c¢[2,2]c is converted into a¢c¢¢c. There won't be any conflict or exist seizing between occurrences. SAIL will perform

**Experiment design**1: ∑= 4, *m* = {5,7,9}, *gap* = [0,3], *rep* = {0,1,2,3,4,6,7,10,11,15, 21,28,35}. In each set of experiments, 20 patterns are randomly generated; the final result is the average. Analysis of experimental results: with increment of *rep*, the curve of approximation ratio gradually decreases, followed by a slight increase. The reason for decline is that *rep* lead to more nested occurrences, resulting in a greater degree of the possibility of losing occurrences; the reason for the increscent is that larger *rep* can cause more extreme pattern. For instance, when ∑ = 4, *m* = 7, *rep* = 21, patterns like *P*1 =

1 When ∑ and *m* are determined, *rep* can only be some certain values, because *rep* has correlation with ∑ and *m*

**THEOREM 2** Given a text *T*, a pattern *P*, if *P* is NR pattern, then SAIL is complete. *Proof*: It is the inverse negation of THEOREM 1. Apparently, THEOREM 2 is true. **THEOREM 3** Given a text *T*, a pattern *P*, if *P* is R pattern, then SAIL is incomplete.

*Proof*: It can be concluded from the analysis and example in section 2.

**THEOREM 4** If the pattern fulfills *gap* = 0, SAIL is complete.

*A* = …*a*[*u*]……*a*[*t*]……………*a*[*t*+1]……..*a*[*w*]… *B* = ……..*b*[*u*] ……*b*[*t*]……*b*[*t*+1]……*b*[*w*]………

pattern. The theorem 1 is proved.

complete.

Due to equation (2) and (3), *a*[*t*] < *b*[t] < *b*[*t*+1] < *a*[*t*+1] where *u* ≤ *t* ≤ *w-*1.

*A* = *a*[0], *a*[1], …, *a*[*u*], …, *a*[*m*-1], an occurrence, where *a*[*i*] stands for the matching position of the *i*-th character in occurrence *A*, *i* = 0, 1, 2, …, *m*-1.

*B* = *b*[0], *b*[1], …, *b*[*w*], …, *b*[*m*-1], another occurrence.

*S* = *s*[0], *s*[1], …, *s*[*i*], …, *s*[*k*], …, *s*[*m*-1], another occurrence.

*P* = *p*[0], *p*[1], …, *p*[*i*], …, *p*[*m*-1], where *p*[*i*] stands for the *i*-th letter in *P*, *i* = 0, 1, 2, …, *m*-1.

Let *a*[*u*] and *b*[*w*] be the positions in *A* and *B* that conflict with *s*[*i*] and *s*[*k*] in *S*, respectively. We assume that the other positions in *A* and *B* do not conflict with the occurrences in *USAIL*.

∴ *a*[*u*] = *s*[*i*], *b*[*w*] = *s*[*k*]

∴ *t*[*a*[*u*]] = *t*[*s*[*i*]], *t*[*b*[*w*]] = *t*[*s*[*k*]]

∵ According to definition 4, *t*[*a*[*u*]] = *p*[*u*], *t*[*s*[*i*]] = *p*[*i*], *t*[*b*[*w*]] = *p*[*w*], *t*[*s*[*k*]] = *p*[*k*]

∴ *p*[*u*] = *p*[*i*], *p*[*w*] = *p*[*k*]

The discussion is divided into the following two cases:

① *u* ≠ *i* or *w* ≠ *k*

② *u* = *i* and *w* = *k*

For ①, if *u* ≠ *i*: ∵ *p*[*u*] = *p*[*i*] ∴ the same letter appears at two different positions in *P*. ∴ According to definition 6, *P* is an R pattern. The case *w* ≠ *k* can be proved similarly.

Next we prove that case ② is impossible, and conclude that *P* is an R pattern.

For ②, we obtain *a*[*u*] = *s*[*i*] = *s*[*u*] and *b*[*w*] = *s*[*k*] = *s*[*w*]. We first show that *u* ≠ *w*. ∵ Assume *u* = *w*; then *u* = *i* = *w* = *k* ∴ *a*[*u*] = *s*[*i*] = *s*[*k*] = *b*[*w*]. Since *A* and *B* belong to the same occurrence set, this contradicts LEMMA 1 ∴ *u* ≠ *w*. Without loss of generality, let *u* < *w*. According to LEMMA 2, ∵ SAIL adopts the *left-most* strategy, and *S* ∈ *USAIL*, *A*, *B* ∉ *USAIL* ∴ *s*[*u*] < *b*[*u*], *s*[*w*] < *a*[*w*]. ∵ *a*[*u*] = *s*[*u*], *b*[*w*] = *s*[*w*]

$$\therefore \ a[u] < b[u], \quad b[w] < a[w] \tag{1}$$

And ∵ *u* < *w*, we obtain *b*[*u*] < *b*[*w*].

*A* = …*a*[*u*]…………..*a*[*w*]…

*B* = ……..*b*[*u*]…*b*[*w*]………

The sequence {*b*[0], *b*[1], …, *b*[*u*], …, *a*[*w*], …, *a*[*m*-1]} can be considered as the three segments {*b*[0], *b*[1], …, *b*[*u*]}, {*b*[*u*], …, *a*[*w*]} and {*a*[*w*], …, *a*[*m*-1]}.

According to definition 4, {*a*[0], *a*[1], …, *a*[*u*], …, *a*[*w*], …, *a*[*m*-1]} and {*b*[0], *b*[1], …, *b*[*u*], …, *b*[*w*], …, *b*[*m*-1]} satisfy the local constraints. So {*a*[*w*], …, *a*[*m*-1]} and {*b*[0], *b*[1], …, *b*[*u*]} satisfy the local constraints.

The middle segment {*b*[*u*], …, *a*[*w*]} can be written as {*b*[*u*], *b*[*u*+1], …, *a*[*w*-1], *a*[*w*]}.

From equation (1), *a*[*u*] < *b*[*u*] and *b*[*w*] < *a*[*w*], and according to definition 4:

$$b[i] < b[i+1], \quad a[i] < a[i+1], \quad \text{where } u \le i \le w-1 \tag{2}$$

There is a *t* satisfying:

$$b[t+1] < a[t+1] \text{ and } a[t] \le b[t], \quad \text{where } u \le t \le w-1 \tag{3}$$

*A* = …*a*[*u*]……*a*[*t*]……………*a*[*t*+1]……..*a*[*w*]…

*B* = ……..*b*[*u*]……*b*[*t*]……*b*[*t*+1]……*b*[*w*]………

Assume that there is no *t* satisfying condition (3), and consider *b*[*t*] ≠ *a*[*t*] for *u* ≤ *t* ≤ *w*. Then for every *t* with *u* ≤ *t* ≤ *w*-1, either *a*[*t*+1] < *b*[*t*+1] or *a*[*t*] > *b*[*t*]. Since *a*[*u*] < *b*[*u*], we have *a*[*u*+1] < *b*[*u*+1]. In general, from *a*[*u*+*j*] < *b*[*u*+*j*] we obtain *a*[*u*+*j*+1] < *b*[*u*+*j*+1], where 0 ≤ *j* ≤ *w*-*u*-1. By induction, *a*[*i*] < *b*[*i*] for *u* ≤ *i* ≤ *w*. This contradicts *b*[*w*] < *a*[*w*], so the assumption is incorrect.

Due to equations (2) and (3), and since *b*[*t*] ≠ *a*[*t*], *a*[*t*] < *b*[*t*] < *b*[*t*+1] < *a*[*t*+1], where *u* ≤ *t* ≤ *w*-1.

$$b[t+1] - b[t] \le a[t+1] - b[t] \le a[t+1] - a[t] \tag{4}$$

Both *b*[*t*+1] - *b*[*t*] and *a*[*t*+1] - *a*[*t*] lie within the length constraints, so *a*[*t*+1] - *b*[*t*] does as well. That is, *b*[*t*] and *a*[*t*+1] satisfy the local constraints. In this way, {*b*[*u*], *b*[*u*+1], …, *a*[*w*-1], *a*[*w*]} can be considered as the three segments {*b*[*u*], *b*[*u*+1], …, *b*[*t*]}, {*b*[*t*], *a*[*t*+1]} and {*a*[*t*+1], …, *a*[*w*-1], *a*[*w*]}. In accordance with definition 4, {*b*[*u*], *b*[*u*+1], …, *b*[*t*]} and {*a*[*t*+1], …, *a*[*w*-1], *a*[*w*]} satisfy the local constraints. ∴ {*b*[*u*], *b*[*u*+1], …, *a*[*w*-1], *a*[*w*]} satisfies the local constraints. ∴ From the above analysis, {*b*[0], *b*[1], …, *b*[*u*], …, *a*[*w*], …, *a*[*m*-1]} satisfies the local constraints.

However, according to the assumption, the positions in *A* and *B* other than *a*[*u*] and *b*[*w*] do not conflict with *USAIL*. That is, {*b*[0], *b*[1], …, *b*[*u*], …, *a*[*w*], …, *a*[*m*-1]} satisfies the *one-off* condition. ∴ {*b*[0], *b*[1], …, *b*[*u*], …, *a*[*w*], …, *a*[*m*-1]} is another occurrence that does not conflict with any occurrence in *USAIL*. But *USAIL* does not include this occurrence, which contradicts LEMMA 3. Thus, case ② is impossible. Together with the analysis of ①, under the condition of the theorem, *P* must be an R pattern. THEOREM 1 is proved.
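The splicing step used in case ② can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `find_crossing`, `splice` and `satisfies_local` are hypothetical, occurrences are represented as lists of text positions, and the local constraint is assumed to bound the number of wildcards between two consecutive matched positions by `[lo, hi]` (definition 4 is not reproduced here).

```python
def find_crossing(a, b, u, w):
    """Return the first t in [u, w-1] with a[t] <= b[t] and b[t+1] < a[t+1].

    This mirrors condition (3); it returns None if no such t exists.
    """
    for t in range(u, w):
        if a[t] <= b[t] and b[t + 1] < a[t + 1]:
            return t
    return None


def splice(a, b, t):
    """Build the hybrid sequence b[0..t] followed by a[t+1..m-1]."""
    return b[: t + 1] + a[t + 1:]


def satisfies_local(occ, lo, hi):
    """Check the assumed local constraint on every pair of consecutive positions."""
    return all(lo <= y - x - 1 <= hi for x, y in zip(occ, occ[1:]))


# Hypothetical positions with a[u] < b[u] and b[w] < a[w], as in equation (1).
a = [0, 4, 8]
b = [2, 5, 7]
u, w = 0, 2
lo, hi = 0, 3

t = find_crossing(a, b, u, w)
hybrid = splice(a, b, t)
```

In this example both `a` and `b` satisfy the assumed constraint, and so does the spliced sequence, which is what inequality (4) guarantees at the junction between `b[t]` and `a[t+1]`.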

**THEOREM 2** Given a text *T* and a pattern *P*, if *P* is an NR pattern, then SAIL is complete.

*Proof*: This is the contrapositive of THEOREM 1, so THEOREM 2 holds.

**THEOREM 3** Given a text *T* and a pattern *P*, if *P* is an R pattern, then SAIL is incomplete.

*Proof*: It follows from the analysis and example in section 2.

**THEOREM 4** If the pattern satisfies *gap* = 0, SAIL is complete.

*Proof*: If *gap* = 0, every wildcard segment has a fixed length. For example, a¢[1,1]c¢[2,2]c is converted into a¢c¢¢c. In this case no conflict or competition for positions can arise between occurrences, so SAIL is complete.
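The conversion in the proof of THEOREM 4 can be sketched as follows. This is a minimal illustration, not part of SAIL: the helper names `expand_fixed_gaps` and `fixed_offsets` are hypothetical, and each length constraint is assumed to be given as a `(lower, upper)` pair.

```python
WILDCARD = "¢"


def expand_fixed_gaps(letters, gaps):
    """Expand a pattern whose constraints all have lower == upper (gap = 0)."""
    if len(gaps) != len(letters) - 1:
        raise ValueError("need exactly one constraint between consecutive letters")
    parts = [letters[0]]
    for (lo, hi), ch in zip(gaps, letters[1:]):
        if lo != hi:
            raise ValueError("gap = 0 requires lower == upper")
        parts.append(WILDCARD * lo + ch)
    return "".join(parts)


def fixed_offsets(gaps):
    """Offsets of the pattern letters relative to the first matched position."""
    offsets = [0]
    for lo, _ in gaps:
        offsets.append(offsets[-1] + lo + 1)
    return offsets


template = expand_fixed_gaps("acc", [(1, 1), (2, 2)])
offsets = fixed_offsets([(1, 1), (2, 2)])
```

Because the offsets are fixed, an occurrence is fully determined by its starting position, which is the intuition behind the proof.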

**Experiment design**<sup>1</sup>: |Σ| = 4, *m* ∈ {5, 7, 9}, *gap* ∈ [0, 3], *rep* ∈ {0, 1, 2, 3, 4, 6, 7, 10, 11, 15, 21, 28, 35}. In each set of experiments, 20 patterns are randomly generated, and the final result is their average.

Analysis of experimental results: as *rep* increases, the approximation ratio curve gradually decreases and then rises slightly. The decline occurs because a larger *rep* leads to more nested occurrences, which increases the possibility of losing occurrences. The subsequent rise occurs because a very large *rep* produces more extreme patterns. For instance, when |Σ| = 4, *m* = 7 and *rep* = 21, patterns such as *P*1 = a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a¢[0,3]a are produced, for which it is difficult to find a text containing nested occurrences. For *P*1, a text such as "aaaaaaaaaaaaaa" contains nested occurrences of this pattern, but such an extreme text is very rare. Therefore, because nested occurrences are not easily formed, the approximation ratio increases slightly.

<sup>1</sup> When |Σ| and *m* are fixed, *rep* can take only certain values, because *rep* is correlated with |Σ| and *m*.

**5. Conclusions**

As an extension of the traditional matching problem, the PMWL problem has attracted increasing attention because of its flexibility and complexity. Based on the problem definition and drawing on research ideas from the traditional matching problem, this article introduces SAIL, RSAIL, SBO and BPBM, which are representative algorithms for PMWL, in three important respects: the data structures, the matching strategies and the characteristics of the pattern. The article also analyzes the pros and cons of these algorithms in terms of solution quality and time complexity, and gives experimental matching results on real DNA data. Among them, SAIL is the first algorithm proposed for solving the PMWL problem; it uses the sliding window structure and the representative *left-most* matching strategy. This article finds that for short patterns the approximation ratio of SAIL is higher than 0.9, while for longer patterns the occurrences obtained by SAIL are of poor quality. The quality of the occurrences obtained by SBO is the best, but its time consumption has a non-linear relationship with the length of the text. BPBM uses bit-parallel technology to improve matching efficiency greatly, but it is affected by the machine word. For patterns with repeated letters in the tail, RSAIL uses symmetry to improve the quality of occurrences under certain conditions, thus providing an idea for solving the PMWL problem, but for longer patterns and wider gaps the improvement is not obvious.

Afterwards, this article focuses on the relationship between the approximation ratio ε and the alphabet size |Σ|, the pattern length *m*, the wildcard span *gap* and the repeatability *rep*. Firstly, this article proposes the model ε = F (Σ, *m*, *gap*), which approximately describes the functional relationship between pattern characteristics and the approximation ratio; secondly, this article proves PMWL's completeness under the condition *rep* = 0; finally, the relationship between the pattern features is also analyzed and, in addition, the relationship E(*rep*) = *C*<sub>*m*</sub><sup>2</sup> / |Σ| is proposed.

In future work, the formal description of the PMWL problem will be considered in order to better explain the complexity of the problem, thus helping algorithm design and the analysis of problem complexity.

**Author details** 

Haiping Wang, Taining Xiang and Xuegang Hu

*Hefei University of Technology, China* 

**Figure 7.** The relation between *rep* and approximation ratio ε

Next we analyze the relationship between the repeatability *rep*, the alphabet size |Σ| and the pattern length *m*. The original problem is: for a pattern of length *m* over an alphabet of size |Σ|, what is the expectation of repeatability E(*rep*)?

This description is equivalent to the model of 'taking balls from a bag' in combinatorial mathematics:

There is a bag of balls in |Σ| colors. If *m* balls are taken from the bag with replacement, how many pairs of balls of the same color are there among the fetched balls?

| Σ \ *m* | 3 | 4 | 5 | 6 | …… | *m* |
|---|---|---|---|---|---|---|
| 3 | 3/3 | 6/3 | 10/3 | 15/3 | …… | *C*<sub>*m*</sub><sup>2</sup>/3 |
| 4 | 3/4 | 6/4 | 10/4 | 15/4 | …… | *C*<sub>*m*</sub><sup>2</sup>/4 |
| 5 | 3/5 | 6/5 | 10/5 | 15/5 | …… | *C*<sub>*m*</sub><sup>2</sup>/5 |
| 6 | 3/6 | 6/6 | 10/6 | 15/6 | …… | *C*<sub>*m*</sub><sup>2</sup>/6 |
| …… | …… | …… | …… | …… | …… | …… |
| Σ | 3/\|Σ\| | 6/\|Σ\| | 10/\|Σ\| | 15/\|Σ\| | …… | *C*<sub>*m*</sub><sup>2</sup>/\|Σ\| |

**Table 11.** The relationship between ∑, *m* and *rep*

Each of the *C*<sub>*m*</sub><sup>2</sup> pairs of balls has the same color with probability 1/|Σ|, assuming that every color is equally likely. Finally, we can deduce:

$$E(rep) = \frac{C_m^2}{|\Sigma|} \tag{2}$$
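The formula for E(*rep*) can be checked on small cases with the sketch below. It assumes, following the ball model, that *rep* counts the pairs of positions holding the same letter and that all patterns of length *m* are equally likely; the function names are hypothetical and the exhaustive enumeration is only practical for small |Σ| and *m*.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def rep(pattern):
    """Number of position pairs in the pattern that hold the same letter."""
    return sum(1 for x, y in combinations(pattern, 2) if x == y)


def expected_rep_formula(sigma_size, m):
    """E(rep) = C(m, 2) / |Sigma|, as an exact fraction."""
    return Fraction(comb(m, 2), sigma_size)


def expected_rep_exhaustive(alphabet, m):
    """Average rep over all |Sigma|^m equally likely patterns of length m."""
    total = sum(rep(p) for p in product(alphabet, repeat=m))
    return Fraction(total, len(alphabet) ** m)


dna = "acgt"
by_formula = expected_rep_formula(len(dna), 5)
by_enumeration = expected_rep_exhaustive(dna, 5)
```

For the DNA alphabet and *m* = 5, both values equal 10/4, which matches the corresponding entry of Table 11.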
