**3.2. The RSAIL Algorithm**

Description of RSAIL Algorithm (Wang et al., 2010):

**Definition 6** Given a pattern *P*, if there are letters *p*[*i*] = *p*[*j*] where 0 ≤ *i* ≤ *m*-1, 0 ≤ *j* ≤ *m*-1 and *i* ≠ *j*, *P* is called a **pattern with Recurring characters**, or **R pattern** for short, such as a¢[0,1]c¢[0,1]c¢[0,1]t.

Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness 309

**Definition 7** Given a pattern *P*, if all the letters in *P* are different, *P* is called a **pattern with No-Recurring characters**, or **NR pattern** for brevity, such as a¢[0,1]c¢[0,1]g¢[0,1]t.

**Definition 8** Given a pattern *P*, if there is a position *i* such that *p*[*i*] = *p*[*i*+1] = … = *p*[*m*-1] where 1 ≤ *i* < *m*-1, *P* is called a **pattern with Recurring Tail characters**, or **RT pattern** for short, such as a¢[0,1]c¢[0,1]c. As we can see, the RT pattern is a special form of the R pattern.

From the above discussion, because Chen et al. consider only the on-line situation, their proof of SAIL's completeness is incomplete: it holds only for the on-line situation. Moreover, it ignores the interaction between different occurrences.

We find that SAIL satisfies completeness under a certain restriction, namely when the pattern has no recurring characters (NR pattern), such as a¢[0, 1]t¢[0, 1]g¢[0, 1]c¢[0, 1]. The concept of the NR pattern has practical significance: for example, in text mining, where the text is a sequence of words, an NR pattern reflects the semantic relation between words.

We utilize the symmetry between the text and the pattern to convert an RT pattern into an R pattern. The algorithm proceeds as follows:

1. According to the characteristics of *P*, if it is not an RT pattern, call SAIL directly; otherwise go to (2);
2. Reverse *T* and *P* to get *T'* and *P'*, respectively;
3. Call SAIL to obtain the occurrences of *P'* in *T'*;
4. Obtain the occurrences of *P* in *T* by coordinate transformation of the obtained solution.

Obviously, since identifying the pattern's characteristics takes linear time, O(RSAIL) = O(SAIL).

Experiments and Analysis:

We will give a set of experiments to illustrate two problems:

1. Analysis of the complete extent of the SAIL algorithm;
2. The comparison of the complete extent of the RSAIL and SAIL algorithms.

Because no known algorithm obtains the complete set of occurrences for the PMWL problem in polynomial time, we developed a text generator (Xie et al., 2010) to produce the experimental text. The complete set of occurrences is then known in advance, which lets us measure how complete an algorithm's output is. In addition, the patterns used in these experiments are all RT patterns.

| | Experiment 1 | Experiment 2 | Experiment 3 | Experiment 4 |
|---|---|---|---|---|
| Size of alphabet Σ | 4 | 4 | 4 | 7 |
| Length of pattern *m* | 3 | 4 | 5 | 5 |
| *gap* | 0~30 | 0~30 | 0~30 | 0~30 |

**Table 5.** The parameters in the experiments.

The experimental results and analysis:

**Figure 3.** The approximation ratio experimental results of RSAIL and SAIL

From the above graphs, SAIL itself is already a near-complete algorithm: its average approximation ratio is higher than 0.94. For RT patterns, the completeness of RSAIL is better than that of SAIL for different alphabet sizes, *m* and *gap*. Thus, not only is a revised algorithm obtained, but SAIL's deficiency in handling RT patterns is also shown from another aspect.
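The four RSAIL steps described in this section (test for an RT pattern, reverse *T* and *P*, call the search, transform coordinates) can be sketched as follows. This is only a minimal illustration under stated assumptions: a pattern is represented as a string `chars` plus a list `gaps` of (min, max) wildcard counts between consecutive letters, positions are 0-based, and `greedy_one_off_search` is a hypothetical left-most, one-off stand-in used in place of SAIL, not the published SAIL algorithm.

```python
from typing import Callable, List, Optional, Tuple

Gap = Tuple[int, int]  # (min, max) number of wildcards between two letters


def is_rt_pattern(chars: str) -> bool:
    """Definition 8: p[i] = ... = p[m-1] for some i with 1 <= i < m-1.

    The weakest case is i = m-2, so it suffices to compare the last two letters.
    """
    m = len(chars)
    return m >= 3 and chars[m - 2] == chars[m - 1]


def greedy_one_off_search(text: str, chars: str, gaps: List[Gap]) -> List[List[int]]:
    """Hypothetical stand-in for SAIL (assumption, NOT the published algorithm).

    Scans start positions left to right, takes the left-most occurrence for
    each start, and never reuses a text position (one-off condition).
    """
    if not chars:
        return []
    used = [False] * len(text)

    def extend(k: int, pos: int) -> Optional[List[int]]:
        # p[k] sits at pos; place p[k+1:] at the left-most free positions.
        if k == len(chars) - 1:
            return [pos]
        lo, hi = gaps[k]
        last = min(pos + 1 + hi, len(text) - 1)
        for nxt in range(pos + 1 + lo, last + 1):
            if not used[nxt] and text[nxt] == chars[k + 1]:
                rest = extend(k + 1, nxt)
                if rest is not None:
                    return [pos] + rest
        return None

    occurrences: List[List[int]] = []
    for start in range(len(text)):
        if not used[start] and text[start] == chars[0]:
            occ = extend(0, start)
            if occ is not None:
                for q in occ:
                    used[q] = True
                occurrences.append(occ)
    return occurrences


def rsail(
    text: str,
    chars: str,
    gaps: List[Gap],
    search: Callable[[str, str, List[Gap]], List[List[int]]] = greedy_one_off_search,
) -> List[List[int]]:
    # Step 1: a pattern that is not an RT pattern goes straight to the search.
    if not is_rt_pattern(chars):
        return search(text, chars, gaps)
    # Step 2: reverse the text and the pattern (letters and gap list).
    n = len(text)
    # Step 3: search for the reversed pattern in the reversed text.
    reversed_occurrences = search(text[::-1], chars[::-1], gaps[::-1])
    # Step 4: position q in the reversed text is position n-1-q in the text.
    return sorted([n - 1 - q for q in reversed(occ)] for occ in reversed_occurrences)
```

For example, `rsail("acaccc", "acc", [(0, 1), (0, 1)])` treats the pattern a¢[0,1]c¢[0,1]c as an RT pattern and searches the reversed text. Because the classification and the reversals are linear in the input size, the wrapper adds no asymptotic cost to the underlying search.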

**3.3. The BPBM Algorithm**

Description of BPBM Algorithm (Guo et al., 2011):

Like the SAIL algorithm, the BPBM algorithm also focuses on pattern matching in on-line sequential text under both flexible *gap* constraints specified by the user and the *one-off* condition. BPBM uses bit-parallel technology to simulate the matching process and adopts two nondeterministic finite automata (NFAs). One is a search mechanism that identifies all suffixes of the pattern *P*; the other is a security window transition mechanism that accelerates scanning by dropping useless sequences in the text.
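To illustrate what simulating an NFA with bit-parallel operations means, the sketch below uses the classic Shift-And technique for exact matching. It is a swapped-in, simpler technique, not BPBM itself: it handles no wildcards or length constraints, has no security window, and scans forward rather than skipping characters. The function name `shift_and_search` is an assumption made for this example.

```python
from typing import Dict, List


def shift_and_search(text: str, pattern: str) -> List[int]:
    """Return the end positions of exact matches of pattern in text.

    Bit i of `state` is set when pattern[0..i] matches a suffix of the text
    read so far, so one integer holds all active NFA states at once.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c.
    masks: Dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit of the final NFA state
    state = 0
    ends: List[int] = []
    for j, c in enumerate(text):
        # Advance every active state by one letter and restart at state 0,
        # then keep only the states whose pattern letter equals c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends
```

Each text character costs one shift, one OR and one AND regardless of how many NFA states are active, which is the source of the speed-up that bit-parallel methods rely on.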

BPBM has the following characteristics:

1. BPBM also uses the *left-most* strategy to obtain the maximal occurrences of the pattern in the text, and returns all of these matching position sequences. The algorithm combines bit-parallel technology with nondeterministic finite automata. It also simplifies the calculation of the shift distance for the security window transition, which gives good results. BPBM inherits from the BM algorithm the ability to skip some characters in the text, which gives the algorithm a sub-linear average time complexity. Therefore, the time complexity of BPBM is lower than that of SAIL.


information needs more space to store, so the space complexity of SBO is *O*(*gap*\**m*\**n*); the next one is that more information means more calculations and thus more time.

**3.5. Other algorithms**

In the literature, many similar problems have been defined and various algorithms have been proposed to solve them. Morgante et al. (2004) described a structured model, which can be considered as 'compound patterns' made of a list of simple motifs and a list of intervals that specify at what distances adjacent motifs should occur. They gave a detailed description of the biological background of the problem definition. For example, many retrotransposons belonging to the Ty1-copia group contain a match of MT¢[115,136]MTNTAYGG¢[121,151]GTNGAYGAY, which consists of three patterns and two intervals. As the paper pointed out, structured motifs are called Classes of Characters and Bounded Gaps (CBG) expressions by Navarro and Raffinot, but the use of these expressions is quite different: the underlying motivation for CBG expressions is searching in databases like PROSITE, where a sequence of this kind is usually not very long, while structured motifs can be very long since gaps may span many letters. As we can see, both CBG expressions and structured motifs have practical meaning. Because of the different application backgrounds, different algorithms were designed to solve the respective problems. From the application point of view, this paper also considered a problem of q-approximation match, which means finding only partial motifs in the sequence. The paper proposed a two-step procedure that is used in many algorithms for PMWL: first, find the occurrences of all the component patterns; second, combine the occurrences that satisfy the distance constraints into a structured motif. For step two, they gave a detailed algorithm that builds a directed acyclic graph according to the positions of the component patterns and the interval constraints. Then they discussed in detail how to output all the occurrences.

In (Rahman et al., 2006), the problem definition is similar to that of SAIL, but global constraints and *one-off* searching are not considered. In addition, as in the paper (Chen et al., 2006), the local constraints exist between two substrings, while in SAIL they exist between any two consecutive letters. Certainly, a single character is a substring, but in this paper all these substrings are used to build an AC automaton, and it is not efficient to build a trie structure over a set of single letters. This paper also used a two-step procedure: first, use the AC automaton to get the occurrences of each sub-pattern in order; second, combine them. They built an implicit graph in which the vertices are partitioned into several ordered sets according to the corresponding sub-pattern, and an edge between two consecutive sets means that the two positions in these sets satisfy the corresponding local constraints. To output all occurrences of *P* in *T*, we have to enumerate all paths in the implicit directed graph whose length is the number of sub-patterns in the pattern. Morgante et al. (2004) applied a revised depth-first search algorithm.

Bille et al. (2010) defined a concept named variable length *gap* (VLG) pattern, which is a pattern formed by a sequence of strings and variable length gaps. Obviously, this definition is almost the same as in the above works. Unlike Rahman's work, although this paper also applies an AC automaton, it maintains a sorted list containing the ranges defined by previously reported relevant occurrences, and naturally it

