**3.1. The SAIL Algorithm**

304 Bioinformatics


**3. Algorithms for PMWL** 

The results of traditional matching algorithms are complete, so research has focused on improving matching efficiency. As a kind of searching problem, the key to solving the matching problem is how to extract and use the information obtained from the text and the pattern. The KMP and BM algorithms use automata to describe the characteristics of the pattern, and store in the automata the information obtained by scanning during the matching process. The algorithm consults the automata whenever a jump distance needs to be calculated, which avoids obtaining the pattern information repeatedly and ensures that jumps in the matching process do not affect the final result. The basic idea of the suffix tree is to use a tree structure to describe the text information, so that the same text is not scanned repeatedly when matching a set of patterns. We believe that the data structure and the search strategy are crucial to how traditional algorithms access the information in the text and the pattern. A reasonable data structure can better exploit the potential of the computer, as bit-parallel techniques do, and can also represent the sequence information more reasonably, as automata do. In addition, there are the sliding window, indexes and other data structures. A reasonable matching strategy makes better use of the sequence information. These strategies can roughly be divided into prefix searching, suffix searching and factor searching (Navarro & Raffinot, 2001).

| Methods | Characteristics | Representative algorithms |
|---|---|---|
| Prefix searching | Forward search to find the longest common prefix of the text and the pattern string in the searching window | KMP: deterministic automata |
| | | Shift-And: bit parallel, non-deterministic automata |
| Suffix searching | Backward search to find the longest common suffix of the text and the pattern string; can skip some text characters; the difficulty is how to safely move the window | BM: pre-calculation of the three functions used to determine the safe jumping distance |
| | | Horspool: improves the function of BM, can have a greater jump distance, especially suitable for larger alphabets |
| Factor searching | Backward search to determine whether the suffix of the text in the searching window is a substring of the pattern string | BDM: suffix automaton |
| | | BNDM: bit parallel |
| | | BOM: Factor Oracle automaton, suitable for longer patterns |

Remarks: Most of them are sliding window techniques; the scope of application of each algorithm depends on the alphabet size and the pattern length.

**Table 1.** Analysis of the traditional pattern matching algorithms
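As an illustration of the suffix-searching strategy, the following is a minimal sketch of a Horspool-style search over a sliding window. It is not taken from the cited works; the function name `horspool_search` and the dictionary-based shift table are choices made here for brevity.

```python
def horspool_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: distance from the last occurrence of each character
    # (excluding the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0  # left end of the sliding window
    while pos <= n - m:
        # Compare the window with the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Jump by the shift of the text character aligned with the window's end;
        # characters absent from the table allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

The jump distance depends only on the last text character in the window, which is why larger alphabets tend to give longer jumps.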

Description of SAIL Algorithm (Chen et al., 2006):

*Input*: A text $T = t_0 t_1 \dots t_{n-1}$, a pattern $P = p_0 p_1 \dots p_{m-1}$, local constraints $g_i = g(N_i, M_i)$, global constraints [*minLen*, *maxLen*].

*Output*: Occurrences of *P* in *T* satisfying the constraints.
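To make the input and output concrete, the sketch below checks whether a list of text positions is an occurrence of the pattern. It rests on two assumptions that should be checked against Chen et al. (2006): that *g*(*Ni*, *Mi*) bounds the number of wildcards between two consecutive pattern characters, and that the global constraint bounds the span from the first to the last matched position. The function name `is_occurrence` and its argument layout are illustrative only.

```python
def is_occurrence(text, pattern, gaps, min_len, max_len, positions):
    """Check one candidate occurrence against the local and global constraints.

    gaps[j] = (N_j, M_j) is assumed to bound the number of wildcards
    between pattern[j] and pattern[j + 1].
    """
    if not pattern or len(positions) != len(pattern):
        return False
    if any(not 0 <= q < len(text) for q in positions):
        return False
    # Every pattern character must match the text at its position.
    if any(text[q] != c for q, c in zip(positions, pattern)):
        return False
    # Local constraints: wildcards skipped between consecutive positions.
    for (n_min, n_max), a, b in zip(gaps, positions, positions[1:]):
        if not n_min <= b - a - 1 <= n_max:
            return False
    # Global constraint: length of the matched substring.
    span = positions[-1] - positions[0] + 1
    return min_len <= span <= max_len
```

With the text "ttaaggcccc", the pattern characters "agcc", gaps of [0, 1] and a length range of [6, 7], which are the values of the running example in this section, the positions [2, 4, 6, 7] pass the check, while [3, 4, 6, 7] fail because the span is only 5.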

The Steps of the algorithm:


Generally, SAIL scans *T* from the beginning to find a position *i* where *t*[*i*] = *p*[*m*-1]. It then carries out two phases, the *Forward* phase and the *Backward* phase. In the *Forward* phase, SAIL uses a search table to determine whether there is a potential matching occurrence. If one is found, the *Backward* phase is triggered to output an optimal occurrence by using the *left-most* strategy.
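The scan and the two phases can be sketched as follows. This is a simplified reading of the description above, not the implementation of Chen et al. (2006): the search table is replaced by one set of candidate positions per pattern character, the gap semantics (number of wildcards between consecutive characters) is an assumption, and the name `sail_sketch` and the `one_off` flag are hypothetical.

```python
def sail_sketch(text, pattern, gaps, min_len, max_len, one_off=True):
    """Return occurrences (lists of text positions) found by a SAIL-like scan."""
    m = len(pattern)
    used = [False] * len(text)
    occurrences = []

    def free(q):
        # Under the one-off condition a text position is used at most once.
        return not (one_off and used[q])

    for end in range(len(text)):
        if text[end] != pattern[-1] or not free(end):
            continue
        # Location: the global constraint bounds where pattern[0] may lie.
        lo = max(0, end - max_len + 1)
        hi = end - min_len + 1
        # Forward: rows[j] holds positions where pattern[j] can be placed.
        rows = [{q for q in range(lo, hi + 1)
                 if text[q] == pattern[0] and free(q)}]
        for j in range(1, m):
            n_min, n_max = gaps[j - 1]
            row = set()
            for p in rows[-1]:
                for q in range(p + n_min + 1, min(p + n_max + 1, end) + 1):
                    if text[q] == pattern[j] and free(q):
                        row.add(q)
            rows.append(row)
        if end not in rows[-1]:
            continue  # Forward reports no potential occurrence ending here.
        # Backward: walk from the end, taking the left-most compatible position.
        occ = [end]
        for j in range(m - 2, -1, -1):
            n_min, n_max = gaps[j]
            nxt = occ[0]
            occ.insert(0, min(q for q in rows[j]
                              if n_min <= nxt - q - 1 <= n_max))
        for q in occ:
            used[q] = True
        occurrences.append(occ)
    return occurrences
```

Because every position kept in a row is reachable from the previous row, the backward walk always finds a compatible position once the Forward phase has succeeded.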


Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness 307




A running example for SAIL:

In this subsection, we show how SAIL works with a running example where *P*, *T* and constraints are given as follows.

| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| *T* | t | t | a | a | g | g | c | c | c | c |

*P* = a¢[0,1]g¢[0,1]c¢[0,1]c, *minLen* = 6, *maxLen* = 7

**Table 2.** A running example for SAIL

**Step 1.** Scan *T* from left to right for *P*[*m*-1], that is, the letter 'c'. The first matching position is 6, and SAIL enters the *Location* phase. Use the global constraint [6, 7] to locate the position of *P*[0], that is, the letter 'a'. The scanning range is [6-6, 7-6] = [0, 1]. However, there is no match in [0, 1], so SAIL moves on.

**Step 2.** The second matching position is 7, and we can locate *P*[0] at position 2. Then we get the substring "aaggcc" from *T*. In this way the global constraint is satisfied.

**Step 3.** Build a 4×6 table. The rows stand for the characters in *P*, and the columns stand for the positions of the substring. Then set the positions pos[3][5], pos[0][0] and pos[0][1] to 1.

**Step 4.** Enter *Forward* and set to 1 all the positions in the table that satisfy the local constraints.

**Step 5.** Enter *Backward* and select the left-most one from the marked positions in each row; the selected positions are highlighted. In this way, we get an occurrence {2, 4, 6, 7} and mark the four positions as used.

**Table 3.** The constructed search table pos[*j*][*i*-start] when *P*[*m*-1] is 7 (rows: a, g, c, c)

**Step 6.** Continue with *Location*; the third matching position is 8, and we can build the table below. Notice that positions 6 and 7 have been used. Under the *one-off* condition, all used positions of *T* (marked as \* in the table) are never considered for further matching. If the *one-off* condition is not considered, SAIL will get another two occurrences, {2, 4, 6, 8} and {3, 5, 7, 9}. The *Forward* phase therefore returns false, and SAIL goes back to *Location*.

**Table 4.** The constructed search table pos[*j*][*i*-start] when *P*[*m*-1] is 8 (rows: a, g, c, c)

**Step 7.** At position 9, *Forward* also returns false. Finally, SAIL outputs only one occurrence, {2, 4, 6, 7}.

The time complexity and completeness analysis**:**

The time complexity of SAIL is O(*n* + *klmg*), where *n* is the length of *T*, *k* is the number of times *P*'s last letter occurs in *T*, *l* is the user-specified maximum length for each matching substring, *m* is the length of *P*, and *g* is the maximum *gap* of wildcards in *P*.
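As a rough illustration only, the expression inside the bound can be evaluated for the running example, taking *n* = 10, *k* = 4 (the letter 'c' occurs four times in *T*), *m* = 4, *g* = 1 and, as an assumption about how *l* is instantiated, *l* = *maxLen* = 7. Constant factors hidden by the O-notation are ignored.

```latex
n + k\,l\,m\,g \;=\; 10 + 4 \cdot 7 \cdot 4 \cdot 1 \;=\; 122
```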

Two important issues, *online searching* and *optimization*, are taken into consideration in the design of SAIL. As for optimization, under the *one-off* condition, SAIL applies the *left-most* strategy to decide which occurrence is optimal when multiple occurrences end at the same position of *P*[*m*-1]. As a heuristic algorithm, SAIL uses a greedy strategy to select a set of occurrences; consequently, it may obtain a locally optimal solution, which leads to losing occurrences in offline searching.

From the above example, we can see that a complete occurrence set for text *T* is {{2, 4, 6, 8}, {3, 5, 7, 9}}, but SAIL's output is {{2, 4, 6, 7}}.
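The complete occurrence set quoted above can be checked by exhaustive search on this small example. The sketch below enumerates every occurrence that satisfies the constraints and then looks for the largest subset in which no text position is shared. It is a brute-force check suitable only for tiny inputs, not part of SAIL; the function names are hypothetical, a non-empty pattern is assumed, and gaps are again taken to count the wildcards between consecutive pattern characters.

```python
from itertools import combinations


def all_occurrences(text, pattern, gaps, min_len, max_len):
    """Enumerate every occurrence satisfying the local and global constraints."""
    if not pattern:
        return []
    found = []

    def extend(partial):
        j = len(partial)
        if j == len(pattern):
            if min_len <= partial[-1] - partial[0] + 1 <= max_len:
                found.append(tuple(partial))
            return
        if j == 0:
            candidates = range(len(text))
        else:
            n_min, n_max = gaps[j - 1]
            last = partial[-1]
            candidates = range(last + n_min + 1,
                               min(last + n_max + 1, len(text) - 1) + 1)
        for q in candidates:
            if text[q] == pattern[j]:
                extend(partial + [q])

    extend([])
    return found


def largest_one_off_set(occurrences):
    """Return a largest subset of occurrences that share no text position."""
    for size in range(len(occurrences), 0, -1):
        for subset in combinations(occurrences, size):
            positions = [q for occ in subset for q in occ]
            if len(positions) == len(set(positions)):
                return list(subset)
    return []
```

For the running example, six occurrences satisfy the constraints, and the only pair that shares no position is (2, 4, 6, 8) with (3, 5, 7, 9); any set containing (2, 4, 6, 7) cannot be extended, which is the situation SAIL ends up in.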

We believe that, because SAIL's data structure is based on the sliding window and the *Location* phase also uses the *left-most* strategy, it is possible to lose occurrences, for instance:

**Figure 2.** A sliding window in SAIL

In the above example, SAIL loses occurrences in the offline condition because of the matching position it selects for character *c*. On further observation, it is not difficult to find that character *c* appears twice in the pattern. If the pattern is a¢[0,1]g¢[0,1]c, SAIL will get a complete occurrence set, that is, {{2, 4, 6}, {3, 5, 7}}. Further experiments show that recurring pattern characters influence the quality of the matching occurrences obtained by the algorithm. In the next part, we will analyze the completeness of PMWL based on pattern features.
