### **2. Methodology**

Our method observes a DNA sequence through a sliding "counter" of width W [15]. A set of q-grams, called bins, is placed in the counter. Because the DNA alphabet has only four letters, {A, C, G, T}, the number of possible q-grams in a DNA sequence is 4<sup>q</sup>.

#### *Definition 1. q-gram of Sequence.*

*Given a sequence 'seq', its q-grams are formed when a window of length q slides over the characters of 'seq'. A sequence 'seq' has* |*seq*| − (*q* − 1) *q-grams.*
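As a minimal sketch of Definition 1, the sliding window can be written in a few lines of Python. The function name `qgrams` is illustrative and does not come from the original method description.

```python
def qgrams(seq, q):
    """Return the overlapping q-grams of seq, in order of position."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    # A window of length q fits at |seq| - (q - 1) starting positions.
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]


grams = qgrams("AACTCG", 2)
print(grams)       # ['AA', 'AC', 'CT', 'TC', 'CG']
print(len(grams))  # 5, i.e. 6 - (2 - 1)
```

A sequence shorter than q yields an empty list, which matches the count formula giving zero or fewer positions.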

The number of all possible q-grams, called "bins", is 4<sup>q</sup>. Bins are arranged in lexicographic order, and *b*<sub>i</sub> denotes the *i*th bin in this order. The set of all possible bins is denoted as:

$$B_{q} = \{b_1,\ b_2,\ \dots,\ b_{4^{q}}\} \tag{1}$$

**Example 1.** The one-gram bins are *B*<sub>1</sub> = {A, C, G, T}, consisting of 4 bins. The two-gram bins are *B*<sub>2</sub> = {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}, consisting of 16 bins.
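The lexicographic enumeration of bins can be sketched in Python as follows. This is an illustration only: `build_bins` and `ALPHABET` are assumed names, and the standard-library `itertools.product` stands in for the nested loops of the bin construction algorithm.

```python
from itertools import product

ALPHABET = "ACGT"  # already in lexicographic order


def build_bins(q):
    """Return all 4**q q-gram bins in lexicographic order."""
    # product() varies the last position fastest, so a sorted alphabet
    # yields the bins in lexicographic order.
    return ["".join(letters) for letters in product(ALPHABET, repeat=q)]


bins2 = build_bins(2)
print(len(bins2))  # 16
print(bins2[:5])   # ['AA', 'AC', 'AG', 'AT', 'CA']
```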

#### *Definition 2. Bin Signature.*

For a sequence, the q-gram bin signature S<sub>j</sub> is a mapping associated with the bin b<sub>j</sub> (b<sub>j</sub> ∈ B<sub>q</sub>), where the ith bit of S<sub>j</sub> indicates the presence or absence of b<sub>j</sub> at position i of the sequence. For a sequence 'seq', there are |seq| − (|b<sub>j</sub>| − 1) bits in S<sub>j</sub>.

**Example 2.** Consider the sequence S = "AACTCG". Its two-grams (q = 2) are AA, AC, CT, TC and CG, so the signature of the second two-gram bin, b<sub>2</sub> = AC, is S<sub>2</sub> = [0 1 0 0 0].
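A hedged Python sketch of the bin signature, following Definition 2, is given below. The function name `bin_signature` is illustrative; the bin is passed explicitly as a string.

```python
def bin_signature(seq, b):
    """Return the 0/1 signature of bin b along seq."""
    m, nbin = len(seq), len(b)
    # Bit i is 1 when the q-gram starting at position i equals the bin.
    return [1 if seq[i:i + nbin] == b else 0 for i in range(m - nbin + 1)]


print(bin_signature("AACTCG", "AC"))  # [0, 1, 0, 0, 0]
print(bin_signature("AACTCG", "A"))   # [1, 1, 0, 0, 0, 0]
```

The second call shows the one-gram case, where the signature has one bit per nucleotide.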

#### *Definition 3. Filter.*

A sequence x[n] is filtered by mapping it into an output sequence y[n] through a weighted window b, by means of the convolution sum

$$y[n] = \sum_{i=0}^{k} b_i\, x[n-i] \tag{2}$$

The window b is independent of x[n] and y[n], and n is the time index. y[n] is the response of the filter to the input signal x[n]. The filter is a finite impulse response (FIR) digital filter: "digital" because it operates on discrete-time signals, and "finite impulse response" because the output is computed as a weighted sum of a finite number of past and present input values (**Figure 1**).

**Example 3.** Let S<sub>A</sub> = [1 1 0 0 0 0] be the one-gram signature of nucleotide A in the sequence of Example 2. Its filter output with the weighted window β = [0.2 0.1 0.3 0.4] is obtained as follows:

*Entropy Based Biological Sequence Study DOI: http://dx.doi.org/10.5772/intechopen.96615*

**Figure 1.** *Block diagram of finite impulse response (FIR) digital filter.*

$$y_A[n] = \sum_{k=0}^{3} \beta_k S_A[n-k], \quad \text{with } \beta_0 = 0.2,\ \beta_1 = 0.1,\ \beta_2 = 0.3,\ \beta_3 = 0.4.$$

$$\Rightarrow y_A[n] = \beta_0 S_A[n] + \beta_1 S_A[n-1] + \beta_2 S_A[n-2] + \beta_3 S_A[n-3]$$

*y<sub>A</sub>* = [0.2 0.3 0.4 0.7 0.4 0.0]. Similarly, for the other nucleotides C, G and T, the outputs are

*y<sub>C</sub>* = [0.0 0.0 0.2 0.1 0.5 0.5]; *y<sub>G</sub>* = [0.0 0.0 0.0 0.0 0.0 0.2]; *y<sub>T</sub>* = [0.0 0.0 0.0 0.2 0.1 0.3].
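The convolution sum of Eq. (2) can be checked numerically with a short Python sketch. The function name `fir_filter` is illustrative, and input values before the start of the signal are assumed to be zero.

```python
def fir_filter(x, b):
    """Compute y[n] = sum_i b[i] * x[n - i], taking x[n - i] = 0 for n - i < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, weight in enumerate(b):
            if n - i >= 0:  # only past and present samples contribute
                acc += weight * x[n - i]
        y.append(acc)
    return y


beta = [0.2, 0.1, 0.3, 0.4]
s_a = [1, 1, 0, 0, 0, 0]  # one-gram signature of A in "AACTCG"
y_a = fir_filter(s_a, beta)
print([round(v, 2) for v in y_a])  # [0.2, 0.3, 0.4, 0.7, 0.4, 0.0]
```

Values are rounded only for display, since sums such as 0.2 + 0.1 are not exact in floating point.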

For the nucleotide density calculation, an evenly weighted window is used; in Table 3, each of the w weights equals 1/w. As explained above, the output of the convolution sum then represents the nucleotide density along the sequence. The detailed algorithms for bin construction, bin signature and the filter operation are given in **Tables 1**–**3**, respectively.

1. 0 → bincount
2. 4<sup>q</sup> → n
3. cell(1, n) → bin
4. **for** first = 1:4 **do**
5. **for** qth = 1:4 **do**
6. convert integer to nucleotide character ([first … qth]) → binq
7. bincount = bincount + 1
8. binq → bin{bincount}
9. **end**
10. **end**
11. bin → *B<sub>q</sub>*

**Table 1.** *Bin construction.*

**Input:** Sequence (seq), bin (b) **Output:** Bin Signature

1. *m* ← length(*seq*)
2. *nbin* ← length(*b*)
3. **for** *i* ← 1 … *m* − (*nbin* − 1) **do**
4. **if** *seq*(*i* : *i* + *nbin* − 1) = *b* **then**
5. *signature*(*i*) = 1
6. **else**
7. *signature*(*i*) = 0
8. **end**
9. **end**
10. *signature* → *Bin Signature*

**Table 2.** *Bin signature.*

**Input:** *BinSignature*, *window* **Output:** *filter*

```
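w = length(window);                      % window width
window = (1/w) * ones(1, w);             % evenly weighted window, weights sum to 1
acc = zeros(1, length(BinSignature));    % running sum, started at zero
for i = 1:length(window)
    pad = zeros(1, i - 1);               % i-1 leading zeros delay the signature
    acc = acc + window(i) * [pad, BinSignature(1:end-(i-1))];
end
filter = acc;                            % output; the name shadows the built-in filter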
```
**Table 3.** *Filter.*
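Combining the bin signature with an evenly weighted window gives a moving-average density along the sequence. The following Python sketch illustrates this under stated assumptions: the name `density` and the window width `w = 2` in the usage line are chosen for illustration and are not values taken from the source.

```python
def density(seq, b, w):
    """Moving average, over w uniform weights of 1/w, of the signature of bin b."""
    nbin = len(b)
    signature = [1 if seq[i:i + nbin] == b else 0
                 for i in range(len(seq) - nbin + 1)]
    out = []
    for n in range(len(signature)):
        # Present value and up to w - 1 past values; positions before
        # the start of the sequence contribute zero.
        window_vals = signature[max(0, n - w + 1):n + 1]
        out.append(sum(window_vals) / w)
    return out


print(density("AACTCG", "A", 2))  # [0.5, 1.0, 0.5, 0.0, 0.0, 0.0]
```

The output has the same length as the signature, so densities of different bins of equal length can be compared position by position.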

### **3. Sequence analysis**

The filter output is taken as the density distribution of a DNA sequence. This distribution, which is based on q-gram word density, is in turn used to determine the Shannon entropy as

$$y_i = -\sum_{j=1}^{q} p_{ij} \log p_{ij} \tag{3}$$

where *p<sub>ij</sub>* is the probability of appearance of the *j*th genetic letter at the *i*th position in the genetic sequence. Next, we seek a similarity/dissimilarity measure between two entropy distributions ρ<sub>i</sub> = (*y<sub>i1</sub>*, *y<sub>i2</sub>*, …, *y<sub>in</sub>*) and ρ<sub>j</sub> = (*y<sub>j1</sub>*, *y<sub>j2</sub>*, …, *y<sub>jn</sub>*). We construct the data matrix D comprising the elements [ρ<sub>1</sub>, ρ<sub>2</sub>, …, ρ<sub>m</sub>]′, where *m* is the number of sequences. Principal component analysis (PCA) is used to estimate scores for the density distributions; it reduces the multidimensional data set to fewer dimensions while remaining consistent with the original data matrix [16].
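The entropy of Eq. (3) can be sketched in Python as below. Several points are assumptions made for illustration and are not stated in the source: the probabilities *p<sub>ij</sub>* are obtained by normalising the bin densities at each position so that they sum to one, the natural logarithm is used, and 0·log 0 is taken as 0. The name `positional_entropy` and the sample densities are hypothetical.

```python
import math


def positional_entropy(densities):
    """Shannon entropy at each position from per-bin density values.

    densities maps each bin to its list of density values along the sequence.
    """
    n = len(next(iter(densities.values())))
    entropy = []
    for i in range(n):
        column = [values[i] for values in densities.values()]
        total = sum(column)
        h = 0.0
        if total > 0:
            for d in column:
                p = d / total  # assumed normalisation into a probability
                if p > 0:      # convention: 0 * log 0 = 0
                    h -= p * math.log(p)
        entropy.append(h)
    return entropy


# Hypothetical densities for three positions, for illustration only.
example = {
    "A": [1.0, 0.5, 0.25],
    "C": [0.0, 0.5, 0.25],
    "G": [0.0, 0.0, 0.25],
    "T": [0.0, 0.0, 0.25],
}
print([round(h, 4) for h in positional_entropy(example)])  # [0.0, 0.6931, 1.3863]
```

The entropy is zero where a single letter dominates and reaches log 4 where all four letters are equally likely.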

We determine the dissimilarity between two sequences from the scores in the first three principal components by computing the Euclidean distance between pairs of density distributions in the m-by-n data matrix D. Rows of D correspond to sequences (observations) and columns correspond to position indices in the sequence (variables). Thus, the Euclidean distance X is a row vector of length *m*(*m*–1)/2, with one entry per pair of observations in D. The distances are arranged in the order *(2, 1), (3, 1), … , (m, 1), (3, 2), … , (m, 2), … , (m, m–1).* X is used as a dissimilarity matrix in clustering or multidimensional scaling. The unweighted pair group method with arithmetic mean (UPGMA) is applied to the PC scores to construct a phylogenetic tree [17]. UPGMA uses a local objective function to construct a rooted bifurcating tree.
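The pair ordering of the distance vector and the UPGMA merging rule can be illustrated with a small pure-Python sketch. It assumes the rows are PC scores that have already been computed (the PCA step is not shown), the names `pairwise_distances` and `upgma` are illustrative, and the three score vectors are hypothetical. The tree is returned as nested tuples without branch lengths, which is a simplification.

```python
import math


def pairwise_distances(rows):
    """Euclidean distances in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,m-1)."""
    m = len(rows)
    return [math.dist(rows[i], rows[j])
            for j in range(m) for i in range(j + 1, m)]


def upgma(rows, labels):
    """Build a nested-tuple tree by repeatedly merging the closest clusters.

    The distance between two clusters is the arithmetic mean of all
    pairwise distances between their members (average linkage).
    """
    clusters = [(label, [k]) for k, label in enumerate(labels)]

    def avg_dist(a, b):
        return sum(math.dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        _, p, q = min((avg_dist(clusters[p][1], clusters[q][1]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        merged = ((clusters[p][0], clusters[q][0]),
                  clusters[p][1] + clusters[q][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return clusters[0][0]


# Hypothetical scores on the first three principal components.
scores = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
print(pairwise_distances(scores))         # [1.0, 5.0, 4.0]
print(upgma(scores, ["s1", "s2", "s3"]))  # ('s3', ('s1', 's2'))
```

For m = 3 the distance vector has 3·2/2 = 3 entries, in the order (2, 1), (3, 1), (3, 2) described above.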
