**4. Results and discussions**

The nucleotide density distribution was obtained with an FIR filter. We calculated the density distribution of one-, two- and three-gram nucleotides for the different species, and then the entropy distributions ρi = (yi1, yi2, … , yin) and ρj = (yj1, yj2, … , yjn). The variation of entropy with position was calculated for all sequences for these three combinations. The entropy values were found to be lowest for the mono-mer density distributions in individual sequences, increasing linearly for di-mers and codons respectively. Observations based on the position of the n-mers in the SARS-CoV-2 genome sequences reveal significant minimum-entropy regions for codons. **Figure 2** shows the entropy profile calculated over 29000 bases for seven sequences. The corresponding profiles for mono-mers and di-mers do not show overlapping regions across the different sequences. This suggests that codons are more effective in transferring information across species. Codon bias has been reported for the HIV-1 virus [18]. It can therefore be inferred that, in the various novel coronavirus strains, the codons at specific positions are the most strongly biased, representing minimum entropy, and hence carry the maximum information. Further study of the sequences at these loci may be useful in genetic engineering, for developing vaccines or for controlling the spread of the second wave of the pandemic.

We have chosen seven SARS coronavirus (SARS-COV) sequences from various countries. The details of the organisms are presented in **Table 4**.

The nucleotide density distribution is first generated by FIR filtering, for one-, two- and three-gram nucleotides in each species. The entropy distributions ρi = (yi1, yi2, … , yin) and ρj = (yj1, yj2, … , yjn) are then calculated from these densities. **Figure 3** displays the spatial variation of the entropy along the SARS-COV sequence for the seven species.
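The filtering step can be illustrated with a minimal sketch. The filter is not specified at this point in the text, so the sketch assumes the simplest FIR filter, a moving average whose taps all equal 1/window, applied to a 0/1 indicator sequence marking where a chosen n-gram starts. The function names, the toy sequence and the window length are illustrative assumptions, not the settings used for the figures.

```python
def indicator(seq, symbol):
    """Return 1 at each position where `symbol` starts in `seq`, else 0."""
    k = len(symbol)
    return [1 if seq[i:i + k] == symbol else 0
            for i in range(len(seq) - k + 1)]


def moving_average_fir(signal, window):
    """Apply an FIR filter of length `window` with all taps equal to 1/window.

    Assumption: a plain moving average stands in for the FIR filter.
    Only fully overlapping windows are returned.
    """
    if window <= 0 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    running = sum(signal[:window])
    out = [running / window]
    for i in range(window, len(signal)):
        # Slide the window one step: add the new sample, drop the oldest.
        running += signal[i] - signal[i - window]
        out.append(running / window)
    return out


def density(seq, symbol, window):
    """Local density of `symbol` (a base, a di-mer or a codon) along `seq`."""
    return moving_average_fir(indicator(seq, symbol), window)


# Toy sequence, made up for illustration only.
example = "ATGATGCCGATG"
atg_density = density(example, "ATG", window=4)
```

Passing a single base, a di-mer or a codon as `symbol` gives the one-, two- and three-gram densities from the same function.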

In fact, it is difficult to take in all of the entropy variation in a 2D graphical representation. For example, the organism HKU1 shows clear entropy minima at positions around 7400, 10000 and 23000, among others, and the Amsterdam strain, NL63, shows minima at around 7300 and 8000. The other strains, however, have crowded entropy profiles, and it is difficult to tell their variation apart. It is therefore more informative to show the entropy variation for all seven sequences in a single panel, as in **Figure 3**.

**Table 4.** *SARS-COV strains with their complete genome sequence, accession no. and source.*

#### **Figure 2.**

*Entropy profile of seven SARS-COV sequences. Entropy is calculated based on single nucleotide distribution. Sequences are: B: Wuhan-Hu-1; C: CV7; D: MERS-CoV/C1272; E: HCoV-OC43; F: NL63; G: HCoV\_229E; H: HKU1 (see Table 4).*

#### **Figure 3.**

*Entropy profile of 7 SARS-COV sequences. Entropy is calculated based on single nucleotide distribution. Sequences are numbered from 1 to 7 (see Table 4).*

The present work assesses the variability and complexity at each nucleotide site by calculating the entropy at each position with the Shannon entropy formula, Eq. (2). The low-entropy regions around positions 7400 and 9000 are common to all 7 sequences (**Figure 3**). Entropy (Y*i*) is an important parameter for understanding sequence stability. Y*i* is maximal when all symbols occur with equal probability. Conversely, Y*i* is minimal when one symbol occurs with probability 1, in which case the other symbols are forbidden. Thus, the lower the entropy, the more stable and the less complex the site. Under this assumption, the zones around positions 7400 and 9000 are the most stable for all strains/species. There may be a useful structural relationship between the regions of low entropy and the secondary structure of proteins, which includes α-helices, β-sheets and loop regions.
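The two limiting cases described above can be checked numerically. The sketch below assumes the standard Shannon form, H = -Σ p log2 p, with a base-2 logarithm (the base used in Eq. (2) may differ), and uses a sliding window to obtain one entropy value per position. The parameters `window` and `step` and the toy strings are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the symbol frequencies in `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    h = 0.0
    for count in counts.values():
        p = count / total
        h -= p * math.log2(p)
    return h


def entropy_profile(seq, window, step=1):
    """Entropy of each length-`window` stretch of `seq`, taken every `step` bases."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


# Four equally likely bases: the maximum for a four-letter alphabet.
uniform = shannon_entropy("ACGT")
# A single base with probability 1: the minimum.
fixed = shannon_entropy("AAAA")
# Toy profile: a fully conserved stretch followed by a fully mixed one.
profile = entropy_profile("AAAAACGT", window=4, step=4)
```

For a four-letter alphabet the entropy therefore ranges from 0 bits (a fully conserved site) to 2 bits (all four bases equally likely), which gives the scale against which low-entropy regions can be read.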

### *Entropy Based Biological Sequence Study DOI: http://dx.doi.org/10.5772/intechopen.96615*

**Figure 4.** *Dissimilarity matrix for 7 SARS-COV sequences.*

Strains 4–7 (HCoV-OC43, NL63, HCoV\_229E and HKU1) show stability, with lower entropy around site positions 8 K, 9 K, 11 K and 12 K. This behaviour is not exhibited by strains 1–3 (Wuhan-Hu-1, CV7 and MERS-CoV/C1272). Taken as a whole, these strains show increasing entropy, that is, greater complexity. This is an indication of evolutionary development among the SARS-COV strains. Based on site entropy, we prepared the dissimilarity matrix for the sequences (**Figure 4**).
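A dissimilarity matrix of this kind can be sketched as follows. The distance measure is not stated here, so the sketch assumes the Euclidean distance between two site-entropy profiles of equal length; the function name and the toy profiles are hypothetical and are not values from the seven strains.

```python
import math


def dissimilarity_matrix(profiles):
    """Pairwise Euclidean distances between entropy profiles.

    profiles: dict mapping a sequence name to a list of entropy values.
    Assumption: Euclidean distance; another metric may have been used.
    Returns (names, matrix) with matrix[i][j] the distance between
    names[i] and names[j].
    """
    names = list(profiles)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = profiles[names[i]], profiles[names[j]]
            if len(a) != len(b):
                raise ValueError("profiles must have the same length")
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            matrix[i][j] = matrix[j][i] = d
    return names, matrix


# Made-up toy profiles, for illustration only.
toy_profiles = {
    "s1": [1.0, 1.0, 2.0],
    "s2": [1.0, 1.0, 2.0],
    "s3": [4.0, 5.0, 2.0],
}
toy_names, toy_matrix = dissimilarity_matrix(toy_profiles)
```

Identical profiles give a distance of zero, so sequences with similar entropy landscapes appear as low-valued blocks in the matrix, which is how clusters show up visually.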

The dissimilarity matrix demonstrates the existence of 4 different clusters. The SARS-COV sequences within a cluster show little dissimilarity among themselves; in other words, sequences residing in the same cluster are highly similar [19]. The COVID sequence appearing in cluster-I is the typical one from Wuhan, China. Examination of the Wuhan virus genome sequence identified a β-CoV strain [20]. The Wuhan novel β-CoV showed 88% similarity with the sequences of two bat-derived SARS-COV, among them bat-SL-CoVZC45, and was named "SARS-CoV-2" by the International Virus Classification Commission. The genome of SARS-CoV-2 resembles that of typical CoVs. It encompasses more than ten open reading frames (ORFs). The first ORFs cover about two-thirds of the viral RNA and are translated into two large polyproteins, pp1a and pp1ab. These proteins help to form the viral replicase transcriptase complex [21]. The remaining one-third of the viral RNA is translated into four structural proteins: spike (S), envelope (E), nucleocapsid (N) and membrane (M) proteins [22].

Cluster-II comprises two strains, CV7 and MERS-CoV, which belong to the β-CoV genus; this genus also includes the SARS-CoV-2 strain, which is placed alone in cluster-I. The two HCoV strains HCoV-229E and HCoV-OC43, placed in the mixed cluster of III and IV, are members of the α-CoV genus. From the cluster presentation (**Figure 5**), it can be seen that they belong to cluster-III. The remaining two strains, NL63 and HKU1, are placed in cluster-IV.

The phylogenetic relation among the strains is shown in **Figure 6**. We obtained the phylogenetic tree of the data set using the unweighted pair group method with arithmetic mean (UPGMA) on the PC scores. The tree clearly shows the relationship among all COVID strains within each cluster. We further formed subclusters within each cluster based on genetic distance (GD), using the PC scores to determine the dissimilarity, or genetic distance, between two organisms.
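The UPGMA procedure can be sketched in a few lines. The sketch takes a precomputed distance matrix as input (for instance, distances between PC score vectors, where the choice of distance is an assumption) and repeatedly merges the two closest clusters, updating distances by the size-weighted average. The function name and the toy distances are hypothetical.

```python
def upgma(names, dist):
    """Cluster by average linkage (UPGMA).

    names: list of labels; dist: symmetric matrix of pairwise distances.
    Returns the merges in order as (members_a, members_b, distance).
    """
    members = {i: (name,) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    next_id = len(names)
    while len(members) > 1:
        # Pick the closest pair of active clusters.
        (a, b), height = min(d.items(), key=lambda item: item[1])
        size_a, size_b = len(members[a]), len(members[b])
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted mean equals the mean over all original pairs.
            updated[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        # Drop every distance that involves the two merged clusters.
        d = {pair: v for pair, v in d.items()
             if a not in pair and b not in pair}
        d.update(updated)
        merges.append((members[a], members[b], height))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges


# Made-up toy distances: A and B are close, C and D are close.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0.0, 2.0, 8.0, 8.0],
    [2.0, 0.0, 8.0, 8.0],
    [8.0, 8.0, 0.0, 4.0],
    [8.0, 8.0, 4.0, 0.0],
]
toy_merges = upgma(toy_names, toy_dist)
```

The merge order and merge distances are what a dendrogram draws; in the usual UPGMA convention the branch height of a node is half the merge distance.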

#### **Figure 5.**

*Scatter plot of PC values for 7 SARS-COV sequences: Cluster-I (Wuhan-Hu-1) is encircled with deep blue color. Cluster-II (CV7 and MERS-CoV) is encircled with light blue color. Cluster-III (HCoV-229E and HCoV-OC43) is encircled with green color. Cluster-IV (NL63 and HKU1) is encircled with yellow color.*

**Figure 6.** *The phylogenetic tree of 7 SARS-COV sequences.*

The COVID strains are placed explicitly in a cluster description (**Figure 5**). The scores are determined by principal component analysis, with three principal components taken into consideration. Each strain is represented as a state point in a scatter plot in the three-PC space. The cluster presentation agrees well with the phylogenetic relations. The Wuhan-Hu-1 strain is well isolated from all other strains and belongs to cluster-I. Each of the other three clusters contains two strains. Cluster-II comprises the two strains CV7 and MERS-CoV, belonging to the β-CoV genus (encircled with a blue ellipse in **Figure 5**). As already mentioned, the strains HCoV-229E and HCoV-OC43 lie in cluster-III, displayed as two state points encircled by a green ellipse. The remaining pair of strains, NL63 and HKU1, is placed in cluster-IV, which is marked by a yellow ellipse.
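The PC scores can be illustrated with a small sketch. It assumes that each strain is described by a row of features (for example, its entropy profile) and computes the scores from the Gram matrix of the centred rows by power iteration with deflation, which suits the case of few strains and many positions. This is one of several equivalent ways to do PCA and is not necessarily the procedure used for **Figure 5**; the function name, parameters and toy data are hypothetical.

```python
import math


def pca_scores(rows, n_components=3, iterations=1000, tol=1e-12):
    """Principal component scores for each row (one row per strain).

    Uses the n x n Gram matrix of the centred rows, so the cost depends
    on the number of strains rather than on the number of positions.
    Assumption: simple power iteration; convergence is slow when two
    leading eigenvalues are nearly equal. The sign of each component
    is arbitrary.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]
    gram = [[sum(p * q for p, q in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    scores = [[0.0] * n_components for _ in range(n)]
    for comp in range(min(n_components, n)):
        u = [float(i + 1) for i in range(n)]  # deterministic start vector
        lam = 0.0
        for _ in range(iterations):
            w = [sum(gram[i][k] * u[k] for k in range(n)) for i in range(n)]
            norm = math.sqrt(sum(v * v for v in w))
            if norm < tol:  # no variance left in the deflated matrix
                lam = 0.0
                break
            u = [v / norm for v in w]
            lam = norm  # converges to the leading eigenvalue
        if lam < tol:
            break
        for i in range(n):
            scores[i][comp] = math.sqrt(lam) * u[i]
        # Deflate: remove the component just found.
        for i in range(n):
            for k in range(n):
                gram[i][k] -= lam * u[i] * u[k]
    return scores


# Made-up toy data lying on a straight line: one component carries everything.
toy_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
toy_scores = pca_scores(toy_rows, n_components=2)
```

Each row of the result is one state point; plotting the first three columns against each other gives a scatter plot of the kind described above, and distances between rows can serve as input to a clustering step.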
