Entropy Based Biological Sequence Study

*Bimal Kumar Sarkar*

## **Abstract**

Strains of the SARS-CoV-2 virus are analyzed as digitized sequences of information using the notion of entropy. Special attention is paid to the occurrence of particular patterns in the coronavirus sequence. The incidence of a genetic word is represented as a density. The incidence frequency of a q-gram genetic word is determined along the sequence with the help of a finite impulse response (FIR) filter. This frequency is in turn used to determine the probability distribution of genetic word incidence, which serves as the input for calculating the entropy of the sequence. The sequence entropy is then used in a principal component analysis (PCA) to determine the similarity or dissimilarity between the viral sequences. We consider seven human coronavirus sequences and present an entropy-based similarity study of SARS-CoV-2 strains.

**Keywords:** sequences, genetic information, FIR filter, entropy, PCA, corona virus

### **1. Introduction**

The entropy of the amino acid sequences encoded in an organism's DNA can be considered a measure of the diversity of its proteins. The higher the entropy, the greater the possible variation in the information content coded by the nucleic acid [1]. This idea is used in the present study to understand the variation in the genetic sequences of the different novel coronaviruses that have infected people across the world, leading to one of the world's biggest pandemics. The pandemic itself highlights the importance of tracking the dynamics of viral transmission in real time. Moreover, because the virus mutates frequently, each sequence is studied and compared with the others to understand the variation in the information that is transmitted from one species to another. A hyper-variable genomic hotspot for the novel coronavirus SARS-CoV-2 has already been identified by Wen et al. [2]. Likewise, similarities in the genetic code would provide important information for understanding the virus and preventing its spread.

A coronavirus has a single-stranded, positive-sense RNA genome approximately 27 to 32 kilobases (kb) in length. The genomes of HCoV-229E and HCoV-NL63 are approximately 27.5 kb long, whereas those of HCoV-OC43 and HCoV-HKU1 exceed 30 kb. The RNA carries a 5′-cap structure and a 3′ polyadenylate tail, which enable it to act as messenger RNA (mRNA) [3–10].

This study presents the identification and analysis of regions of similarity in SARS-CoV genetic sequences [11–13]. According to information theory, the individuality of a species can be described in terms of aggregates that propagate information from past to future. The Shannon entropy is taken as a measure of the order or disorder of the nucleotide sequences of DNA [14]. The information in the genetic code consists of a sequence over the four-letter alphabet A, C, G, and T, whose letters symbolize the four nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T). Sequences have been determined for most of the SARS-CoV-2 genes and are accessible in computer-readable form. The probability of occurrence of a particular combination of symbols in a sequence is a measure of the order in that sequence. An alignment-free approach to DNA sequence analysis, *n*-mer (word) frequency estimation, is adopted in this work.
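To make the link between word frequencies and entropy concrete, the following is a minimal sketch, not the FIR-filter procedure used in this work. It counts overlapping words of length `q` with a plain sliding window, normalizes the counts into an empirical probability distribution, and evaluates the Shannon entropy in bits. The function names `word_counts` and `shannon_entropy`, the toy sequence, and the choice to skip words containing symbols outside A, C, G, T are all illustrative assumptions.

```python
from collections import Counter
from math import log2

NUCLEOTIDES = set("ACGT")


def word_counts(seq, q):
    """Count overlapping words of length q in seq (hypothetical helper).

    Assumption: words containing any symbol outside A, C, G, T
    (for example the ambiguity code N) are skipped.
    """
    seq = seq.upper()
    words = (seq[i:i + q] for i in range(len(seq) - q + 1))
    return Counter(w for w in words if set(w) <= NUCLEOTIDES)


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution of counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Toy sequence for illustration only; not taken from any viral genome.
toy = "ACGTACGTAC"
counts = word_counts(toy, 2)       # overlapping 2-letter words
entropy_bits = shannon_entropy(counts)
print(dict(counts), entropy_bits)
```

For a word length `q`, the entropy computed this way is bounded above by `2 * q` bits, which is reached only when all `4 ** q` possible words occur equally often; lower values indicate a more ordered sequence.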
