And in regions outside Africa where there was historical forced migration as a result of the slave trade [23].

¥ Vietnamese residing in Canada [24].

**Table 1.** Comparison of the virological and clinical characteristics of the genotypes and subgenotypes of HBV¶.

### **2.7. Genotyping and subgenotyping methods**

HBV genotypes, and in some cases subgenotypes and various mutations, can influence the clinical course of disease [22] and the response to antiviral therapy [25], and can be used to show transmission [26] and to trace human migrations [23]. Thus, HBV genotyping is becoming increasingly relevant in the clinical setting, may contribute to future personalized treatment [27], and may be important in epidemiological and transmission studies. Bioinformatics has played a major role in the development of various tools that can be used for identifying genotypes/subgenotypes and detecting various mutations, and a number of methods have been developed [28, 29].

Although analysis of the HBV *S* gene sequence is sufficient to classify HBV into genotypes [30], the complete genome sequence provides additional information with respect to phylogenetic relatedness [31, 32], including the identification of recombinants. However, even though complete genome analysis is the gold standard for genotyping, it does not allow for rapid and direct analysis on a large scale [17] and requires expertise, and thus capacity development, in computer processing coupled with phylogenetic analyses. In order to expedite and facilitate genotyping, a number of methods have been developed [17, 28, 29]. Each one has its advantages and disadvantages [17, 28, 29], which should be taken into account when selecting the genotyping method appropriate for a particular study or application.

### **2.8. Phylogenetic analyses of HBV**

Although, as already mentioned, the error-prone polymerase of HBV leads to sequence heterogeneity [16], the degree to which this can occur is constrained by the partially overlapping ORFs and the presence of secondary RNA structures, such as ε, coded by non-overlapping regions [33, 34]. The HBV genome has been estimated to evolve at a rate of ~10⁻³ to 10⁻⁶ nucleotide substitutions/site/year [35–41], although this rate is not constant across the different regions of the HBV genome [41]. Advances in computing and information technology have played an important role in the development of phylogenetic analysis as a powerful tool in the analysis of the molecular evolution of viruses.

As exemplified in the next sections, comparative analysis of HBV strains from various geographic regions of the world and from different eras can shed light on the origin, evolution, transmission, and response to anti-HBV preventative and treatment measures.

### **2.9. Origin**

The origin and age of the family *Hepadnaviridae* remain controversial. Until the issues with the estimation of the substitution rate of HBV [41] are overcome, the debate on the origin of HBV will continue ([17, 41] and references cited therein). Nonetheless, bioinformatics, coupled with the growing number of hepadnaviral sequences in the databases with accurate sampling times, and advances in phylogenetic and coalescent methodology [42], is beginning to shed light on this issue. For example, according to Suh and colleagues [43], analysis of the endogenous sequences in the zebra finch provides direct evidence that the compact genomic organization of hepadnaviruses has not changed during at least the last 82 million years of hepadnaviral evolution. Furthermore, phylogenetic analyses and the distribution of HBV relics suggest that birds are potentially the ancestral hosts of the family *Hepadnaviridae* and that mammalian hepatitis B viruses probably emerged after a bird–mammal host switch [43].

### **2.10. Evolution**

Genetic variation is important in viral evolution. The sequence heterogeneity displayed by HBV because of the lack of proof-reading ability of the polymerase is limited by functional constraints [33], leading to non-random variation [44]. Moreover, mutations can be affected by host–virus interaction and selective pressure, imposed endogenously by the immune system and exogenously by vaccination and antiviral treatment [17]. Phenotypic resistance to antiviral drugs occurs because of mutations in the reverse transcriptase of POL, whereas mutations in the *BCP/preC* and *preS* regions have been implicated as risk factors for the development of HCC. Mutations in the *S* region coding for HBsAg can lead to both vaccine and detection escape of HBV. At any time, the virus population can be composed of a number of different mutants referred to as "quasispecies" [45]. Direct sequencing and, more recently, next-generation sequencing (NGS), in parallel with bioinformatics, provide us with powerful tools to study the evolution of the various HBV mutations. NGS, or ultra-deep sequencing, generates large volumes of data, which can only be analyzed using bioinformatics tools, and provides deep coverage that can detect minor quasispecies populations of HBV [46–51] that may be important in understanding HBV pathogenicity and response to treatment. In order to minimize the number of artifactual calls of single-nucleotide variations in NGS, it is important that the correct reference sequences are used [51, 52].

By designing a circular construct, Homs and co-workers [53] were able to use NGS to study the evolution of both the precore and polymerase regions. They demonstrated the presence of precore mutants in the HBeAg-positive phase, wild-type precore in the HBeAg-negative phase, as well as lamivudine-resistant strains in treatment-naïve patients. This demonstrates that viral strains occurring at low frequencies can act as reservoirs or memory genomes, which are selected and evolve in response to both intrinsic (host immune response) and extrinsic (drug administration) factors.
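The idea of detecting low-frequency variants from deep coverage can be illustrated with a minimal Python sketch. It assumes that reads have already been aligned to a reference and padded to equal length; the function name `minor_variants`, the 1% threshold, and the toy data are hypothetical and are not taken from any of the cited pipelines, which work from quality-filtered alignment files.

```python
from collections import Counter

def minor_variants(reads, reference, min_freq=0.01):
    """Return {position: {base: frequency}} for non-reference bases.

    reads: equal-length strings already aligned to `reference`
           ('-' marks positions a read does not cover).
    Positions are 1-based. Only bases at or above min_freq are reported.
    """
    result = {}
    for i, ref_base in enumerate(reference):
        # Count the bases observed at this column, ignoring uncovered positions
        counts = Counter(r[i] for r in reads if r[i] != "-")
        depth = sum(counts.values())
        if depth == 0:
            continue
        variants = {
            base: n / depth
            for base, n in counts.items()
            if base != ref_base and n / depth >= min_freq
        }
        if variants:
            result[i + 1] = variants
    return result

# Toy data: 3 of 100 reads carry a T at position 2
reference = "ACGT"
reads = ["ACGT"] * 97 + ["ATGT"] * 3
print(minor_variants(reads, reference))  # {2: {'T': 0.03}}
```

Because every non-reference base is counted against the chosen reference, an inappropriate reference sequence would inflate the number of reported variants, which is the concern raised above regarding artifactual calls.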

### **2.11. Transmission and tracing human migrations**

184 Bioinformatics - Updated Features and Applications


Sequencing and bioinformatics have played an important role in demonstrating transmission routes, for which previous evidence could only be anecdotal. For example, molecular characterization of HBV together with phylogenetic analysis was used to demonstrate inter-spousal transmission of HBV, even after long marriages, in two Japanese patients who developed acute liver failure [54]. Similarly, the first known case of transfusion-transmitted HBV infection by blood screened using individual donor nucleic acid testing was confirmed by the 99.7% sequence homology between the complete genome sequences of the donor and the recipient HBV strains [26]. When migration events were estimated by ancestral state reconstruction using the criterion of parsimony, it was shown that Africa was the most probable source of dispersal of subgenotype A1 of HBV globally and that its dispersal to Asia and Latin America occurred as a result of the slave and trade routes [23, 55].
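To make the notion of parsimony-based ancestral state reconstruction more concrete, the following Python sketch implements the bottom-up pass of the Fitch algorithm on a rooted binary tree. This is a generic textbook illustration, not the method or software used in [23, 55]; the tree topology, tip names, and region labels are invented toy data.

```python
def fitch(tree, tip_states):
    """Bottom-up Fitch parsimony pass on a rooted binary tree.

    tree: a tip name (str) or a 2-tuple (left_subtree, right_subtree).
    tip_states: {tip name: observed state, e.g. geographic region}.
    Returns (candidate state set at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], tip_states)
    right_set, right_cost = fitch(tree[1], tip_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one state change
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical tree of five sequences and their sampling regions
tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
regions = {"s1": "Africa", "s2": "Africa", "s3": "Asia",
           "s4": "Africa", "s5": "Latin America"}
root_states, changes = fitch(tree, regions)
print(root_states, changes)  # {'Africa'} 2
```

In this toy example, the most parsimonious root state is "Africa", with two inferred dispersal events. Real analyses operate on trees with many more tips and typically also report ambiguous reconstructions.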

### **2.12. Treatment response and resistance to treatment**

According to international chronic hepatitis B treatment guidelines, the most desirable endpoint of treatment is HBsAg loss. Following HBsAg loss, patients have better clinical outcomes, including decreased risk of developing cirrhosis and HCC, and of death [56]. However, the currently available treatments, which include either nucleos(t)ide analogues (NAs) for direct inhibition of the viral polymerase or pegylated interferon (PegIFN) for immune-mediated HBV control, generally achieve HBV DNA suppression and HBeAg loss only, which are not enduring. In an attempt to identify viral factors associated with HBsAg loss, Charuworn et al. [57] demonstrated that viral diversity could differentiate patients who would lose HBsAg when treated with tenofovir disoproxil fumarate. Lower diversity was seen in the protein-encoding regions of HBV from patients who lost HBsAg compared to those who did not. On the other hand, higher diversity in regulatory elements of HBV was found to be a predictor of HBsAg loss [57]. These findings need to be confirmed by studies incorporating larger numbers of patients, as well as genotypes other than A and D.

The high mutation rate of HBV means that it can evolve to develop resistance against NAs that target the viral DNA polymerase. Drug-resistant mutants are selected under drug pressure, allowing HBV to survive in the presence of the NA. The development of drug resistance mutations can be affected by HBV DNA levels at baseline, rate of viral suppression, length of NA treatment, and prior exposure to NA treatment [58]. Sequential treatment with different NAs, following drug failure, can lead to the development of multidrug resistance, which cannot be treated using currently available drugs [59]. The most frequent lamivudine drug resistance mutants are rtM204V/I, which are also selected by the L-pyrimidine analogues emtricitabine, clevudine, and telbivudine, but are susceptible to the purine analogues adefovir and tenofovir [59]. rtA181V develops following lamivudine treatment but is sensitive to other NAs, whereas rtN236T is resistant to adefovir only. In deciding on treatment options, the detection of genotypic resistance, which is defined as the detection of viral mutations conferring drug resistance, is a priority in clinics. Direct sequencing and NGS of the polymerase region of the HBV genome can detect both well-defined and novel mutations.
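As an illustration of how genotypic resistance might be screened for once the reverse transcriptase (rt) domain has been translated and numbered, consider the following minimal Python sketch. The lookup table contains only the three substitutions named above; the function name, the input format (a dictionary keyed by rt-domain position), and the sample data are assumptions made for this example, and a clinical screen would rely on a curated, up-to-date mutation list.

```python
# Substitutions named in the text above (illustrative, not exhaustive)
RESISTANCE_MUTATIONS = {
    181: {"wild_type": "A", "mutant": {"V"}},
    204: {"wild_type": "M", "mutant": {"V", "I"}},
    236: {"wild_type": "N", "mutant": {"T"}},
}

def find_resistance_mutations(rt_residues):
    """Return labels such as 'rtM204V' for known substitutions.

    rt_residues: {rt-domain position (int): one-letter amino acid},
                 assumed to be numbered from the start of the rt domain.
    """
    found = []
    for pos in sorted(RESISTANCE_MUTATIONS):
        info = RESISTANCE_MUTATIONS[pos]
        residue = rt_residues.get(pos)
        if residue in info["mutant"]:
            found.append(f"rt{info['wild_type']}{pos}{residue}")
    return found

# Hypothetical sample carrying a valine at rt204
sample = {181: "A", 204: "V", 236: "N"}
print(find_resistance_mutations(sample))  # ['rtM204V']
```

A lookup of this kind can only report substitutions that are already catalogued; novel mutations, as mentioned above, require comparison against wild-type reference sequences.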

Bioinformatics tools and databases have been used to better understand HBV mutations and how they develop, especially in response to antiviral therapy and vaccination. Although laboratory methods have been used to study mutations, they are labor intensive, expensive, and limited in the degree of complexity they can investigate. As a more economical alternative, bioinformatics and computer simulation can use available biological data, such as protein sequence and structural information, to investigate interactions between virus, host, and the environment [60]. Thus, Shen et al. [60] showed that most mutations develop in the hydrophobic regions of HBsAg and POL and that the amino acids that are more likely to be mutated are serine and threonine. Understanding how amino acid mutations develop in HBV proteins can facilitate the rational design of both vaccines and drugs [60], for the prevention and treatment of HBV infection, respectively. The use of bioinformatics to compare viral and host genomic patterns, together with clinical information, with data from databases can lead to enhanced and individualized antiviral therapy.

### **3. Bioinformatics tools and databases**

### **3.1. Bioinformatics challenges of HBV**

Despite its small genome size of ~3.2 kb, HBV presents several bioinformatic challenges:

**1.** The genome is circular, with position 1 conventionally taken to be the first "T" nucleotide in the *Eco*RI restriction site ("GAATTC"). Historically, position 1 was the start of the "Core" region, which is position 1901 in the current numbering system. Therefore, a number of sequences deposited earlier in the public databases are numbered using this outdated system and thus require processing before they can be used in alignments together with more recently submitted sequences.

**2.** Four overlapping reading frames are encoded in the circular genome, whereas nucleotides or amino acids are sequenced and processed linearly. Extracting nucleotide or amino acid sequences for the *S* and *POL* ORFs, which span the *Eco*RI site, from full-length or subgenomic fragments, requires additional processing.

### **3.2. Public sequence databases**

The first public sequence database, "GenBank," was established in 1982, having arisen from the earlier Los Alamos database, established in 1979 [61, 62]. Since then, the number of nucleotides in GenBank has doubled approximately every 18 months [63]. The International Nucleotide Sequence Database Collaboration (INSDC) is a collection of three publicly available nucleotide (DNA or RNA) sequence databases, which synchronize data daily [64]. The collection consists of the DNA DataBank of Japan (DDBJ, located in Japan), the European Molecular Biology Laboratory (EMBL, located in the United Kingdom) and GenBank (located in the United States of America). The latest release of the database (release 211.0, from 15 December, 2015; [65]) contains 189,232,925 loci and 203,939,111,071 bases, from 189,232,925 sequences, totaling approximately 742 gigabytes. In addition to the INSDC, many other databases exist, including genome databases, protein sequence, structure and interaction databases, microarray databases, and meta-databases. A list of biological databases on Wikipedia includes over 200 entries [66].

When searching for "hepatitis b virus" across all fields, the GenBank database [63], accessed on 27 January 2016, contained 105,745 sequences. When searching for "hepatitis b virus" in the "organism" field only, 84,119 sequences were found, with the oldest sequence submitted in the early 1980s. Refining this search to include only sequences of 200 nucleotides or longer, and excluding words such as "recombinant," "clone," and "patent," resulted in 68,762 sequences. When this same query was previously executed on 29 November 2015, 67,893 sequences were returned. Therefore, in the 59 days between the two queries, 869 new sequences (of at least 200 nucleotides in length, and not containing the words mentioned previously) were uploaded to GenBank. On average, this equates to almost 15 new HBV sequences added to GenBank per day.
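The arithmetic behind this estimate can be checked with a few lines of Python, using only the dates and counts quoted above (the variable names are chosen for this example):

```python
from datetime import date

first_query = date(2015, 11, 29)   # 67,893 sequences returned
second_query = date(2016, 1, 27)   # 68,762 sequences returned

days = (second_query - first_query).days   # 59 days between queries
new_sequences = 68_762 - 67_893            # 869 new sequences
per_day = new_sequences / days             # roughly 14.7 per day

print(days, new_sequences, round(per_day, 1))  # 59 869 14.7
```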

Making use of these sequences in downstream applications, such as multiple sequence alignments or phylogenetic analyses, is often challenging, as it is difficult to retrieve sufficient sequences of the correct genotype or subgenotype that cover the required genomic region. In order to overcome this limitation, we have developed a bioinformatics solution whereby all sequences matching a query are downloaded, curated, and aligned. The algorithm allows for the generation of a multiple sequence alignment for each genotype, which contains all the available sequences matching the query, each in its correct position and orientation [67].
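One curation step implied by placing sequences in their "correct position" is bringing full-length circular genomes to the same start point and coordinate system. The Python sketch below is a simplified illustration and not the algorithm of [67]: it rotates a full-length sequence so that it begins at the first "T" of the *Eco*RI site, and converts a position from the older Core-based numbering to the current numbering. The function names are hypothetical, the conversion assumes a simple fixed offset (old position 1 = current position 1901), and the genome length must be supplied because it differs between genotypes. Strand orientation is not addressed, as the palindromic "GAATTC" site cannot distinguish the two strands, and sequences lacking the site are simply returned as `None`.

```python
def rotate_to_ecori(genome, site="GAATTC"):
    """Rotate a full-length circular genome to start at the first 'T' of the site.

    Returns the rotated sequence, or None if the site is not present.
    """
    # Append the first bases so that a site spanning the linear ends is found
    circular = genome + genome[: len(site) - 1]
    hit = circular.find(site)
    if hit == -1:
        return None
    start = (hit + site.index("T")) % len(genome)
    return genome[start:] + genome[:start]

def old_to_current_position(old_pos, genome_length):
    """Convert a Core-based (older) position to the current numbering.

    Assumes old position 1 corresponds to current position 1901.
    """
    return (old_pos - 1 + 1900) % genome_length + 1

# Toy 10-nucleotide "genome" containing one EcoRI site
print(rotate_to_ecori("CCGAATTCAA"))          # TTCAACCGAA
print(old_to_current_position(1, 3215))       # 1901
print(old_to_current_position(1316, 3215))    # 1
```

After rotation, ORFs such as *S* and *POL* that span the *Eco*RI site still cross the end of the linear string, so extracting them requires joining the end of the sequence to its beginning.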


### **3.3. Bioinformatics tools for HBV**

A standard molecular biology laboratory workflow includes DNA extraction, polymerase chain reaction (PCR) amplification, direct DNA sequencing, viewing and checking of chromatograms, preparation of curated sequences, multiple sequence alignment, sequence analysis, serotyping, genotyping, phylogenetic analysis, and preparation of sequences for submission to the GenBank public sequence database [68]. Each of these steps presents data processing challenges, many of which have been addressed by the development of a suite of online tools (**Table 2**) [68].

| Workflow | Tool name | Description |
|---|---|---|
| Chromatograms | **Quality score analyzer** | *Plots chromatogram quality scores* |
| Chromatograms | **Automatic contig generator tool** | *Generates a contig from a forward and reverse chromatogram* |
| Alignment | **Automatic alignment clean-up tool** | *Eliminates "gap-columns" and disambiguates ambiguous bases* |
| Alignment | **Mind the gap** | *Splits FASTA file based on gap threshold per column* |
| Analysis | **Babylon** | *Extracts HBV protein sequences (ORFs)* |
| Analysis | **Wild-type 2 × 2** | *Calculates 2 × 2 wild-type/mutant contingency tables* |
| Analysis | **Divergence calculator\*** | *Intra- and inter-group divergence with custom groups* |
| Analysis | **Rafael\*** | *Generates random subsets from an input FASTA file* |
| Serotyping | **HBV serotyper tool** | *Determines HBV serotype* |
| Phylogenetics | **Pipeline: TreeMail** | *Generates a phylogenetic tree* |
| GenBank Submission | **PadSeq** | *Places two HBV sequence fragments on a backbone template* |

¶Table modified from Bell and Kramvis [68].

\* Described for the first time here.

**Table 2.** List of the online tools developed and the workflow process at which each would be used¶.

¶Table modified from [67].

**Table 3.** Currently available HBV websites and databases¶.

Stand-alone, web-based tools can be accessed from any operating system platform and from any location with an internet connection. There is no requirement to install and learn new bioinformatics software, as these tools can be used when required. A system for processing ultra-deep pyrosequencing (amplicon resequencing) data has also been developed [51]. In addition, a number of HBV-specific websites and databases are currently available, a selection of which are represented in **Table 3**.

### **3.4. New bioinformatics tools for HBV**

Here, we present two newly developed tools for the bioinformatic analysis of HBV.

### *3.4.1. Divergence calculator [http://hvdr.bioinf.wits.ac.za/divergence/]*

One method of classifying HBV sequences into genotype or subgenotype is to examine nucleotide sequence divergence between sequences. This divergence calculation is performed by counting the nucleotides that differ between two aligned sequences and expressing the count as a percentage of the alignment length. The divergence calculator (**Figure 3**) performs various divergence calculations on groups of sequences from nucleotide or amino acid multiple sequence alignments in FASTA format. A minimum of one group containing two sequences, or two groups containing one sequence each, must be specified.


**Figure 3.** The input screen of the divergence calculator in which sequences are extracted and allocated to groups and other parameters specified.

As an example, consider an alignment of 10 genotype A sequences (group 1) and 10 genotype D sequences (group 2). Intra-group divergence, for each group, is calculated by comparing each sequence in group 1 with each other sequence in group 1 and then calculating the median, mean, and standard deviation of the divergences. This is then repeated for group 2. The inter-group divergence compares each sequence in group 1 with each sequence in group 2, and then calculates the median, mean, and standard deviation. If more than two groups are specified, the calculations iterate over all groups in turn.

If the optional "query" group is specified, the tool compares each sequence in the query group with each sequence in the other group or groups, but outputs statistics for each sequence in the query group individually. This method would typically be used with a set of unknown query sequences and one or more groups of reference sequences. A comprehensive list of descriptive statistics is included on the output page for each analysis.
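The intra- and inter-group calculations described above can be sketched in Python as follows. This is an illustration of the general approach rather than the code behind the online tool: the function names are hypothetical, gap characters are compared like any other character, and the sample standard deviation is used, both of which are assumptions, since the text does not specify how the tool treats these details.

```python
from itertools import combinations, product
from statistics import mean, median, stdev

def divergence(a, b):
    """Percentage of positions that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * differences / len(a)

def summarize(values):
    """Median, mean, and sample standard deviation of a list of divergences."""
    return {
        "median": median(values),
        "mean": mean(values),
        "sd": stdev(values) if len(values) > 1 else 0.0,
    }

def intra_group(group):
    """Compare each sequence with each other sequence in the same group."""
    return summarize([divergence(a, b) for a, b in combinations(group, 2)])

def inter_group(group_1, group_2):
    """Compare each sequence in group 1 with each sequence in group 2."""
    return summarize([divergence(a, b) for a, b in product(group_1, group_2)])

# Toy aligned sequences standing in for two genotype groups
group_a = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
group_d = ["ACGAACGTTC", "ACGAACGTTT"]

print(intra_group(group_a)["median"])   # 10.0
print(inter_group(group_a, group_d))    # mean and median are both 25.0
```

A per-sequence "query" mode, as described above, would call `divergence` for one query sequence against every reference sequence and summarize the results for that query alone.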

### *3.4.2. Random FASTA extraction and allocation (RAFAEL) [http://hvdr.bioinf.wits.ac.za/rafael/]*

In some analyses, particularly when constructing phylogenetic trees, it may be desirable to extract one or more random subsets of sequences from a master or reference alignment. The "RAFAEL" tool was designed to perform this task. This tool takes an input file in FASTA format, which does not have to be aligned, and generates one or more subsets of the file, each containing a random selection of the specified number of sequences. The number of sequences may be specified as a count, or as a percentage of the number of sequences in the input file. No sequence is selected more than once within a subset. However, the same sequence may appear in more than one subset, as the subsets are not mutually exclusive.
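The behavior described above can be approximated with the Python standard library. The sketch below is not the RAFAEL source code: the function names, the `seed` parameter, and the rounding applied to percentages are assumptions made for illustration. Sampling without replacement within a subset, and independently across subsets, reproduces the two properties stated in the text.

```python
import random

def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def random_subsets(records, n_subsets, size=None, percent=None, seed=None):
    """Draw n_subsets random subsets, sized by count or by percentage.

    Each subset is sampled without replacement, so no record is drawn
    twice within a subset; subsets are drawn independently, so the same
    record may appear in more than one subset.
    """
    if (size is None) == (percent is None):
        raise ValueError("specify exactly one of size or percent")
    if size is None:
        size = round(len(records) * percent / 100)
    rng = random.Random(seed)
    return [rng.sample(records, size) for _ in range(n_subsets)]

fasta_text = """>seq1
ACGT
>seq2
ACGA
>seq3
ACGC
>seq4
ACGG
>seq5
ACTT
"""
records = parse_fasta(fasta_text)
subsets = random_subsets(records, n_subsets=3, percent=40, seed=1)
for subset in subsets:
    print([header for header, _ in subset])
```

With five input sequences and 40% requested, each of the three subsets contains two sequences. Supplying a seed makes the selection reproducible, which may be useful when an analysis needs to be repeated.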

### **3.5. Open-source software**


In addition to biological databases, a large variety of biological analysis software, which is generally genome agnostic, is available. As with software in any field, the licensing terms and commercial costs of these packages vary widely. For example, packages that are free of cost are not necessarily open-source.

The Free Software Foundation (FSF) [76, 77] defines free software as software which "respects the users' freedom" in the sense that "users have the freedom to run, copy, distribute, study, change, and improve the software". As such, "free" is "a matter of liberty, not price". Free software, therefore, does not necessarily have to be made available at no cost or be a noncommercial project. Furthermore, software, which is provided at no cost, may not be "free" in the sense described above.

The term "open-source" is often used when referring to "free" software. However, the two terms are not synonymous, although there is some overlap. Open-source software may, or may not, be free software, depending on the restrictions placed on users by the software. If the user is not free to distribute, change, and improve the software, even if it is open source, then it cannot be considered to be free software. Most software, for which a license is purchased, is not free, or open source. The user does not have the freedom to distribute the software, or to use it on any computer chosen.

### **3.6. Recommended software**

A list of recommended software that is freely available for download is presented in **Table 4**. Comprehensive lists of open-source bioinformatics software can be found elsewhere [78].



| Software name | Software description | Website (http://) | Lin\* | Mac\* | Win\* | References |
|---|---|---|---|---|---|---|
| Unipro UGene | Integrated bioinformatics suite | ugene.net | Yes | Yes | Yes | [79] |
| MEGA 6 | Integrated bioinformatics suite; command-line version available | megasoftware.net | Emu | Yes | Yes | [80] |
| BioEdit | Multiple sequence alignment viewer and editor | www.mbio.ncsu.edu/bioedit/bioedit.html | No | No | Yes | |
| SeaView | Multiple sequence alignment viewer and editor, and molecular phylogenetics; opens GeneDoc "MSF" files | doua.prabi.fr/software/seaview | Yes | Yes | Yes | [81] |
| AliView | Multiple sequence alignment viewer and editor | www.ormbunkar.se/aliview | Yes | Yes | Yes | [82] |
| GeneDoc | Multiple sequence alignment viewer and editor, and shading utility | iubio.bio.indiana.edu/soft/molbio/ibmpc/genedoc-readme.html | No | No | Yes | |
| PHYLIP | Programs for inferring phylogenies; website includes comprehensive list of other phylogenetics programs | evolution.genetics.washington.edu/phylip.html | Yes | Yes | Yes | [83] |
| MrBayes | Bayesian inference and model choice using Markov Chain Monte Carlo | mrbayes.sourceforge.net | Com | Yes | Yes | [84] |
| BEAST | Bayesian analysis of molecular sequences using Markov Chain Monte Carlo | beast.bio.ed.ac.uk | Yes | Yes | Yes | [85] |
| FigTree | Graphical viewer and editor of phylogenetic trees | tree.bio.ed.ac.uk/software/figtree/ | Yes | Yes | Yes | |
| Archaeopteryx | Graphical viewer and editor of phylogenetic trees | sites.google.com/site/cmzmasek/home/software/archaeopteryx | Yes | No | Yes | [86] |
| EMBOSS | A suite of command-line tools for molecular biology | emboss.sourceforge.net/ | Yes | Yes | Emu | |
| JalView | Multiple sequence alignment viewer and editor | www.jalview.org | Yes | Yes | Yes | [87] |
| Finch TV | DNA sequence trace (chromatogram) viewer | www.geospiza.com/Products/finchtv.shtml | Yes | Yes | Yes | |

\* "GUI" = graphical user interface, "CL" = command line interface, "OSS" = open-source software, "Lin" = GNU/Linux, "Mac" = Apple Macintosh, "Win" = Microsoft Windows, "Emu" = emulator or virtual machine recommended by authors, "Com" = compilation from source code required.

**Table 4.** Bioinformatics software available free of charge for various computer operating system platforms.

### **4. Conclusion**

The unique genome structure and molecular biology of HBV pose a number of challenges, and the development of bioinformatic tools has facilitated a more comprehensive and detailed analysis and understanding of the origin, evolution, transmission, and response to antiviral agents of HBV and of its interaction with the host. There is a wide range of free and commercially available tools, which have been developed for different applications. The availability and applications of high-throughput sequencing techniques and the advancement of "-omics" will continue to provide additional challenges, which will need to be addressed by further computational solutions.

### **Acknowledgements**

Trevor Bell is the recipient of a National Research Foundation (NRF) Scarce Skills Post-Doctoral Fellowship (GUN#86215) and Anna Kramvis received funding from the National Research Foundation (GUN#65530, GUN#93516).

### **Author details**

Trevor Graham Bell\* and Anna Kramvis

\*Address all correspondence to: TrevorGrahamBell@gmail.com

Hepatitis Virus Diversity Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa

### **References**

[1] Qi H, Wu NC, Du Y, Wu TT, Sun R. High-resolution genetic profile of viral genomes: why it matters. Curr Opin Virol. 2015;14:62–70. PubMed PMID: 26364133.

[2] Hepatitis B Fact sheet no. 204 [Internet]. 2015. Available from: http://www.who.int/mediacentre/factsheets/fs204/en/ [Accessed: 2016-01-16]

[3] Ott JJ, Stevens GA, Groeger J, Wiersma ST. Global epidemiology of hepatitis B virus infection: new estimates of age-specific HBsAg seroprevalence and endemicity. Vaccine. 2012;30(12):2212–9. PubMed PMID: 22273662. Epub 2012/01/26. eng.

[4] Tiollais P, Pourcel C, Dejean A. The hepatitis B virus. Nature. 1985;317(6037):489–95. PubMed PMID: 2995835.

[5] Lucifora J, Arzberger S, Durantel D, Belloni L, Strubin M, Levrero M, et al. Hepatitis B virus X protein is essential to initiate and maintain virus replication after infection. J Hepatol. 2011;55(5):996–1003. PubMed PMID: 21376091.

[6] Kramvis A, Kew MC. The core promoter of hepatitis B virus. J Viral Hepat. 1999;6(6):415–27. PubMed PMID: 10607259. Epub 1999/12/22. eng.

[7] Moolla N, Kew M, Arbuthnot P. Regulatory elements of hepatitis B virus transcription. J Viral Hepat. 2002;9(5):323–31. PubMed PMID: 12225325. Epub 2002/09/13. eng.

[8] Summers J, Mason WS. Replication of the genome of a hepatitis B-like virus by reverse transcription of an RNA intermediate. Cell. 1982;29(2):403–15. PubMed PMID: 6180831. Epub 1982/06/01. eng.

[9] Yan H, Zhong G, Xu G, He W, Jing Z, Gao Z, et al. Sodium taurocholate cotransporting polypeptide is a functional receptor for human hepatitis B and D virus. eLife. 2012;1:e00049. PubMed PMID: 23150796. Pubmed Central PMCID: 3485615.

[10] Rabe B, Glebe D, Kann M. Lipid-mediated introduction of hepatitis B virus capsids into nonsusceptible cells allows highly efficient replication and facilitates the study of early infection events. J Virol. 2006;80(11):5465–73. PubMed PMID: 16699026. Pubmed Central PMCID: 1472160.

[11] Kock J, Rosler C, Zhang JJ, Blum HE, Nassal M, Thoma C. Generation of covalently closed circular DNA of hepatitis B viruses via intracellular recycling is regulated in a virus specific manner. Plos Pathogens. 2010;6(9):e1001082. PubMed PMID: 20824087. Pubmed Central PMCID: 2932716.

[12] Beck J, Nassal M. Hepatitis B virus replication. World J Gastroenterol. 2007;13(1):48–64. PubMed PMID: 17206754. Pubmed Central PMCID: 4065876.

[13] Kramvis A, Kew MC. Structure and function of the encapsidation signal of hepadnaviridae. J Viral Hepat. 1998;5(6):357–67. PubMed PMID: 9857345. Epub 1998/12/19. eng.

[14] Kramvis A, Kew M, Francois G. Hepatitis B virus genotypes. Vaccine. 2005;23(19):2409–23. PubMed PMID: 15752827. Epub 2005/03/09. eng.

[15] Wang GH, Seeger C. Novel mechanism for reverse transcription in hepatitis B viruses. J Virol. 1993;67(11):6507–12. PubMed PMID: 7692081. Pubmed Central PMCID: 238087.

[16] Glebe D, Bremer CM. The molecular virology of hepatitis B virus. Semin Liver Dis. 2013;33(2):103–12. PubMed PMID: 23749666.


[28] Bartholomeusz A, Schaefer S. Hepatitis B virus genotypes: comparison of genotyping methods. Rev Med Virol. 2004;14(1):3–16. PubMed PMID: 14716688.

[29] Guirgis BS, Abbas RO, Azzazy HM. Hepatitis B virus genotyping: current methods and clinical implications. Int J Infect Dis IJID. 2010;14(11):e941–53. PubMed PMID: 20674432.

[30] Kramvis A, Arakawa K, Yu MC, Nogueira R, Stram DO, Kew MC. Relationship of serological subtype, basic core promoter and precore mutations to genotypes/subgenotypes of hepatitis B virus. J Med Virol. 2008;80(1):27–46. PubMed PMID: 18041043. Epub 2007/11/28. eng.

[31] Hu X, Margolis HS, Purcell RH, Ebert J, Robertson BH. Identification of hepatitis B virus indigenous to chimpanzees. Proc Natl Acad Sci USA. 2000;97(4):1661–4. PubMed PMID: 10677515. Pubmed Central PMCID: 26492. Epub 2000/03/04. eng.

[32] Norder H, Courouce AM, Magnius LO. Complete genomes, phylogenetic relatedness, and structural proteins of six strains of the hepatitis B virus, four of which represent two new genotypes. Virology. 1994;198(2):489–503. PubMed PMID: 8291231. Epub 1994/02/01. eng.

[33] Mizokami M, Orito E, Ohba K, Ikeo K, Lau JY, Gojobori T. Constrained evolution with respect to gene overlap of hepatitis B virus. J Mol Evol. 1997;44 Suppl 1:S83–90. PubMed PMID: 9071016. Epub 1997/01/01. eng.

[34] Torres C, Fernandez MD, Flichman DM, Campos RH, Mbayed VA. Influence of overlapping genes on the evolution of human hepatitis B virus. Virology. 2013;441(1):40–8. PubMed PMID: 23541083. Epub 2013/04/02. eng.

[35] Tedder RS, Bissett SL, Myers R, Ijaz S. The 'Red Queen' dilemma–running to stay in the same place: reflections on the evolutionary vector of HBV in humans. Antivir Ther. 2013;18(3 Pt B):489–96. PubMed PMID: 23792884. Epub 2013/06/26. eng.

[36] Andernach IE, Hunewald OE, Muller CP. Bayesian Inference of the Evolution of HBV/E. Plos One. 2013;8(11):e81690. PubMed PMID: 24312336. Pubmed Central PMCID: 3843692. Epub 2013/12/07. eng.

[37] Fares MA, Holmes EC. A revised evolutionary history of hepatitis B virus (HBV). J Mol Evol. 2002;54(6):807–14. PubMed PMID: 12029362. Epub 2002/05/25. eng.

[38] Orito E, Mizokami M, Ina Y, Moriyama EN, Kameshima N, Yamamoto M, et al. Host-independent evolution and a genetic classification of the hepadnavirus family based on nucleotide sequences. Proc Natl Acad Sci USA. 1989;86(18):7059–62. PubMed PMID: 2780562. Pubmed Central PMCID: 297993. Epub 1989/09/01. eng.

[39] Gunther S, Sommer G, Von Breunig F, Iwanska A, Kalinina T, Sterneck M, et al. Amplification of full-length hepatitis B virus genomes from samples from patients with low levels of viremia: frequency and functional consequences of PCR-introduced mutations. J Clin Microbiol. 1998;36(2):531–8. PubMed PMID: 9466771. Pubmed Central PMCID: 104572. Epub 1998/02/18. eng.

[40] Paraskevis D, Magiorkinis G, Magiorkinis E, Ho SY, Belshaw R, Allain JP, et al. Dating the origin and dispersal of hepatitis B virus infection in humans and primates. Hepatology. 2013;57(3):908–16. PubMed PMID: 22987324. Epub 2012/09/19. eng.


2013;19(41):6995–7023. PubMed PMID: 24222943. Pubmed Central PMCID: 3819535. Epub 2013/11/14. Eng.

[51] Yousif M, Bell TG, Mudawi H, Glebe D, Kramvis A. Analysis of ultra-deep pyrosequencing and cloning based sequencing of the basic core promoter/precore/core region of hepatitis B virus using newly developed bioinformatics tools. Plos One. 2014;9(4):e95377. PubMed PMID: 24740330. Epub 2014/04/18. eng.

[52] Liu WC, Lin CP, Cheng CP, Ho CH, Lan KL, Cheng JH, et al. Aligning to the sample-specific reference sequence to optimize the accuracy of next-generation sequencing analysis for hepatitis B virus. Hepatol Int. 2016;10(1):147–57. PubMed PMID: 26208819. Pubmed Central PMCID: 4722079.

[53] Homs M, Buti M, Quer J, Jardi R, Schaper M, Tabernero D, et al. Ultra-deep pyrosequencing analysis of the hepatitis B virus preCore region and main catalytic motif of the viral polymerase in the same viral genome. Nucleic Acids Res. 2011;39(19):8457–71. PubMed PMID: 21742757. Pubmed Central PMCID: 3201856. Epub 2011/07/12. eng.

[54] Okamoto D, Nakayama H, Ikeda T, Ikeya S, Nagashima S, Takahashi M, et al. Molecular analysis of the interspousal transmission of hepatitis B virus in two Japanese patients who acquired fulminant hepatitis B after 50 and 49 years of marriage. J Med Virol. 2014;86(11):1851–60. PubMed PMID: 25132075.

[55] Lago BV, Mello FC, Kramvis A, Niel C, Gomes SA. Hepatitis B virus subgenotype A1: evolutionary relationships between Brazilian, African and Asian Isolates. Plos One. 2014;9(8):e105317. PubMed PMID: 25122004. Pubmed Central PMCID: 4133366. Epub 2014/08/15. eng.

[56] Liu J, Yang HI, Lee MH, Lu SN, Jen CL, Batrla-Utermann R, et al. Spontaneous seroclearance of hepatitis B seromarkers and subsequent risk of hepatocellular carcinoma. Gut. 2014;63(10):1648–57. PubMed PMID: 24225939.

[57] Charuworn P, Hengen PN, Aguilar Schall R, Dinh P, Ge D, Corsa A, et al. Baseline interpatient hepatitis B viral diversity differentiates HBsAg outcomes in patients treated with tenofovir disoproxil fumarate. J Hepatol. 2015;62(5):1033–9. PubMed PMID: 25514556.

[58] Gish RG, Given BD, Lai CL, Locarnini SA, Lau JY, Lewis DL, et al. Chronic hepatitis B: virology, natural history, current management and a glimpse at future opportunities. Antivir Res. 2015;121:47–58. PubMed PMID: 26092643.

[59] Zoulim F, Durantel D, Deny P. Management and prevention of drug resistance in chronic hepatitis B. Liver Int. 2009;29 Suppl 1:108–15. PubMed PMID: 19207973. Epub 2009/02/12. eng.

[60] Shen K, Shen L, Wang J, Jiang Z, Shen B. Understanding amino acid mutations in hepatitis B virus proteins for rational design of vaccines and drugs. Adv Protein Chem Struct Biol. 2015;99:131–53. PubMed PMID: 26067819.

[61] Kanehisa M, Fickett JW, Goad WB. A relational database system for the maintenance and verification of the Los Alamos sequence library. Nucleic Acids Res. 1984;12(1 Pt 1):149–58. PubMed PMID: 6694899. Pubmed Central PMCID: 320992.


[74] Rozanov M, Plikat U, Chappey C, Kochergin A, Tatusova T. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 2004;32(Web Server issue):W654–9. PubMed PMID: 15215470. Pubmed Central PMCID: 441557.

[75] Yuen LK, Ayres A, Littlejohn M, Colledge D, Edgely A, Maskill WJ, et al. SeqHepB: a sequence analysis program and relational database system for chronic hepatitis B. Antivir Res. 2007;75(1):64–74. PubMed PMID: 17215050.

[76] Free Software Foundation [Internet]. 2016. Available from: http://www.fsf.org/ [Accessed: 2016-01-27]

[77] What is free software? [Internet]. 2016. Available from: http://www.gnu.org/philosophy/free-sw.html [Accessed: 2016-01-27]

[78] List of open-source bioinformatics software [Internet]. 2016. Available from: https://en.wikipedia.org/wiki/List\_of\_open-source\_bioinformatics\_software [Accessed: 2016-01-28]

[79] Okonechnikov K, Golosova O, Fursov M, team U. Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics. 2012;28(8):1166–7. PubMed PMID: 22368248.

[80] Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol. 2013;30(12):2725–9. PubMed PMID: 24132122. Pubmed Central PMCID: 3840312.

[81] Gouy M, Guindon S, Gascuel O. SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol. 2010;27(2):221–4. PubMed PMID: 19854763.

[82] Larsson A. AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics. 2014;30(22):3276–8. PubMed PMID: 25095880. Pubmed Central PMCID: 4221126.

[83] Phylogeny Programs [Internet]. 2016. Available from: http://evolution.genetics.washington.edu/phylip/software.html [Accessed: 2016-03-07]

[84] Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–5. PubMed PMID: 11524383.

[85] Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29(8):1969–73. PubMed PMID: 22367748. Pubmed Central PMCID: 3408070.

[86] Han MV, Zmasek CM. phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinform. 2009;10:356. PubMed PMID: 19860910. Pubmed Central PMCID: 2774328.

[87] Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25(9):1189–91. PubMed PMID: 19151095. Pubmed Central PMCID: 2672624.

### **Plant Bioinformatics**

### **Bioinformatics: A Way Forward to Explore "Plant Omics"**

Mehboob-ur-Rahman, Tayyaba Shaheen, Mahmood-ur-Rahman, Muhammad Atif Iqbal and Yusuf Zafar

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/64043

#### **Abstract**

Bioinformatics, a computer-assisted science aimed at managing huge volumes of genomic data, is an emerging discipline that combines the power of computers, mathematical algorithms, and statistical concepts to solve multiple genetic/biological puzzles. This science has progressed in parallel with the evolution of genome-sequencing tools, for example, the next-generation sequencing technologies, which made it possible to arrange and analyze the sequence information of large genomes. The synergism of "plant omics" and bioinformatics has set a firm foundation for deducing the ancestral karyotype of multiple plant families, predicting genes, etc. In addition, huge genomic datasets can be assembled to acquire maximum information from voluminous "omics" data. The science of bioinformatics is handicapped by a lack of appropriate computational procedures for assembling sequencing reads of the homologs occurring in complex genomes such as cotton (2n = 4x = 52) and wheat (2n = 6x = 42), and by a shortage of trained, multidisciplinary manpower. Moreover, the rapid expansion of sequencing data limits the capacity to acquire, store, distribute, and analyze the genomic information. In future, the invention of high-tech computational tools and skills, together with improved biological expertise, would provide better insight into genomes, and this information would be helpful in sustaining crop productivity on this planet.

**Keywords:** databases, data mining, comparative genomics, plant genomes, sequence analysis, structure prediction

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **1. Introduction**

Sustainability in agricultural systems is challenged by a number of factors, including human population growth, environmental change, and the tremendous worldwide demand for crops grown to produce biofuels [1, 2]. In this regard, exploring plant genomes to determine the function of important genes involved in conferring tolerance to biotic and abiotic stresses, followed by exploiting these genes in the development of resilient cultivars, is one of the durable strategies for bringing sustainability to crop yields [2, 3].

After the sequencing of the *Arabidopsis thaliana* genome, the National Science Foundation (NSF) launched a project to determine the function of 25,000 Arabidopsis genes [4]. Rice was the first crop to have its genome sequenced (International Rice Genome Sequencing Project 2005), followed by a number of other major crops. All these sequencing projects released large amounts of data. To arrange and analyze these data, a number of bioinformatic tools have been developed, which have helped in drawing important biological conclusions, predicting gene functions, etc. Furthermore, the development of unconventional mapping populations and online resources of molecular markers [4] helps researchers to identify quantitative trait loci (QTLs). A number of databases have been developed to handle the newly generated genomic data. These databases provide a foundation for building hypotheses, designing experiments, and inferring knowledge about a particular organism. Moreover, the datasets and "omics" resources of numerous species facilitate the comparison of "omics" properties among species, which in turn allows the study of conserved genes and evolutionary relationships. Bioinformatics is a crucial tool for accessing "omics" datasets and gathering substantial biological knowledge [5].

The major tasks of bioinformatics range from sequence analysis and the identification of genes to the clustering of related sequences and the study of evolutionary relationships using phylogenetics. They also include the identification and functional annotation of all genes, proteins, and active sites of protein structures in the cell [6]. At present, with the advancement of next-generation sequencing (NGS) tools, voluminous sequencing data are emerging. To deduce meaningful information from these data, the science of bioinformatics must coevolve with the genomic tools. In this regard, the three main components upon which the whole citadel of bioinformatics is based, namely mathematics, computer science, and biology, should evolve in parallel with the sequencing tools. This would pave the way for deducing useful information (phylogenies, syntenic relationships, and predicted genes and their functions) from the data in the shortest possible time [6, 7].

### **2. Databases**

Databases are collections of organized data that can easily be retrieved from a website to address different queries. Managing and handling a database requires dedicated computer hardware and software. The data are organized in structured records that allow easy retrieval of information. Broadly, biological databases are classified into sequence databases, which hold protein and nucleic acid sequences, and structure databases, which are relevant only to proteins. The first database was developed shortly after the sequencing of the insulin protein in 1956. The "Protein Data Bank", established in 1971, was among the earliest biological databases. Biological databases have flourished enormously owing to the huge amounts of data generated every day [8]. Individual laboratories maintained the earliest protein sequence databases; later, a combined formal database, the SWISS-PROT protein sequence database, was introduced in 1986. A plethora of data resources is now available for study and research purposes, online and, on request, on CD-ROM, and these are constantly updated as new data become available [9].

Biological databases generally offer software tools for analyzing the data they hold and for comparing new data with existing data. With the help of these computational methods, some laborious and costly "wet lab" work can be avoided. Future progress, however, faces hindrances such as limited awareness of the available data, complications in data retrieval, the limited availability of data analysis tools, and inadequate access to literature references [10]. The many available biological databases can be divided into three categories on the basis of their contents: (1) primary databases, which contain raw nucleotide sequences (GenBank, EMBL, and DDBJ); (2) secondary databases, which contain highly annotated data (SWISS-PROT and Protein Information Resource); and (3) specialized databases, which deal with a particular organism or a unique type of data (FlyBase, WormBase, and TAIR). A major problem in interlinking these databases is the lack of format compatibility. This problem can be addressed by using a common interface specification known as the Common Object Request Broker Architecture (CORBA) [11].

At the National Center for Biotechnology Information (NCBI), text-based search and retrieval of information can be undertaken with Entrez, which covers all NCBI databases, for example, PubMed, Nucleotide and Protein Sequences, and Complete Genomes. In the sequence retrieval system (SRS), Boolean operators are used for complex searches; SRS is also used for sequence retrieval and for searching abstracts, references, etc.
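As a rough illustration of a text-based query that combines terms with Boolean operators, the sketch below builds a request URL for the NCBI E-utilities ESearch service, the programmatic counterpart of an Entrez search. It only constructs the URL and does not contact the server. The helper name `build_esearch_url`, the example search terms, and the choice of field tags are illustrative assumptions, not something prescribed in this chapter.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, terms, operator="AND", retmax=20):
    """Return an ESearch URL joining field-tagged terms with one Boolean operator.

    `build_esearch_url` is a hypothetical helper written for this example.
    """
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    query = f" {operator} ".join(terms)
    params = {"db": database, "term": query, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example: nucleotide records from rice whose title mentions WRKY (assumed terms).
url = build_esearch_url("nucleotide", ["Oryza sativa[Organism]", "WRKY[Title]"])
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; the search terms can be swapped for others, and `operator` changed to `OR` or `NOT` to broaden or narrow the query.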

### **2.1. Dedicated databases for plant genomics**


A number of databases hold datasets focused on particular genes and transcription factors (TFs) related to plant-specific issues and cellular processes. Initially, a genome-wide repertoire of TFs encoded in the Arabidopsis genome was described [12]. The accessibility of complete genome sequences in the last few years has made it possible to assemble catalogs of TFs based on their function and their association with regulatory systems in different plant species. Numerous databases provide datasets of genes putatively encoding TFs. These databases are based on predictions made by computational methods (sequence similarity searches and hidden Markov model (HMM) searches for conserved DNA-binding domains). In recent years, GRASSIUS was established to compile resources and tools for the comparative genomics of regulatory sequences in grass species [13]. The Grass Transcription Factor Database (GrassTFDB), a component database of GRASSIUS, combines sequence information from RiceTFDB, MaizeTFDB, CaneTFDB, and SorghumTFDB, which can be searched through a website. Information on predicted TF-coding genes, obtained by annotation across the genome sequences of three legumes, is available in LegumeTFDB [14], an extension of SoybeanTFDB.

The PGSB PlantsDB database framework has been enhanced with new tools, and a considerable amount of new data has been added, particularly for the large, complex genomes of wheat, barley, and rye. New resources have been established, such as GenomeZipper and CrowsNest for comparative analysis and the RNASeq Expression Browser. The transPLANT project provides a platform for compiling heterogeneous plant genome data, for example, through integrated searches over multiple databases (**Table 1**).


| Database | URL | Species |
|---|---|---|
| RARTF | http://rarge.gsc.riken.jp/rartf/ | Arabidopsis |
| AGRIS, AtTFDB | http://arabidopsis.med.ohio-state.edu/AtTFDB/ | Arabidopsis |
| DATF | http://datf.cbi.pku.edu.cn/ | Arabidopsis |
| STIFDB | http://caps.ncbs.res.in/stifdb2/ | Arabidopsis and rice |
| DRTF | http://drtf.cbi.pku.edu.cn/ | Rice |
| DPTF | http://dptf.cbi.pku.edu.cn/ | Poplar |
| TOBFAC | http://compsysbio.achs.virginia.edu/tobfac/ | Tobacco |
| SoybeanTFDB | http://soybeantfdb.psc.riken.jp/ | Soybean |
| PlantTFDB | http://planttfdb.cbi.pku.edu.cn/ | 22 plant species |
| PlnTFDB | http://plntfdb.bio.uni-potsdam.de/v3.0/ | 20 plant species |
| GRASSIUS, GrassTFDB | http://grassius.org/grasstfdb.html | Maize, rice, sorghum, and sugarcane |
| LegumeTFDB | http://legumetfdb.psc.riken.jp/ | Soybean, *Lotus japonicus*, and *Medicago truncatula* |
| PlnTFDB | http://plntfdb.bio.uni-potsdam.de/v3.0/ | Plant species |
| PlantTFDB | http://planttfdb.cbi.pku.edu.cn/ | 83 species |
| DBD | http://dbd.mrc-lmb.cam.ac.uk/DBD/index.cgi?Home | 700 species |
| PGSB | http://pgsb.helmholtz-muenchen.de/plant/index.jsp | Barley, wheat, and rye |

**Table 1.** Databases that can be exploited for transcription factor studies in plants.

### **3. Analysis of the "omic" data**

### **3.1. Sequence retrieval**

The first step is the identification and retrieval of sequences from the different databases (NCBI, TAIR, Gramene, Rap-db, TIGR, Phytozome, PlantGDB, UniProt, and SwissProt) developed for handling protein, DNA, RNA, and expressed sequence tag (EST) sequences (**Table 2**). Sequences can be retrieved not only with query words but also with BLAST and/or their specific accession numbers. To find similar sequences in a database, the BLAST variant suited to the type of query and database sequence can be used.


**Table 2.** Databases which are helpful in studying regulatory elements and promoter sequences of a gene.
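Once sequences have been retrieved, they are commonly handled as FASTA text and looked up by accession number. The following minimal sketch, which assumes plain FASTA input and uses made-up accessions (`SEQ_A`, `SEQ_B`) and a hypothetical helper name `parse_fasta`, shows one way to index retrieved records by accession.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping accession -> sequence.

    The accession is taken as the first whitespace-delimited token of each
    header line; this is a simplifying assumption for the example.
    """
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            records[accession] = []
        elif accession is not None:
            records[accession].append(line)  # sequence may span several lines
    return {acc: "".join(chunks) for acc, chunks in records.items()}


# Hypothetical records standing in for sequences downloaded from a database.
fasta_text = """>SEQ_A hypothetical example record
ATGGCC
TTAA
>SEQ_B another hypothetical record
GGGCCC
"""
records = parse_fasta(fasta_text)
print(records["SEQ_A"])
```

For production work, an established parser such as the one in Biopython would normally be preferred; the sketch is only meant to make the record structure explicit.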

### **3.2. Multiple sequence alignment**

Multiple sequence alignment (MSA) aligns three or more biological sequences, which may be DNA, RNA, or protein. Its primary purpose is to study the similarity among sequences, which helps to assess their evolutionary relationships and common ancestry. It can be carried out with many sequence analysis tools, including but not limited to the ClustalW online software [15], ProbCons, and MAFFT [16]. Other MSA tools include DNAMAN, T-Coffee, M-Coffee, R-Coffee, Expresso, PSI-Coffee, PSAlign, PRRN, MUSCLE, POA, and MEME.

A number of algorithms are available for generating MSAs of protein and DNA sequences. The basic approach is to optimize the sum-of-pairs (SP) score, that is, the sum of the pairwise alignment scores over every pair of sequences in the alignment. This approach is practical and produces high-quality MSAs [17]. The mathematical approach (also called probabilistic or stochastic methods) exploits probability in building the MSA. The hidden Markov model is a classic example: the MSA is represented as a probabilistic model in which all possible combinations of gaps, mismatches, and matches are assigned probabilities, and the algorithm finds the most likely MSA [18]. Other approaches are genetic algorithms and simulated annealing, which break a series of possible MSAs into segments and then rearrange them. They can also take an existing MSA and refine it through a series of rearrangements [19].
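To make the sum-of-pairs idea concrete, the sketch below scores a small, already aligned set of sequences column by column. The scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0) and the function names are assumptions chosen for illustration; real aligners use substitution matrices and more elaborate gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols under a simple, assumed scheme."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no penalty here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Return the SP score: pairwise scores summed over all columns and pairs."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total


# A toy alignment of three sequences (made up for the example).
alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(alignment))  # prints 3
```

An alignment program that optimizes the SP score searches for the placement of gaps that maximizes this total; the function above only evaluates a given alignment.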

### **3.3. Domain and motif study**

A domain is a conserved part of a protein sequence and structure that can evolve, function, and exist independently, whereas a motif is a conserved sequence of protein or DNA that is maintained in order to carry out a certain function [20]. To characterize a gene, it is always advisable to study its functional domains and motifs. Newly identified sequences can be analyzed for domains and motifs to predict their functions. For motif analysis, the MEME tool can be used, while for domain analysis, PFAM, InterProScan, and SMART can be used.
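As a simple illustration of scanning a sequence for a known motif, the sketch below translates a PROSITE-style pattern into a regular expression and reports every match, including overlapping ones. It is not how MEME works (MEME discovers motifs statistically); it only shows pattern-based motif lookup. The pattern used is the widely cited N-glycosylation consensus `N-{P}-[ST]-{P}`, while the protein sequence and the helper names are made up for the example, and only a small subset of the PROSITE syntax is handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supported elements (a deliberately small subset): single residues,
    x (any residue), [ABC] (allowed set), {ABC} (forbidden set), and
    repeat counts such as x(2) or x(2,4).
    """
    tokens = []
    for element in pattern.strip(".").split("-"):
        core, _, repeat = element.partition("(")
        if core == "x":
            token = "."
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"
        else:
            token = core  # a single residue or an [ABC] class
        if repeat:
            token += "{" + repeat.rstrip(")") + "}"
        tokens.append(token)
    return "".join(tokens)


def find_motif(sequence, pattern):
    """Return (1-based position, matched text) for every, possibly overlapping, hit."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical protein fragment used only to exercise the functions.
hits = find_motif("MKNGSANNTSP", "N-{P}-[ST]-{P}")
print(hits)  # prints [(3, 'NGSA'), (7, 'NNTS')]
```

The lookahead group makes the search report overlapping occurrences, which a plain `finditer` on the pattern would skip.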

Large protein molecules comprise structural and functional domains. Structural domains are either compact, globular modules or regions clearly separated from the flanking regions, such as membrane regions or long coiled-coil helices that separate the other domains [21]. These domains can be seen in proteins as semi-independent three-dimensional (3D) units that are able to fold independently [22]. They constitute the "units of evolution" [23] and have typical functions [24]. The Structural Classification of Proteins (SCOP) database has been used extensively for assigning domains in proteins [25]. Most databases and methods (e.g., the Class Architecture Topology Homology database) are not fully automated and combine several methods for assigning domains to proteins [26]. Protein Informatics System for Modeling (PrISM) is the only completely automated method that can be used to assign sequence-continuous domains to proteins of known 3D structure [27]. If the 3D structure of the protein is not known, a number of alternative methods and databases are available; one of the most prominent is the putative protein domains database (ProDom) [28].

### **3.4. Structural analysis**

**3.2. Multiple sequence alignment**

208 Bioinformatics - Updated Features and Applications

**3.3. Domain and motif study**

tools can be used.

(ProDom) [28].

etc.

MODELLER is used to generate 3D structures [29], and the LOMETS server is used to find the best template for comparative modeling. The DOPE (discrete optimized protein energy) score helps to select the best model by assigning an energy value to each structure, and model quality is further evaluated with PROSA II [30] and PROCHECK [31]. APBS [32] is used to calculate the electrostatic surface and solvation properties of complex compounds. For structure alignment, the PDBsum tool [33] is deployed. Gene structure can be displayed with GSDS 2.0 (Gene Structure Display Server) [34]. YASARA is used to draw the 3D structure, C-terminus, N-terminus, and domains of proteins [35]. The chromosomal position of genes can be located with the NCBI Map Viewer, MapChart 2.1, and the map viewer of the cucumber genome database.

### **3.5. Analysis of regulatory elements**

Regulatory genes encode proteins that bind to the promoter or operator region of a gene to up- and/or downregulate its expression. For instance, catabolite activator protein (CAP) is a regulatory protein in prokaryotes that regulates the lac operon [36].

Gene expression is regulated at the transcription level by proteins known as transcription factors, which bind specific DNA sequences and inhibit or initiate transcription. These factors can act as repressors, activators, or both. Repressors inhibit the binding of RNA polymerase to the promoter, thus blocking transcription, whereas activators enable or enhance the binding of RNA polymerase to the promoter.

These elements can be found *in silico* with PlantCARE [37] and PLACE. PLACE is a repository of motifs found in cis-acting regulatory DNA elements of plants. It also gives information about variations of these motifs in different genes or plant species, together with the relevant literature and a description of each motif. Several research groups have identified a number of genes, including WRKY genes, ascorbate peroxidase, and PSY, using different bioinformatics tools [38–40].
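As an illustration of what such a cis-element search involves, the following minimal Python sketch scans both strands of a promoter for motifs written as regular expressions. The promoter sequence and the two motif patterns are illustrative assumptions chosen for this example, not entries retrieved from PlantCARE or PLACE, and the function names are hypothetical.

```python
import re

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan_motifs(promoter, motifs):
    """Report (motif name, strand, start, matched text) for every hit.

    Positions on the "-" strand are given in reverse-complement coordinates.
    A lookahead is used so that overlapping occurrences are also found.
    """
    seq = promoter.upper()
    hits = []
    for name, pattern in motifs.items():
        for strand, s in (("+", seq), ("-", reverse_complement(seq))):
            for m in re.finditer(f"(?=({pattern}))", s):
                hits.append((name, strand, m.start(), m.group(1)))
    return hits

# Illustrative input (assumption): a made-up promoter fragment and two
# commonly cited core motifs written as regular expressions.
promoter = "GGTTGACCAATATAAAGG"
motifs = {"W-box": "TTGAC[CT]", "TATA-box": "TATAAA"}
hits = scan_motifs(promoter, motifs)
for hit in hits:
    print(hit)
```

Dedicated databases add curated annotations and literature links to each hit, which a plain pattern scan like this one does not provide.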

### **3.6. Mutation identification**

A mutation alters the nucleotide sequence of a gene and may change its expression. Mutations can be identified using conventional as well as NGS tools [41, 42]. For example, NGS was used in Arabidopsis to sequence the cytosine methylome (methylC-seq), transcriptome (RNA-seq), and small RNA transcriptome (small RNA-seq); genome-scale methylation patterns and a direct relationship between the location of sRNAs and DNA methylation were identified [43]. Protein-protein interactions occur in the majority of cellular processes. The interactome, the complete set of protein-protein interactions, is vital for studying molecular networks [44]. Correlated mutation analysis, which detects pairs of positions that mutate together, can be used to predict interface residues and thus to study protein-protein interactions [45].
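One simple way to quantify correlated mutations is the mutual information between two alignment columns. The sketch below is a minimal illustration under that assumption; it is not the specific method of Ref. [45], and the toy alignment and function names are invented for the example.

```python
import math
from collections import Counter

def column(alignment, i):
    """Extract column i from a list of equal-length aligned sequences."""
    return [seq[i] for seq in alignment]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Toy alignment (assumption): columns 0 and 1 change together,
# while column 3 varies independently of column 0.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
mi_01 = mutual_information(column(alignment, 0), column(alignment, 1))
mi_03 = mutual_information(column(alignment, 0), column(alignment, 3))
print(mi_01, mi_03)
```

High mutual information between a column of one protein and a column of its partner suggests coevolving positions, which are candidates for interface residues. Real analyses usually correct for phylogenetic bias and small sample size, which this sketch omits.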

**Figure 1.** Flow chart diagram for protein structure prediction **(Source: Ref. [117]).**



### **3.7. Protein structure prediction**

Protein structure prediction is the inference of the 3D structure of a protein from its amino acid sequence. The structure can be predicted by undertaking similarity searches, MSAs, secondary structure prediction, domain identification, solvent accessibility prediction, protein fold recognition, 3D model building, and model validation [46]. For example, small heat shock proteins (smHSPs) are ubiquitous in nature and particularly abundant in plants, and they range in size from 17 to 30 kDa. These proteins are encoded by six nuclear gene families, each encoding proteins localized to a particular cellular compartment, such as the cytosol, mitochondria, chloroplast, or endoplasmic reticulum. They protect plants from high temperature stress [47].

### *3.7.1. Protein structure prediction steps*

The flow chart in **Figure 1** outlines the process of protein 3D structure prediction using bioinformatics tools. Various online and offline resources that can be used for protein structure prediction are listed in **Table 3**.

**Table 3.** Bioinformatics tools which are helpful in predicting protein structure.

| S. No. | Software/server | Link | Description |
|---|---|---|---|
| 1. | MODELLER | http://salilab.org/modeller/ | Comparative modeling of protein 3D structures |
| 2. | 3DJigsaw | http://bmm.cancerresearchuk.org/~3djigsaw/ | Predict structure and function of protein |
| 3. | ESyPred3D | http://www.unamur.be/sciences/biologie/urbm/bioinfo/easypred/ | Homology modeling with increased alignment performance |
| 4. | SWISS-MODEL | http://swissmodel.expasy.org/ | Automated protein homology modeling server |
| 5. | YASARA | http://www.yasara.org/ | Molecular modeling tool |
| 6. | RaptorX | http://raptorx.uchicago.edu/ | Protein structure prediction |
| 7. | HHPred | http://toolkit.tuebingen.mpg.de/hhpred | Homology detection and structure prediction server |
| 8. | Phyre2 | http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index | 3D structure prediction |
| 9. | ROSETTA | http://boinc.bakerlab.org/resetta/ | 3D structure prediction |
| 10. | I-TASSER | http://zhanglab.ccmb.med.umich.edu/I-TASSER/ | Predict structure and function of protein |
| 11. | Bhageerath | http://www.scfbio-iitd.res.in/bhageerath/index.jsp | Energy-based protein structure prediction server |

### **3.8. Phylogenetic analysis**

Phylogenetic analysis is the study of evolutionary relationships among organisms, which are usually presented in a branching, tree-like form. In cladistics, organisms are grouped into clades, each consisting of a single ancestor and all of its descendants (**Figure 2**); it is a specific methodology for hypothesizing evolutionary relationships [48]. Different methods are used to construct a phylogenetic tree, depending on the nature of the data and the algorithms used, and each method rests on certain assumptions. A method that performs well on one kind of dataset may therefore not perform equally well on another. It is thus suggested to run both distance-based methods [unweighted pair group method with arithmetic mean (UPGMA) and neighbor joining (NJ)] and character-based (CB) methods [maximum parsimony (MP) and maximum likelihood (ML)].

### *3.8.1. Distance-based method*

The distance-based method, also called the phenetic method, derives a tree from the extent of dissimilarity (the distance) between pairs of aligned sequences. It can reconstruct the true tree if all genetic divergence events are accurately recorded in the sequences. Tree construction can be based on genetic distances computed from sequence data, on distances from immunological studies, or on Euclidean distances applied in various ways [49].
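To make the notion of a genetic distance concrete, the sketch below computes the uncorrected proportion of differing sites (p-distance) for every pair of aligned sequences and, for contrast, the Jukes-Cantor correction for multiple substitutions at the same site. The sequences and function names are assumptions made for the example; the text above does not prescribe a particular distance measure.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, ignoring positions with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy aligned sequences (assumption).
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
dist = {(i, j): p_distance(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}
for pair, p in dist.items():
    print(pair, p, round(jc69_distance(p), 4))
```

The corrected distance is always at least as large as the raw p-distance, because it accounts for substitutions that are hidden by later changes at the same site.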

### *3.8.1.1. Unweighted pair group method arithmetic mean (UPGMA)*

UPGMA is the simplest procedure for studying phylogenetic relationships among organisms; it uses a clustering approach and uncorrected data to build a tree. At each step it joins the pair of clusters with the greatest similarity and averages the distances of the joined pair to all other clusters. The distance matrix is then recalculated, and the procedure is repeated until all operational taxonomic units (OTUs) are grouped in one cluster. UPGMA generates a correct topology with true branch lengths only when divergence is proportional to time (a molecular clock) or approximately equal to raw sequence dissimilarity [50]. These conditions are rarely met in practice, so the resulting tree may not reflect the true evolutionary descent.
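The clustering loop described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a symmetric distance matrix given as a list of lists; the four-taxon matrix and the function name `upgma` are invented for the example, and ties are broken by dictionary order.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA.

    Returns the tree as nested tuples and a list of merges
    (left subtree, right subtree, height of the new node).
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        d_ij = dist.pop((i, j))
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # size-weighted average = arithmetic mean over all member pairs
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        merges.append((tree_i, tree_j, d_ij / 2))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges

# Toy ultrametric distances (assumption).
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
tree, merges = upgma(names, matrix)
print(tree)
print([height for _, _, height in merges])
```

Because the toy distances are ultrametric (clock-like), the node heights recovered here are consistent with every pairwise distance; with non-clock-like data the same procedure still returns a tree, but its topology may be wrong.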

**Figure 2.** A descriptive diagram of phylogenetic analysis based on biological data.

### *3.8.1.2. Neighbor joining method (NJ)*

This method is commonly applied in distance-based tree building, irrespective of the optimization criterion. Starting from a star-like tree, it finds at each stage of clustering the pair of operational taxonomic units (OTUs) that minimizes the total branch length. Branch lengths and the distance matrix are recalculated until the tree is fully resolved. The method quickly yields the branch lengths as well as the topology of a parsimonious tree [51]. It is more efficient than UPGMA and can analyze large datasets. Its major drawbacks are that it produces only one possible tree and that this tree may be biased.
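A single joining step of the standard neighbor-joining procedure can be sketched as follows. The sketch assumes the usual Q-criterion, Q(i, j) = (n − 2)·d(i, j) − r(i) − r(j), where r is the row sum of the distance matrix; the five-taxon matrix and the function name `nj_step` are chosen for illustration only.

```python
from itertools import combinations

def nj_step(d):
    """Perform one neighbor-joining step on a dict-of-dicts distance matrix.

    Returns the joined pair, the two branch lengths to the new node,
    and the reduced distance matrix in which the pair is replaced by one node.
    Requires at least three nodes.
    """
    nodes = list(d)
    n = len(nodes)
    r = {i: sum(d[i].values()) for i in nodes}                  # row sums
    q = {(i, j): (n - 2) * d[i][j] - r[i] - r[j]
         for i, j in combinations(nodes, 2)}
    i, j = min(q, key=q.get)                                    # pair minimizing Q
    len_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    u = (i, j)                                                  # label of the new node
    rest = [k for k in nodes if k not in (i, j)]
    reduced = {k: {m: d[k][m] for m in rest if m != k} for k in rest}
    reduced[u] = {}
    for k in rest:
        d_uk = (d[i][k] + d[j][k] - d[i][j]) / 2
        reduced[u][k] = d_uk
        reduced[k][u] = d_uk
    return (i, j), (len_i, len_j), reduced

# Toy additive distances among five taxa (assumption).
names = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
d = {ni: {nj: matrix[i][j] for j, nj in enumerate(names) if j != i}
     for i, ni in enumerate(names)}
pair, lengths, reduced = nj_step(d)
print(pair, lengths)
print(reduced[pair])
```

Repeating this step on the reduced matrix until three nodes remain yields the full tree. Unlike UPGMA, the two branches leading to the new node may differ in length, so no molecular clock is assumed.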

### *3.8.2. Character-based methods (CB)*

These methods, also called cladistic methods, use the aligned characters, for instance DNA or protein sequences, directly during tree inference. A character-based algorithm takes an aligned set of characters and builds a tree that accounts for the changes in discrete characters required to produce the observed set. These methods assume that the sequences descended from a common ancestor and changed by mutation and/or selection, without any hybridization or horizontal gene transfer. Character-based algorithms fall into two groups: maximum likelihood and maximum parsimony [50].

### *3.8.2.1. Maximum likelihood (ML)*

Maximum likelihood uses statistical models to assess hypotheses of evolutionary history. Starting from a multiple alignment, it considers all possible trees for the given data. The probability of each possible topology is estimated for each data partition, and the tree with the highest likelihood across all partitions is identified. Because the whole sequence information is used to evaluate all possible trees, the method is computationally demanding and cannot handle a large amount of data.
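The likelihood principle can be illustrated on the smallest possible case: estimating the distance between two sequences. The sketch below assumes the Jukes-Cantor substitution model and finds, by a simple grid search, the distance that maximizes the likelihood of the observed number of differences. It is a toy example under these assumptions, not a tree search; the site counts are invented.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d (substitutions/site) under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e    # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * e    # a site shows one specific different base
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))

# Toy data (assumption): 10 differences observed over 100 aligned sites.
n_sites, n_diff = 100, 10
grid = [i / 1000 for i in range(1, 1001)]
d_grid = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Closed-form maximum-likelihood estimate under the same model.
p = n_diff / n_sites
d_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(d_grid, round(d_closed, 4))
```

In a full ML tree search, the same kind of likelihood is computed over every branch of every candidate topology, which is what makes the method expensive.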

### *3.8.2.2. Maximum parsimony (MP)*

This method follows the philosophy that "the simpler hypothesis is better than the complicated ones" [52]. By this criterion, the MP tree is the one that requires the fewest character-state transformations to derive all the sequences from a common ancestor; that is, it selects the trees that minimize the total tree length. All possible trees are evaluated for each site in the alignment, which is not characteristic of other methods. MP depends less on assumptions about sequence evolution than other tree-building strategies, but it performs poorly when the data are heterogeneous.
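Tree length under parsimony can be computed for a given topology with Fitch's algorithm, which counts the minimum number of state changes site by site. The sketch below assumes rooted binary trees written as nested tuples and compares two candidate topologies on a toy alignment; the data and function names are assumptions for the example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):                 # leaf: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                                # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_length(tree, alignment):
    """Total tree length: sum of minimum changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Toy alignment and two candidate topologies (assumption).
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
len_1 = parsimony_length(tree_1, alignment)
len_2 = parsimony_length(tree_2, alignment)
print(len_1, len_2)
```

The topology with the smaller length is preferred. A full MP analysis repeats this scoring over many or all candidate topologies.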

### *3.8.3. Evaluation of trees*

The reliability of the branches/clades of a phylogenetic tree can be evaluated statistically using (1) the skewness test, (2) bootstrap analysis, and (3) likelihood ratio tests, all of which are implemented in current software. The skewness test does not assess the reliability of a specific topology; it is sensitive to very small amounts of signal present in an otherwise random dataset. Bootstrap analysis is a resampling-based tree evaluation method that works with distance, likelihood, and parsimony methods. Its outcome is a number associated with each branch of the phylogenetic tree that gives the proportion of bootstrap replicates supporting the monophyly of the corresponding clade. Likelihood ratio tests are readily applicable to ML (maximum likelihood) analyses: the likelihood ratio is evaluated for significance against a normal distribution of error under the optimal model [50].
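The resampling idea behind bootstrap analysis can be sketched as follows. Alignment columns are drawn with replacement to build pseudo-alignments, and the support for a grouping is the fraction of replicates that recover it. To stay short, the sketch uses "the closest pair by p-distance" as a stand-in for a full tree-building step; this simplification, the toy alignment, and the function names are all assumptions.

```python
import random
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_pair(alignment):
    """Pair of taxa with the smallest distance (ties go to the first pair)."""
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]))

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def bootstrap_support(alignment, pair, replicates=200, seed=1):
    """Fraction of replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == pair
               for _ in range(replicates))
    return hits / replicates

# Toy alignment (assumption): A and B differ at one site, C at many.
alignment = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAT", "C": "TTTTTTTTAA"}
support = bootstrap_support(alignment, ("A", "B"))
print(support)
```

In practice the stand-in step is replaced by the actual tree-building method (distance, likelihood, or parsimony), and support is recorded for every clade of the tree.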

### *3.8.4. Software mostly used for phylogenetic analysis*

The Phylogeny Inference Package (PHYLIP) [53] contains 30 programs that cover the main aspects of phylogenetic analysis. It is freely available and runs on almost all computer platforms (Mac, UNIX, DOS, etc.). In addition, phylogenetic analysis using parsimony (PAUP) is widely used to infer and interpret evolutionary trees; the upgraded version (PAUP\*) also includes maximum likelihood and distance methods. Other phylogenetic programs have unique capabilities but are mostly limited in scope and portability. These include molecular phylogenetics (MOLPHY) [54], TREE-PUZZLE [55], FastDNAml [56], and MACCLADE [57].

### **3.9. Molecular dynamics simulations for plant molecules**

Molecular dynamics simulations are among the principal methods for studying the physical basis of the structure, function, and interactions of biological macromolecules (e.g., proteins and nucleic acids). Proteins were once regarded as comparatively rigid structures; this view has been replaced by a dynamic model in which internal motions and conformational changes play a key role in determining function. Computer simulations help in understanding the properties and arrangements of molecules in terms of their physical structure and interactions, which may not be observable by other means. There are two major classes of simulation techniques: molecular dynamics and Monte Carlo. These simulations have been used extensively to characterize plant compounds (natural distillates) and subsequently to find optical counterparts with identical efficiency [58].
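At its core, a molecular dynamics simulation integrates Newton's equations of motion in small time steps. The sketch below shows the widely used velocity Verlet integrator on the simplest possible system, a single particle on a harmonic spring. The system, the parameter values, and the function names are assumptions made for illustration; real simulations of biomolecules use the same update rule with many atoms and an empirical force field.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in one dimension with the velocity Verlet scheme."""
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt     # update position
        a_new = force(x) / mass                # force at the new position
        v = v + 0.5 * (a + a_new) * dt         # update velocity with averaged acceleration
        a = a_new
    return x, v

# Toy system (assumption): unit mass on a spring with unit force constant.
k, mass = 1.0, 1.0

def spring_force(x):
    return -k * x

def total_energy(x, v):
    return 0.5 * mass * v * v + 0.5 * k * x * x

x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, spring_force, mass, dt=0.01, n_steps=1000)
print(x1, math.cos(10.0))
print(total_energy(x0, v0), total_energy(x1, v1))
```

The integrator keeps the total energy nearly constant over the run, which is the main reason this family of schemes is preferred for long trajectories.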

### **3.10. Proteomics and transcriptomics**

The study of proteins and of mRNA transcripts is referred to as proteomics and transcriptomics, respectively [59]. Because of the intrinsic complexity of the data, the experimental workflows, and the variety of data types, storage and open deposition of mass spectrometry (MS)-based proteomics data are still insufficiently established. Several public resources with particular purposes have been established for MS proteomics research to meet this need, including the Global Proteome Machine Database (GPMDB), PRIDE, PeptideAtlas, ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), and the PeptideAtlas SRM Experiment Library (PASSEL). Moreover, the ProteomeXchange consortium has been developed to improve the integration and harmonized sharing of data among these public repositories for the benefit of the scientific community [60].

For transcriptomics studies, numerous freely available databases hold microarray data, including NASCArrays, ArrayExpress, Genevestigator, the Stanford Microarray Database, and the Gene Expression Omnibus [61]. An example of a transcriptome database is the Chickpea Transcriptome Database (CTDB), which provides information on the tools used for transcriptome sequencing, conserved domains, molecular markers, transcription factor families, and complete gene expression data [62].

### **3.11. Protein-protein interactions**

Protein-protein interactions (PPIs) control an extensive range of biological processes, including cell-to-cell interactions and metabolic and developmental pathways. This noncovalent binding gives rise to a range of interactions and associations between proteins. PPIs can be classified in several ways depending on their structural and functional characteristics [63]. There are several *in vivo* and *in vitro* methods for finding PPIs, but our focus is on computational approaches. Computer modeling assisted by mathematical methods facilitates the study of different processes [64], and *in silico* methods based on computational modeling are used to study protein interactions. *In silico* analysis integrates multiple data types, including gene coexpression, colocalization, functional category, and the occurrence of orthologs or interologs, to derive a global network for a species [65]. A number of web servers can be used to predict protein-protein interactions (**Table 4**).
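The interolog idea mentioned above can be sketched in a few lines: if two proteins interact in a reference species and both have orthologs in a target species, the orthologs are predicted to interact. All protein identifiers and the function name below are hypothetical, and real pipelines combine this evidence with coexpression, colocalization, and functional annotation rather than using it alone.

```python
def predict_interologs(reference_ppis, orthologs):
    """Transfer interactions from a reference species to a target species.

    reference_ppis: iterable of (protein_a, protein_b) pairs in the reference species.
    orthologs: dict mapping a reference protein to a list of target-species orthologs.
    Returns a set of unordered target-species pairs (self-pairs are skipped).
    """
    predicted = set()
    for a, b in reference_ppis:
        for x in orthologs.get(a, ()):
            for y in orthologs.get(b, ()):
                if x != y:
                    predicted.add(frozenset((x, y)))
    return predicted

# Hypothetical identifiers (assumption): "Ref*" in the reference species,
# "Tgt*" in the target species.
reference_ppis = [("RefP1", "RefP2"), ("RefP2", "RefP3")]
orthologs = {"RefP1": ["TgtQ1"], "RefP2": ["TgtQ2a", "TgtQ2b"]}
predicted = predict_interologs(reference_ppis, orthologs)
for pair in sorted(tuple(sorted(p)) for p in predicted):
    print(pair)
```

Interactions involving a protein without a known ortholog, such as the hypothetical `RefP3` here, cannot be transferred, which is one reason predicted interactomes are incomplete.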


**Table 4.** A number of important computational tools used to study protein-protein interactions.

### *3.11.1. Arabidopsis protein interaction analysis*

More than 10 freely accessible protein interaction databases are available for *A. thaliana*. An interactive bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline), has been created to integrate Arabidopsis protein interaction databases. A total of 11 Arabidopsis protein interaction databases, comprising 201,699 protein interaction pairs, 15,208 identifiers, 89 interaction detection methods, 73 species that interact with Arabidopsis, and 6161 references, were integrated in ANAP [66].

### *3.11.2. Computational identification of protein-protein interactions in rice*

The complexity of plant molecules hinders progress in exploring protein-protein interaction networks on a large scale. A total of 5049 proteins with 76,585 interactions were predicted in rice in the Predicted Rice Interactome Network (PRIN). The expanded molecular network in PRIN has greatly improved the ability to analyze the function and organization of genes and gene networks [67].

### *3.11.3. iPlants: the world's plant online*

This database has been designed to develop a comprehensive working list of the scientific names of all plant species. It provides the authenticated names of plant species (as agreed by the scientific community) along with their synonyms. Such a list will enable untrained botanists to find useful information about different plant species, and *i*Plants will also help resolve confusions in the published taxonomies. The database covers about 422,000 known plant species and the 1,500,000–1,700,000 scientific names used to refer to them.

This database will help in exploiting plant biodiversity information in breeding as well as gene cloning programs [68].

### *3.11.4. Reactome*

The Reactome database provides unrestricted access to peer-reviewed pathways [69]. It is equipped with bioinformatics tools for examining, visualizing, interpreting, and analyzing pathway knowledge. The information in the database is generated by experts (curators and software developers) and cross-referenced to other databases, for example, NCBI, Ensembl, UniProt, UCSC Genome Browser, HapMap, KEGG, ChEBI, PubMed, and GO. Orthologous reactions for over 20 nonhuman species, including rice, Arabidopsis, and *Escherichia coli*, can also be found. The database can be accessed in the form of an online textbook [70], and biological pathways and reactions can be viewed in a number of formats, including PDF, SBML, and BioPAX [71]. Version "v55" of Reactome was released in December 2015.

### **3.12. Metabolomics**

The study of all or most metabolites in an organism is termed metabolomics. It is a complex research field that involves interdisciplinary interaction of different sciences. One of the numerous data analysis methods is soft independent modeling of class analogy (SIMCA). In addition, an effective protocol for data mining in metabolomics has been developed [72]. In recent years, numerous databases have been established that contain compound names and structures, mass spectra, metabolic pathways, metabolite profiles, and statistical/mathematical models; these databases are extremely useful for metabolomics research [73].

MeRy-B (http://bit.ly/meryb) is dedicated to plants and provides information on metabolites detected using nuclear magnetic resonance (NMR), together with related analytical and experimental metadata. It holds a list of many plant metabolites along with data on their experimental conditions, the features studied, and metabolite concentrations for 19 different species, including model plant species such as Arabidopsis [74].

### **4. Implications of bioinformatics in plant omics**

Bioinformatics is an essential part of omics, providing techniques to analyze large biological datasets and translate them into "omic" applications. Omics tools generate massive data that help systems biology combine multivariate information into systems and models. These tools, including high-throughput genome-scale genotyping platforms such as whole-genome resequencing, as well as proteomics and metabolomics, offer better prospects for gene identification and exploration of molecular mechanisms. This information can be used to develop ideal genotypes suitable for varying climatic conditions [75].

### **4.1. Plant genome sequencing**

With advances in high-throughput techniques, whole-genome characterization of a wide range of organisms has become possible. Nevertheless, the storage and management of these massive genomic data remain a major challenge. The revolution in sequencing technologies has made it possible to sequence large and complex genomes at extremely low cost and in much less time. Presently, the most popular methods of genome sequencing are shotgun sequencing and NGS. NGS is a very popular tool for the identification of housekeeping genes in crop plants. Many platforms, such as the Illumina/Solexa Genome Analyzer, the Applied Biosystems SOLiD System, and Roche/454 FLX, are commercially available for NGS [76]. NGS can be utilized for whole-genome sequencing, identification of transcription factor binding sites, expression profiling of noncoding RNA, and targeted resequencing [77]. Various software packages are available to assemble sequences, for example, Phred/Phrap/Consed [78], GAP4 [79], and Chromaseq [80]. Another package, AMOS, was developed by TIGR and is useful for comparative genome assembly [81].

### **4.2. Plant whole-genome resequencing**

Whole-genome resequencing is the most effective approach in functional genomics. To reduce cost, only a target region can be sequenced. Microarrays are a common means of target-region sequencing, based on hybridization to arrays of synthetic oligonucleotides that match the target DNA sequence [82]. Recent NGS technology has made it possible to discover differences between individuals and populations, especially in crop species whose genomes have already been sequenced and assembled. Such projects in Arabidopsis [83] and rice [84] have generated a large amount of data on natural variation among accessions.

### **4.3. Plant comparative genomics and databases**

Comparative genomic approaches have been used to assign functions to genes, especially those from less studied species. Developments in RNA interference and other technologies such as mutagenesis have enabled phenotypic screens for genes, an approach known as phenomics [85]. Phenomics depends heavily on the interaction of the plant genome with the prevailing environment, and on intensive collaboration among three disciplines: plant science, computer science, and engineering. Yearly plant-focused image-processing challenges [86] have stimulated the community and encouraged computer scientists to develop joint plant datasets, although access to high-throughput phenotyping platforms remains limited. A current list of accessible image datasets is available online [85].

### **4.4. Important information source of plant species**

The most widely used unified information resource is TAIR, which maintains molecular biology, genetic, and genomic data for Arabidopsis [87]. Similarly, the Salk Institute Genomic Analysis Laboratory (SIGnAL) supports omics research on Arabidopsis.

Gramene is an integrated information resource for grasses. It uses the rice genome sequence as a foundation for comparison with other members of the grass family [88]. The website provides DNA and mRNA sequences, genome assemblies and annotations, genes, genetic and physical maps, QTLs, and more. It is updated regularly with new features such as genetic diversity data and comparisons of the *Oryza sativa* genome with those of its wild relatives or other taxa for evolutionary studies [89].

The SoyBase portal [90] provides information about whole-genome sequence data. The Sol Genomics Network is the portal for Solanaceae genomes; it also provides information about the tomato genome sequencing project [91]. MaizeGDB is a public database for *Zea mays* [92]. GreenPhylDB is a broad platform intended to facilitate comparative functional genomics in the *O. sativa* and *A. thaliana* genomes [93]. PLAZA 3.0 makes plant comparative genomics data accessible through a user-friendly web interface; structural and functional annotation, phylogenetic trees, protein domains, gene families, and detailed data on genome organization can be queried and visualized [94]. PIECE is a comparative genomics database for gene structure comparison and evolution; it covers all annotated genes from 25 plant species [95].

### **4.5. Use of bioinformatics for comparative genomics in plants**

The availability of whole-genome sequences and bioinformatics tools has accelerated the identification of specific gene families in different plant species. These tools have also been used to study duplications and deletions in plant genomes [96]. The results are helpful in phylogenetic studies [97], in studying synteny and collinearity relationships, and in inferring shared ancestry of genes [98]. The Plant Genome Duplication Database (PGDD) provides data for studying intragenome and cross-genome syntenic relationships in sequenced species [99]. Analysis of orthologous clusters at the genome level is an important element of comparative genomics: recognizing overlap between orthologous clusters helps clarify the function and evolution of proteins across multiple species. OrthoVenn is a freely accessible web platform for comparing and annotating orthologous gene clusters [100]. Orthologs of plants and green algae can be searched in PlantOrDB [101].
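The notion of overlap between orthologous clusters can be illustrated with plain set operations. The following Python sketch is only illustrative: the cluster identifiers and their assignment to species are invented, and it does not reproduce how OrthoVenn or any other cited resource builds or compares clusters.

```python
from itertools import combinations

# Hypothetical data: for each species, the set of orthologous cluster IDs
# that contain at least one of its genes. All IDs are invented.
clusters_by_species = {
    "rice": {"OC1", "OC2", "OC3", "OC5"},
    "maize": {"OC1", "OC2", "OC4"},
    "arabidopsis": {"OC1", "OC3", "OC4", "OC6"},
}


def shared_clusters(data):
    """Return the clusters present in every species (the core set)."""
    return set.intersection(*data.values())


def species_specific(data):
    """Map each species to the clusters found in that species only."""
    result = {}
    for name, own in data.items():
        others = set().union(*(s for n, s in data.items() if n != name))
        result[name] = own - others
    return result


def pairwise_overlap(data):
    """Count the clusters shared by each pair of species."""
    return {
        (a, b): len(data[a] & data[b])
        for a, b in combinations(sorted(data), 2)
    }


core = shared_clusters(clusters_by_species)
unique = species_specific(clusters_by_species)
pairs = pairwise_overlap(clusters_by_species)
print("core:", core)
print("species-specific:", unique)
print("pairwise overlap:", pairs)
```

The core set, the species-specific sets, and the pairwise counts correspond to the regions of a Venn diagram drawn over the species.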

### **4.6. Gene prediction and genome annotation**

Gene prediction refers to the characterization of introns and exons in a sequenced genome. Predictions can be made computationally or by combining manual and computational annotation. Numerous programs for finding protein-coding genes are accessible through the OMICtools website [102], and they have been used extensively for genome annotation and gene prediction.

A number of software packages have been described for the structural annotation of a genome [102, 103]. Additionally, genome comparison tools such as SynBrowse and VISTA can be used to improve the precision of gene identification. RepeatMasker [104] was designed to find interspersed repeats and low-complexity sequences in a sequenced genome; the program can also mask the repetitive sequences it finds. Other programs, such as Repeat Finder and RECON, are also available for finding repeats in a sequenced genome.
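To make the idea of masking concrete, the toy Python sketch below soft-masks (lower-cases) long single-base runs and then hard-masks them with N. It is an assumption-laden illustration of what masking produces, not the method used by RepeatMasker, which relies on comparison against repeat libraries; the function names and the `min_run` threshold are hypothetical.

```python
import re


def soft_mask_homopolymers(seq, min_run=6):
    """Lower-case every run of a single base that is at least min_run long."""
    # ([ACGT]) captures one base; \1{n,} requires n or more further copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).lower(), seq.upper())


def hard_mask(soft_masked):
    """Replace soft-masked (lower-case) bases with N."""
    return "".join("N" if base.islower() else base for base in soft_masked)


# Invented example sequence: two low-complexity runs inside unique sequence.
example = "ACGT" + "A" * 8 + "CG" + "T" * 7 + "GCA"
masked = soft_mask_homopolymers(example)
print(masked)
print(hard_mask(masked))
```

Soft-masking keeps the original bases recoverable, whereas hard-masking discards them; downstream gene finders typically accept one or both forms.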

### **4.7. Genome mapping and bioinformatics**

Selecting a suitable mapping tool and sequence search may require adjusting the specificity and sensitivity of the search statistics. The search for candidate genes underlying traits can be accelerated in crops for which genetic and physical maps and annotated genome assemblies are available. A wide range of tools has recently been developed for displaying maps and visualizing genomes, primarily to facilitate genome assembly.

NCBI is a source for accessing all types of genome information. Its biological databases can be searched through Entrez. For aligned genetic, physical, and sequence information of eukaryotes, including plants, the genome browser Map Viewer has been developed. A dedicated plant query page displays aligned maps from the species available in Map Viewer. A customized plant basic local alignment search tool (Plant BLAST) supports sequence similarity searches against mapped plant sequence data, and the resulting alignments can be visualized in their genomic context using Map Viewer [105]. Software such as R/qtl [106], JoinMap [107], OneMap [108], MSTMap [109], Lep-MAP [110], and HighMap [111] can be used to develop genetic linkage maps [112].

Numerous databases offer data for exploring markers in multiple crop species. DNA markers, including single nucleotide polymorphism (SNP), simple sequence repeat (SSR), and conserved ortholog set (COS) markers, can be predicted using PlantMarkers [113]. GrainGenes is a well-known resource for Triticeae genomes that contains linkage maps and DNA markers of wheat, rye, barley, and oat [114]. Gramene, a database for comparative genomics, contains genetic maps of multiple plant species [89]. The Triticeae Mapped EST database (TriMEDB) provides information on mapped cDNA markers for barley and wheat [115]. CottonGen is a web-based database that provides open access to genetic, genomic, and breeding data for cotton, with improved tools for sharing, mining, retrieving, and visualizing data compared with CottonDB and the Cotton Marker Database [116].
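As a rough illustration of what predicting SSR markers involves, the Python sketch below scans a sequence for perfect tandem repeats of di- and trinucleotide motifs using a regular expression. It is a minimal sketch under stated assumptions: the function name, thresholds, and example sequence are invented, and it is not the procedure implemented by PlantMarkers or any other cited database.

```python
import re


def find_ssrs(seq, min_repeats=4, motif_lengths=(2, 3)):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k in motif_lengths:
        # A k-base motif followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymers such as AA or TTT
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Invented example: an (AT)5 and a (GAA)4 repeat embedded in flanking bases.
example = "GGC" + "AT" * 5 + "CCG" + "GAA" * 4 + "TC"
ssrs = find_ssrs(example)
for start, motif, count in ssrs:
    print(f"position {start}: ({motif}){count}")
```

In practice, the flanking sequence around each hit would then be used for primer design, and imperfect or compound repeats would need additional handling.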

### **5. Conclusions**

In this chapter, we described the available bioinformatics resources and tools for gene expression analysis, databases, protein and metabolite analyses, and genome sequencing. Bioinformatics has evolved rapidly over the last 15 years and has emerged as a key discipline of biology. Next-generation sequencing technologies have generated huge amounts of genetic and genomic data. However, extracting useful genetic information from these data is hampered by a shortage of skilled bioinformaticians. Several problems in bioinformatics remain unsolved, such as automated data mining, robust inference of phenotype from genotype, and the training of students and established researchers in bioinformatics. Bioinformatics is creating job opportunities for skilled researchers in biology, statistics, and computer science. Its remarkable evolution has been accompanied by a number of disruptive revolutions in science and technology, and the field has developed almost beyond recognition. Today, bioinformatics is a great asset to biological scientists, who generate huge amounts of data in all fields of the biological sciences. In the near future, bioinformatics will be an indispensable part of plant research, and novel tools and methods will be adopted by every plant scientist. The next half century will be the era of "data integration." Both basic and applied research will serve society by providing renewable energy, reducing world hunger and poverty, and protecting the environment.

### **Author details**

Mehboob-ur-Rahman<sup>1</sup>\*, Tayyaba Shaheen<sup>2</sup>, Mahmood-ur-Rahman<sup>2</sup>, Muhammad Atif Iqbal<sup>1</sup> and Yusuf Zafar<sup>3</sup>

\*Address all correspondence to: mehboob\_pbd@yahoo.com and mehboob@nibge.org

1 National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan

2 Department of Bioinformatics and Biotechnology, Government College University, Faisalabad, Pakistan

3 Department of Technical Co-operation, IAEA, Vienna International Centre, Vienna, Austria

### **References**


[9] Babu MM. Biological databases and protein sequence analysis [Internet]. 1997. Available from: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pdfs/biodbseq.pdf [Accessed: 2016-04-26]

[10] Bry FK, Kröger P. A computational biology database digest: data, data analysis, and data management. Distributed Parallel Databases. 2003;13:7–42. DOI: 10.1023/A:1021540705916

[11] Brown JR, editor. Handbook of Comparative Genomics: Basic and Applied Research. 1st ed. Boca Raton, FL: CRC Press; 2007. 400 p.

[12] Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, et al. Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science. 2000;290:2105–2110. DOI: 10.1126/science.290.5499.2105

[13] Yilmaz A, Nishiyama MY, Jr., Fuentes BG, Souza GM, Janies D, Gray J, et al. GRASSIUS: a platform for comparative regulatory genomics across the grasses. Plant Physiology. 2009;149:171–180. DOI: 10.1104/pp.108.128579

[14] Mochida K, Yoshida T, Sakurai T, Yamaguchi-Shinozaki K, Shinozaki K, Lam-Son Phan Tran. LegumeTFDB: an integrative database of *Glycine max*, *Lotus japonicus* and *Medicago truncatula* transcription factors [Internet]. 2009. Available from: http://legumetfdb.psc.riken.jp [Accessed: 2016-04-28]

[15] Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Research. 2003;31:3497–3500. DOI: 10.1093/nar/gkg500

[16] Nuin PAS, Wang Z, Tillier ERM. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics. 2006;7:471–488. DOI: 10.1186/1471-2105-7-471

[17] Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Advances in Mathematics. 1976;20:367–387. DOI: 10.1016/0001-8708(76)90202-4

[18] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004;32:1792–1797. DOI: 10.1093/nar/gkh340

[19] Goldberg DE, editor. Handbook of Genetic Algorithms in Search, Optimization and Machine Learning. 1st ed. Reading, MA: Addison-Wesley Longman; 1989. 372 p.

[20] Roey KV, Davey NE. Motif co-regulation and co-operativity are common mechanisms in transcriptional, post-transcriptional and post-translational regulation. Cell Communication Signal. 2015;13:45–60. DOI: 10.1186/s12964-015-0123-9

[21] Abrahams JP, Leslie AGW, Lutter R, Walker JE. Structure at 2.8 Å resolution of F1-ATPase from bovine heart mitochondria. Nature. 1994;370:621–628. DOI: 10.1038/370621a0

[22] Jaenicke R. Folding and association of proteins. Progress in Biophysics and Molecular Biology. 1987;49:117–237. DOI: 10.1016/0079-6107(87)90011-3


[35] Ratan A. ASSEMBLY algorithms for next generation sequence data [Internet]. 2009. Available from: http://www.yasara.org/products.htm#structure

[36] Maston GA, Evans K, Green MR. Transcriptional regulatory elements in the human genome. Annual Review of Genomics and Human Genetics. 2006;7:29–59. DOI: 10.1146/annurev.genom.7.080505.115623

[37] Lescot M, Déhais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, Rouzé P, Rombauts S. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences [Internet]. 2002. Available from: http://hdl.handle.net/1854/LU-153936 [Accessed: 2004-01-14]

[38] Pandey S, Negi YK, Marla SS, Arora S. Comparative in silico analysis of ascorbate peroxidase protein sequences from different plant species. Journal of Bioengineering & Biomedical Science. 2011;1:103–107. DOI: 10.4172/2155-9538.1000103

[39] Han Y, Zheng QS, Wei YP, Chen J, Liu R, Wan HJ. *In silico* identification and analysis of phytoene synthase genes in plants. Genetics and Molecular Research. 2015;14:9412–9422. DOI: 10.4238/2015

[40] Rao G, Sui J, Zhang J. In silico genome-wide analysis of the WRKY gene family in Salix arbutifolia. Plant Omics Journal. 2015;8:353–360.

[41] He G, Elling AA, Deng XW. The epigenome and plant development. Annual Review of Plant Biology. 2011;62:411–435. DOI: 10.1146/annurev-arplant-042110-103806

[42] Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nature Reviews Genetics. 2010;11:204–220. DOI: 10.1038/nrg2719

[43] Lister R, Gregory BD, Ecker JR. Next is now: new technologies for sequencing of genomes, transcriptomes, and beyond. Current Opinion in Plant Biology. 2009;12:107–118. DOI: 10.1016/j.pbi.2008.11.004

[44] Morsy M, Gouthu S, Orchard S, Thorneycroft D, Harper JF, Mittler R, Cushman JC. Charting plant interactomes: possibilities and challenges. Trends in Plant Sciences. 2008;13:183–191. DOI: 10.1016/j.tplants.2008.01.006

[45] Guo F, Ding Y, Li Z, Tang J. Identification of protein-protein interactions by detecting correlated mutation at the interface. Journal of Chemical Information and Modeling. 2015;55:2042–2049. DOI: 10.1021/acs.jcim.5b00320

[46] Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C. Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011;6:e28766. DOI: 10.1371/journal.pone.0028766

[47] Waters ER, Lee GJ, Vierling E. Evolution, structure and function of the small heat shock proteins in plants. Journal of Experimental Botany. 1996;47:325–338. DOI: 10.1093/jxb/47.3.325

[48] Brinkman FS, Leipe DD. Phylogenetic analysis. Methods of Biochemical Analysis. 2001;43:323–358.


[62] Verma M, Kumar V, Patel RK, Garg R, Jain M. CTDB: an integrated chickpea transcriptome database for functional and applied genomics. PLoS One. 2015;10:0136880. DOI: 10.1371/journal.pone.0136880

[63] Nooren IMA, Thornton JM. Diversity of protein-protein interactions. The EMBO Journal. 2003;22:3486–3492. DOI: 10.1093/emboj/cdg359

[64] You L. Toward computational systems biology. Cell Biochemistry and Biophysics. 2004;40:167–184. DOI: 10.1385/CBB:40:2:167

[65] Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Molecular System Biology. 2007;3:88–100. DOI: 10.1038/msb4100129

[66] Wang C, Marshall A, Zhang D, Wilson ZA. ANAP: an integrated knowledge base for Arabidopsis protein interaction network analysis. Plant Physiology. 2012;158:1523–1533. DOI: 10.1104/pp.111.192203

[67] Zhu P, Gu H, Jiao Y, Huang D, Chen M. Computational identification of protein-protein interactions in rice based on the predicted rice interactome network. Genomics Proteomics Bioinformatics. 2011;9:128–137. DOI: 10.1016/S1672-0229(11)60016-8

[68] Alkin B. iPlants-the world's plant on line [Internet]. 2004. Available from: iPlants document library: www.iplants.intranets.com

[69] Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research. 2005;33:428–432. DOI: 10.1093/nar/gki072

[70] Haw R, Stein L. Using the reactome database. In: Andreas D, et al., editors. Hand book of Current protocols in bioinformatics. Chichester: Wiley; 2012. DOI: 10.1002/0471250953.bi0807s38

[71] Croft D. Building models using Reactome pathways as templates. Methods in Molecular Biology. 2013;1021:273–283. DOI: 10.1007/978-1-62703-450-0\_14

[72] Fukusaki E, Kobayashi A. Plant metabolomics: potential for practical operation. Journal of Bioscience and Bioengineering. 2005;100:347–354. DOI: 10.1263/jbb.100.347

[73] Fukushima A, Kusano M. Recent progress in the development of metabolome databases for plant systems biology. Frontiers in Plant Science. 2013;4:73. DOI: 10.3389/fpls.2013.00073

[74] Deborde C, Jacob D, MeRy-B. A metabolomic database and knowledge base for exploring plant primary metabolism. Methods in Molecular Biology. 2014;1083:3–16. DOI: 10.1007/978-1-62703-661-0\_1

[75] Iquebal MA, Jaiswal S, Mukhopadhyay CS, Sarkar C, Rai A, Kumar D. Applications of bioinformatics in plant and agriculture. In: Barh D, Khan MS, Davies E, editors. Hand Book of Plant Omics: The Omics of Plant Science. Springer; 2015. pp. 755–790. DOI: 10.1007/978-81-322-2172-2\_1

[76] Agrawal PK, Babu BK, Saini N. Omics of model plants. In: Barh D, Khan MS, Davies E, editors. Hand Book of Plant Omics: The Omics of Plant Science. New Delhi: Springer; 2015. pp. 1–32. DOI: 10.1007/978-81-322-2172-2\_1


[89] Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, et al. Gramene: a growing plant comparative genomics resource. Nucleic Acids Research. 2008;36:947–953. DOI: 10.1093/nar/gkm968

[90] Grant D, Nelson RT, Cannon SB, Shoemaker RC. SoyBase, the USDA-ARS soybean genetics and genomics database. Nucleic Acids Research. 2010;38:843–846. DOI: 10.1093/nar/gkp798

[91] Fernandez-Pozo N, Menda N, Edwards JD, Saha S, Tecle IY, Strickler SR, Bombarely A, Fisher-York T, Pujar A, Foerster H, Yan A, Mueller LA. The Sol Genomics Network (SGN)--from genotype to phenotype to breeding. Nucleic Acids Research. 2015;43:1036–1041.

[92] Lawrence CJ, Dong Q, Polacco ML, Seigfried TE, Brendel V. MaizeGDB, the community database for maize genetics and genomics. Nucleic Acids Research. 2004;32:393–397. DOI: 10.1093/nar/gkh011

[93] Conte MG, Gaillard S, Lanau N, Rouard M, Périn C. GreenPhylDB: a database for plant comparative genomics. Nucleic Acids Research. 2008;36:991–998. DOI: 10.1093/nar/gkm934

[94] Proost S, Van Bel MV, Vaneechoutte D, de Peer YV, Inze D, Mueller-Roeber B, Vandepoele K. PLAZA 3.0: an access point for plant comparative genomics. Nucleic Acids Research. 2015;43:974–981. DOI: 10.1093/nar/gku986

[95] Wang Y, You FM, Lazo GR, Luo M-C, Thilmony R, Gordon S, Shahryar F, Kianian SF, Gu YQ. PIECE: a database for plant gene structure comparison and evolution. Nucleic Acids Research. 2012;41:1159–1166. DOI: 10.1093/nar/gks1109

[96] Sterck L, Rombauts S, Vandepoele K, Rouze P, Van de Peer Y. How many genes are there in plants (and why are they there)? Current Opinion in Plant Biology. 2007;10:199–203. DOI: 10.1016/j.pbi.2007.01.004

[97] Wall PK, Leebens-Mack J, Muller KF, Field D, Altman NS, dePamphilis CW. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Research. 2008;36:970–976. DOI: 10.1093/nar/gkm972

[98] Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. Synteny and collinearity in plant genomes. Science. 2008;320:486–488. DOI: 10.1126/science.1153917

[99] Lee TH, Tang H, Wang X, Paterson AH. PGDD: a database of gene and genome duplication in plants. Nucleic Acids Research. 2012;41:1152–1158. DOI: 10.1093/nar/gks1104

[100] Wang Y, Coleman-Derr D, Chen G, Gu YQ. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species. Nucleic Acids Research. 2015;43:78–84. DOI: 10.1093/nar/gkv487

[101] Li F, Fan G, Lu C, Xiao G, Zou C, Kohel R, Ma Z, Shang H, Ma X, Wu J, Liang X, Huang G, G Percy R, Liu K, Yang W, et al. Genome sequence of cultivated upland cotton (*Gossypium hirsutum* TM-1) provides insights into genome evolution. Nature Biotechnology. 2015;33:524–530. DOI: 10.1038/nbt.3208

[102] Hudek AK, Cheung J, Boright AP, Scherer SW. Genescript: DNA sequence annotation pipeline. Bioinformatics. 2003;19:1177–1178. DOI: 10.1093/bioinformatics/btg134


[115] Mochida K, Saisho D, Yoshida T, Sakurai T, Shinozaki K. TriMEDB: a database to integrate transcribed markers and facilitate genetic studies of the tribe Triticeae. BMC Plant Biology. 2008;8:72. DOI: 10.1186/1471-2229-8-72

[116] Yu J, Jung S, Cheng CH, Ficklin SP, Lee T, Zheng P, Jones D, Percy RG, Main D. CottonGen: a genomics, genetics and breeding database for cotton research. Nucleic Acids Research. 2014;42:1229–1236. DOI: 10.1093/nar/gkt1064

[117] Mount DM. Hand Book of Bioinformatics: Sequence and Genome Analysis. 2nd ed. New York: Cold Spring Harbor Laboratory Press; 2001.

## **Bioinformatics Tools and Genomic Resources Available in Understanding the Structure and Function of** *Gossypium*

Venkateswara R. Sripathi, Ramesh Buyyarapu, Siva P. Kumpatla, Abreeotta J. Williams, Seloame T. Nyaku, Yonathan Tilahun, Venu Kalavacharla and Govind C. Sharma

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/64325

### **Abstract**

Cotton is an economically and evolutionarily important crop, grown for its fiber. To improve fiber quality and yield, and to exploit the natural genetic potential inherent in genotypes, it is important to understand the genome structure and function of cultivated cotton. This in turn requires a working understanding of bioinformatics resources such as databases, software solutions, and analysis tools. Currently, however, there are very few unified reports on bioinformatics tools and even fewer repositories for accessing cotton genomic information. Resourceful developers and bioinformatics scientists who actively address complex genomic challenges in cotton are also much needed. The primary goal of this chapter is to review tools and resources for analyzing genome structure and function, with emphasis on cotton, a complex and economically important plant species. The discussion begins with a description of concurrent advances in high‐throughput genome sequencing and bioinformatics analyses and focuses on four major sections covering bioinformatics tools and resources for the analysis of (1) genomes; (2) transcriptomes; (3) small RNAs; and (4) epigenomes. Each section discusses recent advances in cotton and outlines cotton genome sequencing and annotation efforts. This review discusses the availability of genome information for both diploid and tetraploid species, which has propelled cotton genome research into the post‐genomics era and opened new avenues for exploring regulatory mechanisms associated with the fine‐tuning of expression of fiber‐related genes. Finally, the potential impacts of these rapid advances, especially the challenges of handling and analyzing large datasets, are discussed.

**Keywords:** genome, transcriptome, epigenome, sequencing, cotton fiber

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **1. Introduction**

Cotton is an economically and evolutionarily important crop species. Cotton improvement has progressed impressively through conventional and molecular breeding, and genomic approaches based on next generation sequencing (NGS) technologies have further enhanced our ability to understand and utilize the genetic potential of crop species [1]. Early sequencing efforts in cotton (*Gossypium* spp.) were mostly limited to diploid species such as *G. raimondii*, whose genome is structurally less complex. DNA sequencing has now become a routine tool in cotton genetic research. Sequencing the genomic, transcriptomic, and regulatory regions of a plant species and comparing their patterns provides a better understanding of genome architecture, genetic variation, gene identification, and the regulation and expression of genes [2]. Understanding the cotton genome, transcriptome, and regulatory molecules (small RNAs and epigenetic modulators) will therefore provide deeper insights into the structure and function of the genome [3]. Dissecting the complexity of the tetraploid cotton genome, with its narrow genetic variability, is more challenging and requires efficient methodologies. Comparative genome and transcriptome analysis of cultivated cotton and its progenitors will aid in identifying novel single‐nucleotide polymorphisms (SNPs), copy number variations (CNVs), transcripts, and alternative splice junctions, in quantifying transcripts, and in determining the role of these elements in regulating cotton fiber quality and yield [4]. Similarly, transcriptome profiling studies are powerful in unravelling the mechanisms underlying gene expression associated with the sub‐genomes of a plant species. In addition, the role of small RNAs and epigenetic modulators in shaping the genome landscapes of plant species, including cotton, is increasingly evident.

DNA sequencing technologies have evolved tremendously over two decades from bacterial artificial chromosome (BAC)‐based cloning to single‐molecule sequencing [5, 6]. Concomitantly, the computational tools for addressing the problems in understanding sequence data have also evolved from sequence alignment algorithms such as Needleman–Wunsch and Smith–Waterman to short‐read alignment tools such as Burrows‐Wheeler Aligner (BWA) and Bowtie [7]. As described here, a wide range of publicly available bioinformatics tools and some commercially available sequence analysis tools such as CLCBio Genomics Workbench, Strand NGS, Geneious, Lasergene suite, and NextGENe are being utilized [8].

### **2. Genome analysis in cotton**

Approximately 50 naturally occurring cotton species are available, including 45 diploid (2n = 2x = 26) and five allotetraploid species (2n = 4x = 52), each with a haploid chromosome number of 13 [9]. Based on meiotic pairing and chromosome size, diploid species (2n = 26) have been placed into genomic groups A, B, C, D, E, F, G, or K. Of these, the A‐genome sources *G. herbaceum* (2n = 2x = A1A1 = 26) and *G. arboreum* (2n = 2x = A2A2 = 26), and the D‐genome sources *G. raimondii* (2n = 2x = D5D5 = 26) and *G. gossypioides* (2n = 2x = D6D6 = 26), are considered the closest progenitor species to cultivated allotetraploid cotton [10]. Diploid genome sizes of groups A‐G and K vary significantly between species, ranging from ∼900 to ∼2800 Mb. Two natural allotetraploid (2n = 4x = A<sup>t</sup>A<sup>t</sup>D<sup>t</sup>D<sup>t</sup> = 52) species, *G. hirsutum* and *G. barbadense*, are derived from a complex inter‐specific hybridization process between two diploid species carrying the A‐ and D‐genomes [11]. However, the sequencing of polyploid plant genomes has been a tedious task due to repeat regions, whole‐genome duplication events, and chromosomal rearrangements that have occurred in the process of evolution [12]. Effective strategies must be developed and employed for sequencing such complex genomes. To reduce this complexity, the closest diploid progenitor species carrying the D5 (*G. raimondii*; 880 Mb), A2 (*G. arboreum*; 1700 Mb), and A1 (*G. herbaceum*; 1700 Mb) genomes can be sequenced and compared against cultivated tetraploid species carrying the A<sup>t</sup>D<sup>t</sup> genome (*G. hirsutum*; ∼2400 Mb and *G. barbadense*; ∼2500 Mb); the estimated genome sizes [13] of the respective *Gossypium* species are given in parentheses. The whole‐genome sequencing efforts in cotton resulted in assembling ∼775 Mb (∼88%) of the *G. raimondii* and ∼1694 Mb (∼90%) of the *G. arboreum* genomes [14, 15]. Also, ∼2.3 Gb (∼90%) and ∼2.5 Gb (∼88%) of the genomes have been assembled into 13 pseudochromosomes each from the A<sup>t</sup> and D<sup>t</sup> sub‐genomes of *G. hirsutum* [16, 17] and *G. barbadense* [18, 19], respectively. Presently, *G. herbaceum* (A1) genome sequencing efforts are in progress, with a partial genome assembled up to 1.2 Gb (∼70%) [20].



Transposable elements (TEs) are abundant in plant genomes and are highly variable and often deleterious, as they undergo massive amplification within the genome. TEs have been implicated in gene mutation, whole‐genome duplication, chromosomal rearrangements, and novel gene formation through genetic and epigenetic changes. They can alter gene expression and phenotypes by establishing and modifying gene regulatory networks. This is accomplished by inducing changes in genetic and epigenetic mechanisms [21]. The variation in genome structure and organization, even in closely related species, is primarily due to TEs and whole‐genome duplication (WGD). This may be the result of a combination of non‐random events such as small RNA silencing and epigenetic mechanisms [22] and random events such as natural selection and adaptation [21]. With the availability of allotetraploid and diploid cotton genomes, comparative genomics has been used to analyze and understand the structural variation and role of TEs in evolution [23]. That study identified TE contents of ∼57, ∼68, and ∼67% in the D5, A2, and A<sup>t</sup>D<sup>t</sup> genomes of cotton, respectively. Long terminal repeats (LTRs) have contributed significantly to whole‐genome duplication and the evolution of domesticated cotton. Though the overall TE content in the A2‐ and A<sup>t</sup>D<sup>t</sup>‐genomes has been found to be similar, the frequency of LTR‐gypsy and LTR‐copia‐type elements varied significantly. Terminal repeat retrotransposons in miniature (TRIMs) are a small, ubiquitous, conserved, poorly characterized, and scarcely reported group of repeats. TRIMs are often derived from partial deletion of long terminal repeat retrotransposons found in genic regions and in gene‐body methylation within the genome. TRIMs are often targeted by small RNAs (sRNAs) of 21–24 nt in length. Screening for TRIMs in land plants is critical to understanding differences in selection pressure and evolutionary relationships among the clades. Using high‐throughput sequencing followed by bioinformatics analysis, 145 unique families of TRIMs have been identified after screening 48 plant genomes [24].

Genetic variation is an important element for crop improvement. An understanding of the genetic and genomic relationships of cotton species and cultivars is critical for further utilization of diversity in the development of improved cultivars with favorable alleles [25]. Allelic variations within the genome of a species can be classified into three major groups at the DNA level: microsatellites, insertions/deletions, and single‐nucleotide polymorphisms (SNPs) [26]. Molecular markers serve as efficient tools for genome characterization, for understanding genetically complex traits, for marker‐assisted selection (MAS), and for map‐based cloning in breeding programs. Several molecular marker technologies have been used to study the genetic diversity and relationships of *Gossypium* species [27]. However, cotton crop improvement is limited by its narrow genetic base and limited variation among the cultivated cotton cultivars. Genetic variation at the molecular level in *G. hirsutum* and its related species was previously characterized using isozyme/allozyme markers [28], non‐coding genomic markers such as restriction fragment length polymorphisms (RFLPs) [29], amplified fragment length polymorphisms (AFLPs) [30], microsatellites [31, 32], and single‐nucleotide polymorphisms [33].

Simple sequence repeat (SSR) markers are widely used in many plant and animal genomes due to their abundance, hypervariability, and suitability for high‐throughput analysis. Development of SSR markers using molecular methods is time consuming, laborious, and expensive. Use of computational approaches to mine ever‐increasing sequence collections such as expressed sequence tags (ESTs) in public databases permits rapid and economical discovery of SSRs [34]. SSR mining programs such as Repeat Pattern tool kit, SSR Finder, Advanced Content Matching Engine for Sequences (ACMES), Spectral‐repeat finder, Adplot, REPEATS, and other programs are routinely used to mine EST databases, genome survey sequences, and other nucleotide databases [35]. In cotton, SSR markers mined from diploid species such as *G. arboreum* were also successfully employed to understand the structural variation in tetraploid cultivars [32]. SSR markers have been used extensively in genetic mapping, quantitative trait loci (QTL), and trait mapping experiments for favorable characteristics such as fiber quality, higher yield [36], pathogen resistance [37, 38], and other important traits in cotton. With the advent of next generation sequencing technologies, identification of allelic variation at the single‐nucleotide level and its application in crop genetics is becoming common practice.
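The computational SSR mining described above can be illustrated with a minimal sketch. It is not the algorithm of any of the programs named in the text; it simply scans a sequence for perfect tandem repeats with a regular expression. The function name `find_ssrs` and the minimum repeat counts per motif length are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    min_repeats maps motif length -> minimum number of tandem copies.
    The default thresholds below are illustrative assumptions, not values
    taken from the chapter or from any specific SSR mining program.
    """
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-base motif followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    est = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "T"
    for start, motif, count in find_ssrs(est):
        print(start, motif, count)
```

Real SSR miners additionally handle imperfect and compound repeats and design flanking primers, which this sketch omits.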

SNP markers are becoming the "markers of choice" due to their abundance in genomes, amenability to automation, and high‐throughput genotyping capability. In cotton, gene‐specific SNPs have been characterized using available EST or transcriptome sequences [39]; however, these initial efforts and methods had limited application for genome‐wide SNP marker development in cotton species. The tetraploid nature and the highly complex, repetitive genome of cultivated cotton species pose significant challenges for genome‐wide SNP marker development. These complications usually result in a high number of false positive SNPs, especially when they are developed from sequences that were not thoroughly characterized [25]. Though there is considerable genetic variation across the cultivated cotton species, the narrow germplasm base within each of the cultivated tetraploid species, *G. hirsutum* and *G. barbadense*, has made the discovery of useful SNP markers more difficult. Using highly conservative parameters such as a minimum coverage of 8× at each SNP and 20% minor allele frequency, Byers et al. [40] identified a total of 11,834 and 1679 non‐genic SNPs between accessions of *G. hirsutum* and *G. barbadense*, respectively, in genome reduction assemblies. As a part of the same study, an additional 4327 genic SNPs were also identified between accessions of *G. hirsutum* in the EST assembly. Transcriptome sequencing has also been extended to SNP marker discovery [41]. Using oligonucleotide microarrays, SNP markers in seven differentially expressed EXPANSIN transcripts in early cotton fiber development have been identified [39]. Variant analysis in cotton revealed 27,956 indels and 149,616 SNPs from 268,786 EST assembled contigs [42]. Over 1000 SNPs from 92 single‐copy polymorphic loci have been characterized in tetraploid cotton [43].
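The conservative filtering criteria quoted above (at least 8× coverage and at least 20% minor allele frequency) can be expressed as a short sketch. The function `call_snp` and its per-site input of read counts per base are hypothetical; the sketch assumes the minor allele frequency is computed from reads at a single site and is not a reconstruction of the pipeline used in [40].

```python
from collections import Counter


def call_snp(base_counts, min_coverage=8, min_maf=0.20):
    """Apply coverage and minor-allele-frequency filters at one site.

    base_counts: mapping of base -> number of reads supporting that base.
    Returns (major_allele, minor_allele, maf) or None if the site fails.
    """
    counts = Counter({b: c for b, c in base_counts.items() if c > 0})
    depth = sum(counts.values())
    # Reject sites that are too shallow or show only one allele.
    if depth < min_coverage or len(counts) < 2:
        return None
    (major, _), (minor, n_minor) = counts.most_common(2)
    maf = n_minor / depth
    if maf < min_maf:
        return None
    return major, minor, round(maf, 3)


if __name__ == "__main__":
    print(call_snp({"A": 9, "G": 3}))   # passes both filters
    print(call_snp({"A": 19, "G": 1}))  # minor allele too rare
```

In polyploid cotton, a site passing these filters may still reflect a difference between homeologous copies rather than a true allelic SNP, which is one source of the false positives mentioned above.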


Despite significant success in cotton breeding and genetic improvement for fiber‐related traits, genome‐wide physical or high‐density genetic maps are scarcely available in cotton due to its large genome size. Approximately 5000 markers are needed to fully saturate the cotton genome [44]. A diverse list of markers associated with cotton is available in the cotton marker database (CMD) [45]. CottonGen is a comprehensive database that contains genetic, breeding, and genomic data in cotton, including 49 genetic maps, ∼24,000 markers, ∼1000 quantitative trait loci (QTL) linked to more than 30 agronomic traits, ∼18,000 genes/transcripts, and ∼460,000 expressed sequence tags (ESTs) [46]. Cotton QTLdb contains a total of 2274 QTLs that have been identified in intraspecific populations [47]. Recently, using the genotyping‐by‐sequencing method, an ultra‐dense inter‐specific genetic map comprising ∼5 million SNPs distributed unevenly across 26 linkage groups has been constructed in allotetraploid cotton [4].

Currently, there is a growing repertoire of available NGS platforms that support whole‐genome sequencing, re‐sequencing, and exome sequencing, such as NextSeq, HiSeq, and HiSeq X series (Illumina, Inc.) and IonProton (Life Technologies), while ABI3500 series (Applied Biosystems), MiniSeq and MiSeq (Illumina, Inc.), and Ion S5 and Ion PGM (Life Technologies) are suitable for amplicon sequencing. Extensive genomic information related to gene regulation and to genetic and epigenetic codes is available at GenBank (NCBI), EBI (EMBL), and DDBJ (NIG). In addition, raw sequencing data obtained from NGS platforms as a result of sequencing/re‐sequencing efforts are deposited in the NCBI Sequence Read Archive (SRA). NCBI‐Genome hosts information related to genomes, including sequences, physical and genetic maps, chromosomes, assemblies, and annotations. Single‐nucleotide polymorphisms (SNPs) and multiple small‐scale variations that include insertions/deletions, microsatellites, and non‐polymorphic variants are also available at dbSNP. The databases available for repeats are Repbase, RepeatsDB, Dfam, P‐MITE, and the PGSB Repeat Database, while information associated with plant *cis*‐acting regulatory DNA elements in promoter regions is accessible through PLACE and PlantCARE.


Three major categories of assemblers are widely used in whole‐genome and transcriptome assembly: overlap layout consensus (OLC)‐based, Eulerian‐based/*De Bruijn* graph (DBG)‐based, and greedy assemblers. The available sequence assembly tools are summarized in **Table 1** [48].
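To make the DBG‐based category concrete, the following minimal sketch builds a de Bruijn graph from error‐free reads and extends a contig along an unambiguous path. It is a teaching illustration only; the function names are hypothetical, and production assemblers such as those listed in Table 1 add error correction, coverage tracking, and repeat resolution.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Link the (k-1)-mer prefix of each distinct k-mer to its suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # add each distinct k-mer as one edge
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on cycles created by repeats
            break
        visited.add(node)
        contig += node[-1]
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
    graph = de_bruijn_graph(reads, k=4)
    print(walk_unambiguous_path(graph, "ATG"))
```

The walk stops at the first branch or cycle, which is where repeats in a genome such as cotton's fragment an assembly into separate contigs.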


The D5 and A2 cotton genomes were assembled using SOAPdenovo, while the A<sup>t</sup>D<sup>t</sup> genome was assembled using overlap‐layout‐consensus (OLC) assembly followed by SOAPdenovo to achieve a coverage of over 100×. Other tools commonly used in the structural and functional analysis of cotton genomes are summarized in **Table 2**.

| Assembler category | Assembly program | References |
|---|---|---|
| Overlap layout consensus (OLC)‐based | Celera assembler, Arachne, CAP/PCAP/CAP3, CABOG, MIRA, Newbler/GS de novo assembler, Edena | [49–53] |
| Eulerian‐based/*De Bruijn* graph (DBG)‐based | Euler, Velvet, AllPaths, ABySS, Trans‐ABySS, SOAP and SOAPdenovo, miraEST, Oases and Rnnotator | [54–57] |
| Greedy assemblers | SSAKE, VCAKE, and SHARCGS | [58, 59] |

**Table 1.** Genome and transcriptome assemblers.

| Bioinformatics utility | Programs | References |
|---|---|---|
| Homology‐based search | NCBI BLASTN, BLASTP | |
| Mapping to reference genome | SAMtools, Picard, BEDtools, BWA, BWT, Bowtie2, CLC Genomics Workbench | [77–79] |
| Whole‐genome alignment | LASTZ, AVID, VISTA | [80, 81] |
| Multiple sequence alignment | MUSCLE; MEGA5 | [82] |
| Phylogenetic analysis | EMBOSS; PAML | [83] |
| Repeat regions | RepeatMasker, Tandem Repeat Finder, MGEScanLTR, TransposonPSI, MITE Digger | [84–86] |
| Gene structure | Augustus, GeneMark, FGENESH, GLAD, GeneWise, GenScan, GlimmerHMM, HMMER, GeneID, SNAP, and GLEAN | [87–90] |
| Gene annotation | BLAST2GO, BLAT, PASA | [91, 92] |
| Orthologous sequence search | OrthoMCL | [93] |
| Paralogous sequence search | MCScan, MCScanX | [94] |
| Transcriptome tools | Trinity, Tuxedo, TopHat, Cufflinks | [95, 96] |
| Pseudogenes | Pseudopipe | [97] |
| Single‐nucleotide polymorphisms detection | GATK pipeline, SnpEff | [98, 99] |
| Copy number variation detection | CNVKit | [100] |
| Linkage mapping | MapMaker, JoinMap, MapManager QTX | [101, 102] |
| QTL mapping | MapQTL, Windows QTL Cartographer | [103, 104] |

**Table 2.** Commonly used bioinformatics programs in structural, functional and regulatory genome analyses of cotton.

### **3. Transcriptome analysis in cotton**

RNA‐sequencing (RNA‐Seq) and its analysis have expanded rapidly in parallel with genome sequencing methodologies. RNA‐Seq has been widely adopted due to its high accuracy in the characterization and quantification of transcriptomes [60]. The major objectives of transcriptomics are to catalog the transcripts of all species, to predict the genes with biological functions, and to quantify the divergence in gene expression under varied spatial, temporal, and ecological conditions. In the beginning, RNA‐Seq was focused on deciphering the transcriptomes of model species, but it was quickly extended to non‐model species because RNA‐Seq analysis can be carried out with or without a reference genome. High‐resolution transcriptome characterization in plant species has shifted from EST sequencing to microarray analysis and currently to RNA‐Seq. In plants, RNA‐Seq was first adopted for sequencing the transcriptome of *Arabidopsis* using the massively parallel 454 sequencing platform [61]. Since then, transcriptomes of several plant species, including cotton, have been sequenced under differing genetic and physiological states [62]. The currently available NGS platforms that support whole‐transcriptome or total RNA/mRNA analysis are GS FLX+ System (Roche 454), SOLiD and IonProton (Applied Biosystems), and MiSeq, HiSeq series, and NextSeq (Illumina, Inc.). The platforms that support targeted RNA analysis are GS Junior, IonTorrent PGM, and MiniSeq.

Several independent studies used RNA‐Seq analysis to determine spatial‐, temporal‐, tissue‐, genotype‐, and genome‐specific as well as stress‐induced (abiotic and biotic) expression in both diploid and tetraploid cottons. Some recent studies are discussed here. Several groups have sequenced cotton fiber cDNA or ESTs [63, 64] and the transcriptome or mRNA [65, 66] using RNA‐Seq, while others utilized microarrays for analysis [67, 68]. Recently, 40,976, 41,330, 66,434, and 80,876 high‐confidence protein‐coding genes have been predicted in *G. raimondii*, *G. arboreum*, *G. hirsutum*, and *G. barbadense*, respectively. Cotton fibers are unique in that they are the most elongated cells in plants, and they account for 40–50% of the whole transcriptome in Upland cotton [69].

Differential gene expression analyses in tetraploid and diploid cottons have been carried out [65]. Two separate studies in root transcriptome analyses of *G. hirsutum* revealed 519 and 1530 differentially expressed transcripts between well‐watered and water‐deficit conditions, respectively [66, 70]. The comparative transcriptome analysis of developing cotyledon and embryo axis in cotton revealed 17,384 differentially expressed unigenes between two tissues. Of these, ∼8000 unigenes were down‐regulated and ∼10,000 unigenes were up‐regulated in cotyledons [71]. Transcriptome analysis of interspecific hybrid F1, synthetic and natural allopolyploid cotton revealed homeolog expression bias (relative contribution of homeologs towards gene expression) and expression level dominance bias (over‐all expression from both homeologs) towards A‐genome in diploid and natural allopolyploid cotton but not in synthetic cotton [72]. Using RNA‐Seq, the global transcriptome profiles of developing cotton fibers from four wild and five domesticated cottons were compared, and this study identified over 5000 differentially expressed genes during the primary and secondary cell wall synthesis between wild and domesticated cottons [73]. Comparative RNA‐Seq analysis of *G. hirsutum* from young, mature, and different senescence stages of leaves identified 3624 differentially expressed genes during leaf senescence [74]. Comparative transcriptome profiling of *G. hirsutum* and *G. davidsonii* (diploid wild species) further characterized 4744 and 5337 differentially expressed salt‐stress responsive genes from roots and leaves of *G. davidsonii* [75]. The potential role of signalling pathways, ethylene pathway in fiber elongation, and receptor‐like kinases (RLKs) in cell wall integrity have been proposed in determining fiber quality using comparative RNA‐Seq analyses of near isogenic lines (NILs) in *G. hirsutum* [76].

The major mRNA repositories are Crops‐ESTdb, NCBI dbEST, UniGene, RefSeq, and GEO; plant‐specific microarray databases are the plant expression database (PLEXdb) and the plant co‐expression database (PLANEX). Tools for RNA‐Seq expression analysis include RSEM and R/Bioconductor packages such as DEGseq, DESeq, edgeR, and baySeq. The open access expVIP is a web‐based tool for visualizing, analyzing, and comparing RNA‐Seq data for conducting gene expression analysis in diploid and polyploid plant species. Single‐nucleotide variant analysis in RNA‐Seq datasets can be performed by SAMtools followed by the Genome Analysis Toolkit (GATK). RobiNA is a web‐based tool for accessing and comparing EST, microarray, and RNA‐Seq databases.

Though several OLC‐based, DBG‐based, and greedy assemblers discussed above have been utilized in RNA‐Seq analyses, Trinity is the most promising tool for generating transcriptome assemblies. The Tuxedo pipeline contains a series of tools including Bowtie and TopHat for aligning the RNA‐Seq reads, Cufflinks for assembling the mapped reads, Cuffdiff for identifying differentially expressed genes, and CummeRbund for visualizing differentially expressed genes and transcripts [96]. Thresholds such as false discovery rate (FDR) ≤ 0.001, fold change (log2FC) ≥ 2, and *P*‐value ≤ 0.05 are considered in comparative transcriptome profiling using RNA‐Seq [105]. Using Trinity and Tuxedo, novel genes and splice variants can be determined. In RNA‐Seq analysis, FPKM (fragments per kilobase of transcript per million mapped reads) and RPKM (reads per kilobase of transcript per million mapped reads) values are often used in quantifying gene expression. The eXpress tool has been used in detecting transcript abundance in *G. hirsutum*.
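The FPKM definition and the thresholds quoted above can be written out as a small sketch. The function names are hypothetical, and applying the fold-change cutoff to the absolute value of log2FC (so that down-regulated genes are also retained) is an assumption rather than something stated in the text.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # Equivalent to fragments / (length_kb * mapped_millions).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def is_differentially_expressed(log2_fc, p_value, fdr,
                                min_abs_log2_fc=2.0, max_p=0.05,
                                max_fdr=0.001):
    """Apply the three cutoffs quoted in the text to one gene.

    Assumption: the fold-change cutoff is applied to |log2FC|.
    """
    return (abs(log2_fc) >= min_abs_log2_fc
            and p_value <= max_p
            and fdr <= max_fdr)


if __name__ == "__main__":
    # 500 fragments on a 2 kb transcript in a library of 20 million fragments.
    print(fpkm(500, 2000, 20_000_000))
    print(is_differentially_expressed(-2.4, 0.0001, 0.0005))
```

RPKM follows the same formula with single reads counted instead of fragments, so the two values coincide for single-end libraries.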

Additionally, to annotate gene functions, homology‐based methods such as BLASTX, BLAST2GO, and BLAT have been used. PASA and EVM have been used in annotating 3' and 5' UTR regions and alternative splice events in the transcript assemblies. Functional annotation of transcript assembled fragments (TAFs) can be done using BLASTX (E‐value 1 × 10<sup>−6</sup>) or BLASTP (E‐value 1 × 10<sup>−5</sup>) against the protein databases non‐redundant (NR), SwissProt, and TrEMBL. Gene ontology annotation is conducted using BLAST2GO and AgriGO analysis. InterPro is used to annotate motifs and domains by comparing TAFs with publicly available databases such as Pfam, PRINTS, PROSITE, ProDom, and SMART. The databases used for predicting pathways are KEGG, PANTHER, Pathguide, PlantCyc, BioCyc, and MetaCyc. Using BLASTP (E‐value ≤ 1 × 10<sup>−5</sup>) and OrthoMCL, gene clusters between sub‐genomes in tetraploid cotton have been classified. FastQC and the FASTX‐Toolkit are widely used for assessing the quality of RNA‐Seq reads and filtering low‐quality reads. Reads contaminated with adapters are removed using Trimmomatic software. HMMER has been used in identifying transcription factor (TF) gene families in tetraploid cotton using the PlnTFDB. Using the D5 genome as a reference, homeologous genes/syntenic blocks between the A<sup>t</sup> and D<sup>t</sup> sub‐genomes have been identified using MCScanX. These tools are summarized in **Table 2**.
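As an illustration of the E-value cutoffs mentioned above, the sketch below filters BLAST hits in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and keeps the best-scoring hit per query. The function name `best_hits` and the example identifiers are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the highest bit-score hit per query with E-value <= max_evalue.

    Expects tab-separated lines in the standard 12-column BLAST tabular
    layout; column 11 is the E-value and column 12 the bit score.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    hits = [
        "taf1\tP1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t380",
        "taf1\tP2\t70.0\t150\t45\t0\t1\t150\t1\t150\t1e-08\t90",
        "taf2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t25",
    ]
    for query, (subject, evalue, score) in best_hits(hits).items():
        print(query, subject, evalue, score)
```

Passing `max_evalue=1e-6` would mirror the stricter BLASTX cutoff quoted in the text.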

The comparative transcriptome analysis of *G. hirsutum* with its progenitors ameliorates the complexities associated with the co‐existence of A<sup>t</sup> and D<sup>t</sup> genome transcripts in allotetraploid species. Moreover, comparative transcriptome mapping of sub‐genomes leads to the identification of high‐confidence transcriptional modules that are evolutionarily conserved and are specific to the genus *Gossypium*. Identifying sub‐genome‐specific transcripts and analyzing their role in cotton gene regulation would help in developing novel biomarker tools associated with various complex polygenic traits in cotton. Alternative splicing in eukaryotes results in transcriptome and proteome diversity. Abundant splice variants and novel transcriptionally active regions (nTARs) have been identified in *Arabidopsis* and rice using RNA‐Seq, but the majority of these novel transcripts did not overlap with known protein‐coding genes and open reading frames (ORFs), suggesting their potential role in post‐transcriptional gene regulation and novel transcript/gene formation [105]. Identification of alternative splice events and nTARs in tetraploid cotton is computationally challenging due to ploidy, duplication, chromosomal rearrangements, and the presence of homeolog expression bias.

### **4. Small RNA analysis in cotton**


Napoli et al. (1990) first identified the phenomenon of RNA interference (RNAi) and referred to it as co‐suppression in plants [106]. Hamilton & Baulcombe (1999) reported a group of antisense RNAs that mediate post‐transcriptional gene silencing (PTGS) in plants and target both cellular and viral mRNAs [107]. The world of sRNAs exploded with the discovery of miRNAs (microRNAs) in *C. elegans* [108]. In plants, miRNAs were first identified in *A. thaliana* and studied extensively in other plant species using comparative genomic, cloning, or sequencing approaches [109]. Unlike in animals, plant miRNAs are mainly found in intergenic and intronic regions [110]. MiRNA genes are mostly localized in clusters and are transcribed together as a single transcriptional unit [111]. MiRNA discovery has gained momentum in diverse plant species. The majority of plant miRNAs identified to date negatively regulate target gene expression at the post‐transcriptional level. MiRNAs regulate environmental stress response, metabolic processes, organogenesis, growth, and development [112]. The miRNAs associated with seed‐specific transcription factors have been identified and were found to influence differentiation and developmental timing [113].

The reduction in cotton fiber quality and yields is mainly attributed to several biotic and abiotic factors, and these complex traits are also regulated by small RNAs (sRNAs), mostly miRNAs. Recent advances in cotton genomics have provided an impetus to pursue the discovery of novel miRNAs in the cotton genome [3, 114, 115]. MiRNAs and small interfering RNAs (siRNAs) are the two major classes of endogenous small non‐coding RNAs that regulate gene expression in plants. Earlier reports suggest that miRNAs mediate a variety of functional roles such as developmental timing, cell proliferation, differentiation, morphogenesis, defense response, and signal transduction [116–118]. Until recently, cotton miRNA research was limited by the lack of genomic information for cultivated cotton. Cotton miRNAs have been identified in *G. hirsutum* (A<sup>t</sup>D<sup>t</sup>), *G. barbadense* (A<sup>t</sup>D<sup>t</sup>), *G. herbaceum* (A1), *G. arboreum* (A2), and *G. raimondii* (D5). Besides their role in fiber growth, development, initiation, and elongation, some miRNAs have been implicated in biotic and abiotic stress responses in cotton. Moreover, many cotton‐specific miRNAs have been identified, but their functions need experimental validation. Recent studies that use qRT‐PCR and degradome sequencing of cotton fiber RNA suggest that miRNAs play an important role in cotton fiber development. Wang et al. (2012) identified 73 miRNAs that belong to 49 families in *G. arboreum* using a homology‐based approach [115]. Zhang et al. (2013) identified 65 novel miRNAs and their candidate gene targets in *G. hirsutum* using transcript sequences of *G. raimondii* [116]. MiRNAs were also analyzed in the salt tolerance response of *G. hirsutum*. The miRNVL5 precursor was first discovered in *Arabidopsis* and later found in cotton. *G. arboreum* miRNAs were studied in response to Cotton Leaf Curl Disease [118].

The NGS technologies mentioned under transcriptome analysis are also applicable to small RNA sequencing. MiRNAs are highly conserved across species and can be predicted by using homology‐based methods. Various resources commonly used in analyses of small RNAs are described in **Table 3**.
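As a rough illustration of the homology‐based idea, the following Python sketch flags candidate small RNA reads that match a known mature miRNA with at most a few mismatches. It is a simplified stand‐in, not the algorithm of any tool cited in this chapter; the function names `mismatches` and `find_conserved`, the mismatch cutoff, and the restriction to ungapped equal‐length comparisons are assumptions made for illustration.

```python
def mismatches(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def find_conserved(candidates, known, max_mismatches=2):
    """Map each candidate read to the known miRNAs it resembles.

    candidates: iterable of small RNA reads (DNA or RNA alphabet).
    known: dict of {name: mature miRNA sequence}.
    Only ungapped, equal-length comparisons are made (a simplifying
    assumption; real pipelines also check precursor hairpin structure).
    """
    hits = {}
    for read in candidates:
        # Normalize to upper-case RNA so DNA reads compare to RNA references.
        query = read.upper().replace("T", "U")
        for name, mature in known.items():
            ref = mature.upper().replace("T", "U")
            if len(ref) == len(query) and mismatches(query, ref) <= max_mismatches:
                hits.setdefault(read, []).append(name)
    return hits
```

In practice the dictionary of known sequences would be loaded from a curated miRNA database, and any hit would still need precursor folding and experimental validation before being called a miRNA.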


| Purpose | Tool | References |
| --- | --- | --- |
| miRNA precursors | RNAfold, mfold | [123] |
| miRNA databases | miRBase, MicroRNAdb | [124] |
| miRNA prediction | miRCat, miREAP, miRPlant, miRDeep‐P, miRNAKey | [125–128] |
| Gene targets for miRNAs | miRanda and psRNATarget | [129, 130] |
| lncRNA homologs | RefSeq, NONCODE, lncRNADB, PLncDB, PNRD, PlncRNADB, and PLNlncRbase | [131–134] |
| Epigenomic databases | pENCODE, GEO, NGSmethDB, MethBase, plant methylome‐db | [135–137] |
| ChIP‐enriched region identification | HOMER | [138] |
| Methylation detection | CyMATE, MeQA, MEDIPS, FASTmC | [139, 140] |
| Bisulphite data analysis | Methylpipe | [141] |

**Table 3.** Bioinformatics tools used in regulatory analysis of cotton.

Other classes of noncoding RNAs identified in *Gossypium* include trans‐acting small interfering RNA (tasiRNA), long intergenic noncoding RNA (lincRNA), and long noncoding natural antisense transcript (lncNAT). Long noncoding RNAs (lncRNAs) are longer than 200 nt, rich in repetitive sequences, preferentially expressed in a tissue‐, genome‐, and lineage‐specific manner, poorly conserved, and participate in various regulatory processes. lncRNAs exhibit higher overall methylation levels than protein‐coding genes, while their expression is less affected by gene‐body methylation. The lncRNAs play a pivotal role in regulating lignin metabolism and cotton fiber initiation and elongation. There are two main classes of lncRNAs, lincRNAs and lncNATs, which are structurally similar but vary in the number and length of exons and transcripts. In cotton (*Gossypium* spp.), 30550 lincRNA and 4718 lncNAT loci have been reported. Moreover, homeolog expression bias of lncRNAs has been identified in the subgenomes of polyploid species when compared to wild parents [119]. In a similar study, 3510 lincRNAs and 2486 lncNATs that play an integral role in fiber initiation and elongation were identified in *G. arboreum* using strand‐specific RNA sequencing (ssRNA‐Seq) of cotton fibers and leaves [120]. A larger share of lncRNAs (∼50%) than of protein‐coding genes (∼20%) was preferentially expressed during rapid fiber elongation, supporting their function in fiber development. The direct role in fiber development of a 21–22 nt siRNA derived from the GhMML3 gene (MYB‐MIXTA‐like transcription factor 3), which encodes a lncNAT, has been investigated in *G. hirsutum* [119].

### **5. Epigenome analysis in cotton**

Post‐transcriptional gene silencing (PTGS) is mainly associated with the regulation of gene expression by directly targeting the mRNA, while transcriptional gene silencing (TGS) is mostly coordinated by epigenetic modifications such as DNA methylation, histone modifications (methylation and acetylation), chromatin remodelling factors, and polycomb group (PcG) and trithorax group (trxG) proteins [121]. These epigenetic modulators can be identified by homology‐based searching in Upland cotton and its closest progenitor species, which will aid in understanding the epigenetic landscape of Upland cotton. DNA methylation has been implicated in plant growth and development, in maintaining active and inactive chromatin states, and in gene silencing through canonical and non‐canonical RNA‐directed DNA methylation (RdDM) pathways [121]. Both pathways require the core RNAi machinery, including DICER‐LIKE (DCL), RNA‐DEPENDENT RNA POLYMERASE (RDR), and ARGONAUTE (AGO) proteins, RNA polymerases (Pol IV and V), and plant‐specific DNA methylases [122].

In plants, DNA methylation occurs in both CG and non‐CG (CHG and CHH) contexts, where "H" denotes A, T, or C. Although DNA is primarily methylated in the CG context in both plants and mammals, non‐CG methylation is more abundant in plants than in mammals [142]. The CG and CHG methylation contexts are symmetrical and are maintained through DNA replication, while the CHH context is established by *de novo* DNA methylation. DNA methylation within the genome can occur in promoters, gene bodies, transposable elements, and repeat regions [143]. Previous reports suggest that 35–43%, 30–41%, and 30–32% of loci were methylated in *A. thaliana* ecotypes [144], *Brassica oleracea* accessions [145], and *G. hirsutum* accessions [146], respectively. A role for DNA methylation in fiber growth and gene silencing has been proposed in cotton. It has also been suggested that CHH methylation may have a role in developmental timing in cotton. Annual fluctuations in DNA demethylases and methyltransferases have shown opposite trends in their abundance in cotton ovules. Also, substantial changes in CHH methylation were noticed in the promoter regions of three key genes, ETHYLENE RESPONSIVE FACTOR 6 (ERF6), SUPPRESSION OF RVS 161 DELTA 4 (SUR4), and 3‐KETOACYL‐COA SYNTHASE 13 (KCS13), that regulate cotton fiber growth. Development of homozygous RNAi lines specifically targeting demethylases and methyltransferases will aid in determining DNA methylation patterns in cotton fiber growth and development [147].
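The three sequence contexts can be made concrete with a small sketch. The Python snippet below is illustrative only and is not taken from any of the tools cited in this chapter; `cytosine_context` and `context_counts` are hypothetical helper names, and only the forward strand is scanned for brevity.

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG', 'CHH', or None for the base at index i.

    H stands for A, T, or C. None is returned when the base is not a
    cytosine or when the context cannot be determined (sequence end or
    an ambiguous downstream base such as N).
    """
    seq = seq.upper()
    if i < 0 or i >= len(seq) or seq[i] != "C":
        return None
    nxt = seq[i + 1:i + 3]  # up to two downstream bases
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) < 2 or any(b not in "ACGT" for b in nxt):
        return None
    # At this point nxt[0] is A, T, or C, i.e. "H".
    return "CHG" if nxt[1] == "G" else "CHH"


def context_counts(seq):
    """Count forward-strand cytosines in each methylation context."""
    counts = {"CG": 0, "CHG": 0, "CHH": 0}
    for i in range(len(seq)):
        ctx = cytosine_context(seq, i)
        if ctx is not None:
            counts[ctx] += 1
    return counts
```

For the toy sequence `CGCAGCTTC`, the cytosines at positions 0, 2, and 5 fall in the CG, CHG, and CHH contexts, respectively, while the final cytosine has no downstream bases and is left unclassified. A genome‐wide analysis would also scan the reverse complement.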

Variation in epigenetic modifications has been observed in spontaneous reciprocal hybrids of *Rosa sect. Caninae* and *Rubigineae* when compared to their parental species using cDNA–AFLPs and methylation‐sensitive amplified fragment length polymorphisms (MSAPs), suggesting a biased contribution of DNA methylation from parents to polyploid hybrids [148]. Similarly, genotype‐specific and tissue‐specific variations in DNA methylation have been identified in cotton by using MSAP with a methylation‐insensitive enzyme (BsiSI). The results suggest that CHG methylation is more diverse than the other two contexts (CG and CHH). This work established a relationship between DNA methylation and fiber development but failed to establish a correlation between epigenetic regulation and fiber quality. To achieve this, a robust study involving diverse genotypes/accessions and tissues is required. Cytosine methylation is one of the key epigenetic regulatory processes that likely silence duplicated genes in polyploid crop species such as cotton. The levels, patterns, and diversity of methylation polymorphism in the CG context have been investigated in 20 accessions of Upland cotton (*G. hirsutum*), identifying 32 methylation‐polymorphic cytosine sites using MSAP [63]. The introgression of exotic DNA fragments from a wild parent (*G. bickii*) into cultivated Upland cotton (*G. hirsutum*) has been investigated using AFLP and MSAP analysis, identifying ~2000 genomic and ~800 epigenetic sites [66]. This study identified ∼0.5% of alien DNA segments and ∼0.7% of genetic variation in the genomes of introgression lines. Though the overall methylation content is close to that of the wild parent, the context‐specific methylation varied significantly in introgression lines [149].

Using methylcytosine‐sequencing (MethylC‐Seq) and RNA‐sequencing (RNA‐Seq) analysis, variation in CHH methylation during ovule and fiber development was reported in cotton [150]. Further, this study suggested that CHH hypermethylation triggered RdDM‐dependent methylation in promoters, while RdDM‐independent methylation occurred in TEs and nearby genes, thus facilitating ovule and fiber development in cotton. Moreover, a contribution of CHG and CHH methylation in genic regions towards homeologous expression patterns in the A<sup>t</sup> or D<sup>t</sup> subgenomes has been proposed [150]. Evolution is a very slow process when it is purely dependent on mutation and recombination. In plants, it is believed that additional forces, including interspecific hybridization, accelerate the process of evolution. However, the mechanism of acceleration is less understood. A possible underlying mechanism has been proposed in wheat [151]. It was suggested that the interaction of alien nuclear DNA with cytoplasmic macromolecules as a result of interspecific hybridization can potentially trigger a network of epigenetic changes in nuclear DNA, thus altering the expression of genes and genetic pathways associated with physiological processes, which may serve as a possible source of variation that facilitates the evolutionary process. This idea sheds light on the development of epigenomic‐segregating lines in crop plants, including cotton [151].


Screening the core RNAi machinery, including important DNA methylases such as DOMAINS REARRANGED METHYLASE 1/2 (DRM1/2) and CHROMOMETHYLASE 2/3 (CMT2/3) and demethylases such as ROS1 and DEMETER (DME), in Upland cotton and its closest progenitor species will aid in elucidating the key regulatory mechanisms in whole‐genome duplication, chromosomal rearrangements, dosage compensation, and the evolutionary advantage of being polyploid. Chromosome painting techniques such as fluorescence *in situ* hybridization (FISH), variations of FISH, and genomic *in situ* hybridization (GISH) are useful not only in determining chromosome structure and the distribution of hetero/euchromatin but also in understanding epigenomic patterns and the evolution of polyploid species [152].

Although several databases and repositories are available for storing, cataloging, and normalizing animal and mammalian epigenomic data, they are limited for plants. A popular resource is the NCBI epigenomics sample browser, which gives access to a diverse collection of epigenomic data sets including genome‐wide DNA methylation maps and histone modifications. However, only the *A. thaliana* dataset is currently available in the sample browser. Other than the NCBI sample browser, the popular visualization tools for epigenomic data are the UCSC epigenome browser, Ensembl encyclopedia of DNA elements (ENCODE), and the WashU epigenome browser. A plant ENCODE (pENCODE) collects and curates epigenomic data from a wide range of plant species to compare, annotate, and understand the mechanisms behind plant evolution. The Epigenomics of Plants International Consortium (EPIC) provides a platform to share protocols, methods, and results across research labs to address basic questions of genetics and genome regulation beyond routine whole‐genome sequencing. GEO is one of the seminal repositories for hosting epigenome data from various sources including animals (human) and plants (*Arabidopsis* and maize). NGSmethDB and MethBase specifically store whole‐genome single‐base resolution data obtained by whole‐genome bisulphite sequencing (WGBS/BS‐Seq) from diverse organisms. Currently, very few plant‐specific epigenome databases are publicly available; they include plant methylome‐db, the *Arabidopsis* Epigenome Browser, and the Tomato Epigenome Database. Though a cotton‐specific epigenome database is not publicly available, methylomes of evolutionarily close relatives such as *Ricinus communis* and *Theobroma cacao* are available at plant methylome‐db with user restrictions [84]. Epigenetic resources and tools are summarized in **Table 3**.

The wide applications of chromatin immunoprecipitation sequencing (ChIP‐Seq) analysis include understanding protein‐DNA interactions, histone modifications, chromatin states, and DNA methylation. Methylation‐ and acetylation‐marked peaks can be identified using spatial clustering for identification of ChIP‐enriched regions (SICER), and the peaks can later be annotated by using HOMER (motif search algorithm). Other peak‐finding algorithms, Model‐based Analysis of ChIP‐Seq (MACS) and PeakSeq, have been extensively used in analysis. BS‐Seq analysis helps in the identification of genome‐wide differentially methylated regions (DMRs), and tools such as Bismark, BSMAP, and BS‐Seeker can be used. Cytosine Methylation Analysis Tool for Everyone (CyMATE) is a plant‐specific tool for analyzing BS‐Seq data. A web‐based tool, QUMA (quantification tool for methylation analysis), is suitable for BS‐Seq analysis, primarily for estimating CG methylation. MeDIP analysis helps in detecting methylated cytosines in both CG and non‐CG contexts, and tools such as MeQA and MEDIPS can be used. The FASTmC web tool is available for comparative analysis of CG, CHG, and CHH DNA methylation levels in non‐model species. MethPipe is useful in analyzing bisulphite‐sequencing data including BS‐Seq, WGBS, and reduced representation bisulfite sequencing (RRBS). Integrative epigenome analysis can be done using methylPipe and compEpiTools. DMRs have been identified in *Arabidopsis* by using methylPipe. ChIP‐Seq analysis in R (CSAR) accurately accounts for the protein‐bound regions in plant genomes. PolyCat determines mapping bias between genomes that may have resulted from SNPs; hence, it can be used in comparing the allopolyploid *G. hirsutum* against the diploids *G. raimondii* or *G. arboreum*. Moreover, it accepts data from diverse NGS platforms including RNA‐Seq, BS‐Seq, ChIP‐Seq, and SNP analysis [153].
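To make the notion of a context‐specific methylation level concrete, the sketch below computes a coverage‐weighted level per context from per‐site bisulfite read counts. It relies on the general BS‐Seq principle that unconverted (C) reads indicate methylation and converted (T) reads indicate no methylation; it does not reproduce the output format or statistics of any tool named above. The function name `weighted_methylation`, the input tuple layout, and the default coverage cutoff are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_methylation(sites, min_coverage=4):
    """Return {context: methylation level} from per-site read counts.

    sites: iterable of (context, methylated_reads, unmethylated_reads),
           where context is 'CG', 'CHG', or 'CHH'.
    Sites covered by fewer than min_coverage reads are skipped
    (the cutoff value is an arbitrary illustrative choice).
    """
    meth = defaultdict(int)
    total = defaultdict(int)
    for context, c_reads, t_reads in sites:
        depth = c_reads + t_reads
        # max(..., 1) also guards against zero-depth sites.
        if depth < max(min_coverage, 1):
            continue
        meth[context] += c_reads
        total[context] += depth
    # Level = methylated reads / all reads, pooled over sites per context.
    return {ctx: meth[ctx] / total[ctx] for ctx in total}
```

Pooling reads across sites before dividing gives highly covered sites more weight than averaging per‐site fractions would; either convention can be reasonable, so the choice should be stated when levels are compared between samples or subgenomes.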

### **6. Future direction**

The reduction in the cost of sequencing per base, the increase in throughput and read size, and the availability of better algorithms have significantly facilitated the integration of one or more genomic methods to address biological questions. For instance, continued progress in the detection of SNPs and CNVs from sequencing data, combined with genotyping‐by‐sequencing methods, will be helpful in crop improvement. Rapid advances in genomics, bioinformatics, and computational biology in the past two decades have already facilitated the generation and screening of large datasets, and thus have ushered us into the "Big Data" era of genomics. The plant science community still lacks full‐fledged computational infrastructure, data curation, and novel tools to extract information from the massive datasets. It should be realized that the analysis of polyploid genomes will continually need improvement because the data generated from high‐throughput technologies are voluminous, heterogeneous, and often unstructured. The concepts of parallel processing, cloud computing, and machine learning offer promising solutions for managing and analyzing large datasets. Another important limiting factor is the shortage of trained professionals to access and analyze these resources. Nonetheless, progress is being made as we enter the burgeoning age of "Big Data" to tackle genomes of widely grown and utilized crops such as cotton using bioinformatics tools.

### **Acknowledgements**

The authors acknowledge Dr. Avinash Sreedasyam of HudsonAlpha Institute of Biotechnology and Mr. Joshua Reid at Alabama A&M University for reviewing this book chapter. The authors would also like to thank Dr. Ibrokhim Abdurakhmonov and Ms. Ivona Lovric for their valuable suggestions and comments in improving this book chapter.

### **Author details**


Venkateswara R. Sripathi<sup>1\*</sup>, Ramesh Buyyarapu<sup>2</sup>, Siva P. Kumpatla<sup>2</sup>, Abreeotta J. Williams<sup>1</sup>, Seloame T. Nyaku<sup>3</sup>, Yonathan Tilahun<sup>4</sup>, Venu Kalavacharla<sup>5</sup> and Govind C. Sharma<sup>1</sup>

\*Address all correspondence to: v.sripathi@aamu.edu

1 Center for Molecular Biology, Alabama A&M University, Normal, AL, USA

2 Dow AgroSciences, Indianapolis, IN, USA

3 Crop Science Department, College of Basic and Applied Sciences, University of Ghana, Legon-Accra, Ghana

4 Cooperative Extension, School of Agriculture & Applied Sciences, Langston University, Langston, OK, USA

5 Molecular Genetics & EpiGenomics Laboratory, College of Agriculture & Related Sciences, Delaware State University, Dover, DE, USA

### **References**


D12 homoeologous chromosomes in Upland cotton. PloS One. 2013;8:e76757. doi:10.1371/journal.pone.0076757

[7] Shang J, Zhu F, Vongsangnak W, Tang Y, Zhang W, Shen B. Evaluation and comparison of multiple aligners for next-generation sequencing data analysis. BioMed Research International. 2014;2014:309650. doi:10.1155/2014/309650

[8] Shang J, Zhu F, Vongsangnak W, Tang Y, Zhang W, Shen B. Evaluation and comparison of multiple aligners for next-generation sequencing data analysis. BioMed Research International. 2014;2014:309650. doi:10.1155/2014/309650

[9] Fryxell P. A revised taxonomic interpretation of *Gossypium* L. (Malvaceae). Rheedea. 1992;2:108‐165

[10] Wendel JF, Cronn RC. Polyploidy and the evolutionary history of cotton. Advances in Agronomy. 2003;78:139‐186. doi:10.1016/S0065-2113(02)78004‐8

[11] Wendel JF, Schnabel A, Seelanan T. Bidirectional interlocus concerted evolution following allopolyploid speciation in cotton (*Gossypium*). Proceedings of the National Academy of Sciences of the United States of America. 1995;92:280‐284. doi:10.1073/pnas.92.1.280

[12] Bancroft I, Morgan C, Fraser F, Higgins J, Wells R, Clissold L, Baker D, Long Y, Meng J, Wang X, Liu S. Dissecting the genome of the polyploid crop oilseed rape by transcriptome sequencing. Nature Biotechnology. 2011;29:762‐766. doi:10.1038/nbt.1926

[13] Hendrix B, Stewart JM. Estimation of the nuclear DNA content of *Gossypium* species. Annals of Botany. 2005;95:789‐797. doi:10.1093/aob/mci078

[14] Wang K, Wang Z, Li F, Ye W, Wang J, Song G, Yue Z, Cong L, Shang H, Zhu S, Zou C. The draft genome of a diploid cotton *Gossypium raimondii*. Nature Genetics. 2012;44:1098‐103. doi:10.1038/ng.2371

[15] Li F, Fan G, Wang K, Sun F, Yuan Y, Song G, Li Q, Ma Z, Lu C, Zou C, Chen W. Genome sequence of the cultivated cotton *Gossypium arboreum*. Nature Genetics. 2014;46:567‐572. doi:10.1038/ng.2987

[16] Li F, Fan G, Lu C, Xiao G, Zou C, Kohel RJ, Ma Z, Shang H, Ma X, Wu J, Liang X. Genome sequence of cultivated upland cotton (*Gossypium hirsutum* TM-1) provides insights into genome evolution. Nature Biotechnology. 2015;33:524‐30. doi:10.1038/nbt.3208

[17] Zhang T, Hu Y, Jiang W, Fang L, Guan X, Chen J, Zhang J, Saski CA, Scheffler BE, Stelly DM, Hulse-Kemp AM. Sequencing of allotetraploid cotton (*Gossypium hirsutum* L. acc. TM-1) provides a resource for fiber improvement. Nature Biotechnology. 2015;33:531‐537. doi:10.1038/nbt.3207

[18] Liu X, Zhao B, Zheng HJ, Hu Y, Lu G, Yang CQ, Chen JD, Chen JJ, Chen DY, Zhang L, Zhou Y. *Gossypium barbadense* genome sequence provides insight into the evolution of extra-long staple fiber and specialized metabolites. Scientific Reports. 2015;5:14139. doi:10.1038/Srep14139

[19] Yuan D, Tang Z, Wang M, Gao W, Tu L, Jin X, Chen L, He Y, Zhang L, Zhu L, Li Y. The genome sequence of Sea-Island cotton (*Gossypium barbadense*) provides insights into the allopolyploidization and development of superior spinnable fibres. Scientific Reports. 2015;5:17662. doi:10.1038/Srep17662


[42] Xie F, Sun G, Stiller JW, Zhang B. Genome-wide functional analysis of the cotton transcriptome by creating an integrated EST database. PloS One. 2011;6:e26980. doi: 10.1371/journal.pone.0026980

[31] Liu S, Cantrell RG, McCarty JC, Stewart JM. Simple sequence repeat–based assessment of genetic diversity in cotton race stock accessions. Crop Science. 2000;40:1459‐1469.

[32] Buyyarapu R, Kantety RV, Yu JZ, Saha S, Sharma GC. Development of new candidate gene and EST-based molecular markers for *Gossypium* species. International Journal of

[33] Hulse-Kemp AM, Lemm J, Plieske J, Ashrafi H, Buyyarapu R, Fang DD, Frelichowski J, Giband M, Hague S, Hinze LL, Kochan KJ. Development of a 63K SNP array for cotton and high-density mapping of intra-and inter-specific populations of *Gossypium* spp.G3:

[34] Kumpatla SP, Mukhopadhyay S. Mining and survey of simple sequence repeats in 2 expressed sequence tags of dicotyledonous species. Genome. 2005;48:985-998. doi:3

[35] Sharma PC, Grover A, Kahl G. Mining microsatellites in eukaryotic genomes. Trends

[36] Shen X, Guo W, Lu Q, Zhu X, Yuan Y, Zhang T. Genetic mapping of quantitative trait loci for fiber quality and yield trait by RIL approach in Upland cotton. Euphytica.

[37] Yang C, Guo W, Li G, Gao F, Lin S, Zhang T. QTLs mapping for Verticillium wilt resistance at seedling and maturity stages in *Gossypium barbadense* L. Plant Science.

[38] Shen X, Van Becelaere G, Kumar P, Davis RF, May OL, Chee P. QTL mapping for resistance to root-knot nematodes in the M-120 RNR Upland cotton line (*Gossypium hirsutum* L.) of the Auburn 623 RNR source. Theoretical and Applied Genetics.

[39] An C, Saha S, Jenkins JN, Scheffler BE, Wilkins TA, Stelly DM. Transcriptome profiling, sequence characterization, and SNP-based chromosomal assignment of the EXPANSIN genes in cotton. Molecular Genetics and Genomics. 2007;278:539‐553. doi:10.1007/

[40] Byers RL, Harker DB, Yourstone SM, Maughan PJ, Udall JA. Development and mapping of SNP assays in allotetraploid cotton. Theoretical and Applied Genetics.

[41] Ashrafi H, Hulse-Kemp AM, Wang F, Yang SS, Guan X, Jones DC, Matvienko M, Mockaitis K, Chen ZJ, Stelly DM, Van Deynze A. A long-read transcriptome assembly of cotton (*Gossypium hirsutum*) and intraspecific SNP discovery. The Plant Genome.

Genes| Genomes| Genetics. 2015;2015:g3‐115. doi:10.1534/g3.115.018416

in Biotechnology. 2007;25:490-498. doi:10.1016/j.tibtech.2007.07.013

2007;155:371‐380. doi:10.1007/s10681-006-9338-6

2008;174:290‐298. doi:10.1016/j.plantsci.2007.11.016

2006;113:1539‐1549. doi:10.1007/s00122‐006-0401-4

2012;124:1201‐1214. doi:10.1007/s00122-011-1780-8

2015;8:1‐14. doi:10.3835/plantgenome2014.10.0068

Plant Genomics. 2011;2011:894598. doi:10.1155/2011/894598

doi:10.2135/cropsci2000.4051459x

248 Bioinformatics - Updated Features and Applications

10.1139/g05-060

s00438-007-0270-9


*tum* L.) root tissue under water-deficit stress. PloS One. 2013;8:e82634. doi:10.1371/ journal.pone.0082634

[67] Rapp RA, Haigler CH, Flagel L, Hovav RH, Udall JA, Wendel JF. Gene expression in developing fibres of Upland cotton (*Gossypium hirsutum* L.) was massively altered by domestication. BMC Biology. 2010;8:139. doi:10.1186/1741-7007-8-139

[54] Chaisson M, Pevzner P, Tang H. Fragment assembly with short reads. Bioinformatics.

[55] Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008;18:821‐829. doi:10.1101/gr.074492.107

[56] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Research. 2009;19:1117‐1123. doi:

[57] Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966‐1967. doi:10.1093/bioinfor‐

[58] Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007;23:500‐501. doi:10.1093/bioinformatics/btl629

[59] Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research.

[60] Lee S, Seo CH, Alver BH, Lee S, Park PJ. EMSAR: estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering. BMC

[61] Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB. Sampling the *Arabidopsis* transcriptome with massively parallel pyrosequencing. Plant Physiology. 2007;144:32‐

[62] Zhu YN, Shi DQ, Ruan MB, Zhang LL, Meng ZH, Liu J, Yang WC. Transcriptome analysis reveals crosstalk of responsive genes to multiple abiotic stresses in cotton (*Gossypium hirsutum* L.). PLoS One. 2013;8:e80218. doi:10.1371/journal.pone.0080218

[63] Lacape JM, Claverie M, Vidal RO, Carazzolle MF, Pereira GA, Ruiz M, Pré M, Llewellyn D, Al-Ghazi Y, Jacobs J, Dereeper A. Deep sequencing reveals differences in the transcriptional landscapes of fibers from two cultivated species of cotton. PloS One.

[64] Li X, Yuan D, Zhang J, Lin Z, Zhang X. Genetic mapping and characteristics of genes specifically or preferentially expressed during fiber development in cotton. PloS One.

[65] Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin D, Llewellyn D, Showmaker KC, Shu S, Udall J, Yoo MJ. Repeated polyploidization of *Gossypium* genomes and the evolution of spinnable cotton fibres. Nature. 2012;492:423‐427. doi:10.1038/nature11798

[66] Bowman MJ, Park W, Bauer PJ, Udall JA, Page JT, Raney J, Scheffler BE, Jones DC, Campbell BT. RNA-Seq transcriptome profiling of Upland cotton (*Gossypium hirsu‐*

2004;20:2067‐2074. doi:10.1093/bioinformatics/bth205

2007;17:1697‐1706. doi:10.1101/gr.6435207

42. doi:10.1104/pp.107.096677

Bioinformatics. 2015;16:278. doi:10.1186/s12859-015-0704-z

2012;7:e48855. doi:10.1371/journal.pone.0048855

2013;8:e54444. doi:10.1371/journal.pone.0054444

10.1101/gr.089532.108

250 Bioinformatics - Updated Features and Applications

matics/btp336


program to assemble spliced alignments. Genome Biology. 2008;9:R7. doi:10.1186/ gb-2008-9-1-r7

[93] Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 2003;13:2178‐2189. doi:10.1101/gr.1224503

[78] Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754‐1760. doi:10.1093/bioinformatics/btp324

[79] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841‐2. doi:10.1093/bioinformatics/btq033

[80] Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. VISTA: computational tools for comparative genomics. Nucleic Acids Research. 2004;32:W273‐W279. doi:

[81] Bray N, Dubchak I, Pachter L. AVID: a global alignment program. Genome Research.

[82] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004;32:1792‐1797. doi:10.1093/nar/gkh340

[83] Olson SA. Emboss opens up sequence analysis. Briefings in Bioinformatics. 2002;3:87‐

[84] Bedell JA, Korf I, Gish W. MaskerAid: a performance enhancement to RepeatMasker.

[85] Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids

[86] Yang G. MITE Digger, an efficient and accurate algorithm for genome wide discovery of miniature inverted repeat transposable elements. BMC Bioinformatics. 2013;14:186.

[87] Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research. 2005;33:W465‐W467. doi:

[88] Salamov AA, Solovyev VV. Ab initio gene finding in *Drosophila* genomic DNA. Genome

[89] Hu Y, Comjean A, Perkins LA, Perrimon N, Mohr SE. GLAD: an online database of gene list annotation for *Drosophila*. Journal of Genomics. 2015;3:75‐81. doi:10.7150/jgen.

[90] Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Research. 2002;30:4103‐4117. doi:10.1093/nar/

[91] Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research.

[92] Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR. Automated eukaryotic gene structure annotation using EVidenceModeler and the

Bioinformatics. 2005;21:3674‐3676. doi:10.1093/bioinformatics/bti610

Bioinformatics. 2000;16:1040‐1041. doi:10.1093/bioinformatics/16.11.1040

10.1093/nar/gkh458

252 Bioinformatics - Updated Features and Applications

91. doi:10.1093/bib/3.1.87

Research. 1999;27:573‐580.

doi:10.1186/1471-2105-14-186

Research. 2000;10:516‐522. doi:10.1101/gr.10.4.516

10.1093/nar/gki458

12863

gkf543

2003;13:97‐102. doi:10.1101/gr.789803


[117] Gao S, Yang L, Zeng HQ, Zhou ZS, Yang ZM, Li H, Sun D, Xie F, Zhang B. A cotton miRNA is involved in regulation of plant response to salt stress. Scientific Reports. 2016;6:19736. doi:10.1038/Srep19736

[104] Basten CJ, Weir BS, Zeng ZB. QTL Cartographer, version 1.17. Raleigh, NC: Department

[105] Leng X, Jia H, Sun X, Shangguan L, Mu Q, Wang B, Fang J. Comparative transcriptome analysis of grapevine in response to copper stress. Scientific Reports. 2015;5:17749. doi:

[106] Napoli C, Lemieux C, Jorgensen R. Introduction of a chimeric chalcone synthase gene into petunia results in reversible co-suppression of homologous genes in trans. The

[107] Hamilton AJ, Baulcombe DC. A species of small antisense RNA in posttranscriptional gene silencing in plants. Science. 1999;286:950‐952. doi:10.1126/science.286.5441.950

[108] Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G. The 21-nucleotide let-7 RNA regulates developmental timing in *Caenorhab‐*

[109] Voinnet O. Origin, biogenesis, and activity of plant microRNAs. Cell. 2009;136:669‐687.

[110] Jones-Rhoades MW, Bartel DP, Bartel B. MicroRNAs and their regulatory roles in plants. Annual Review of Plant Biology. 2006;57:19‐53. doi:10.1146/annurev.arplant.

[111] Combier JP, Frugier F, De Billy F, Boualem A, El-Yahyaoui F, Moreau S, Vernié T, Ott T, Gamas P, Crespi M, Niebel A. MtHAP2-1 is a key transcriptional regulator of symbiotic nodule development regulated by microRNA169 in Medicago truncatula.

[112] Jones-Rhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Molecular Cell. 2004;14:787‐799. doi:

[113] Lee H, Yoo SJ, Lee JH, Kim W, Yoo SK, Fitzgerald H, Carrington JC, Ahn JH. Genetic framework for flowering-time regulation by ambient temperature-responsive miRNAs in *Arabidopsis*. Nucleic Acids Research. 2010;2010:gkp1240. doi:10.1093/nar/gkp1240

[114] Kim VN. MicroRNA biogenesis: coordinated cropping and dicing. Nature Reviews

[115] Wang M, Wang Q, Wang B. Identification and characterization of microRNAs in Asiatic cotton (*Gossypium arboreum* L.). PLoS One. 2012;7:e33696. doi:10.1371/journal.pone.

[116] Zhang H, Yang JH, Zheng YS, Zhang P, Chen X, Wu J, Xu L, Luo XQ, Ke ZY, Zhou H, Qu LH. Genome-wide analysis of small RNA and novel microRNA discovery in human acute lymphoblastic leukemia based on extensive sequencing approach. PLoS One.

Genes & Development. 2006;20:3084‐3088. doi:10.1101/gad.402806

Molecular Cell Biology. 2005;6(5):376‐385. doi:10.1038/nrm1644

2009;4:e6849. doi:10.1371/journal.pone.0069743

of Statistics, North Carolina State University; 2004.

Plant Cell. 1990;2:279‐289. doi:10.1105/Tpc.2.4.279

*ditis elegans*. Nature. 2000;403:901‐906. doi:10.1038/35002607

10.1038/srep17749

254 Bioinformatics - Updated Features and Applications

doi:10.1016/j.cell.2009.01.046

10.1016/j.molcel.2004.05.027

57.032905.105218

0033696


[141] Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu J, Garvin T, Kessler M, Zhou J, Smith AD. A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PloS One. 2013;8:e81148. doi:10.1371/journal.pone.0081148

[129] Betel D, Wilson M, Gabow A, Marks DS, Sander C. The microRNA. org resource: targets and expression. Nucleic Acids Research. 2008;36:D149-D153. doi:10.1093/nar/gkm995

[130] Dai, X. and P.X. Zhao, psRNATarget: a plant small RNA target analysis server. Nucleic

[131] Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non redundant sequence database of genomes, transcripts and proteins. Nucleic Acids

[132] Liu C, Bai B, Skogerbø G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Research.

[133] Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Research. 2011;39:D146-D151. doi:

[134] Xuan H, Zhang L, Liu X, Han G, Li J, Li X, Liu A, Liao M, Zhang S. PLNlncRbase: a resource for experimentally identified lncRNAs in plants. Gene. 2015;573:328‐332. doi:

[135] Lane AK, Niederhuth CE, Ji L, Schmitz RJ. pENCODE: a plant encyclopedia of DNA elements. Annual Review of Genetics. 2013;48:49‐70. doi:10.1146/annurev-gen‐

[136] Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles —database and tools update. Nucleic Acids Research. 2007;35:D760‐D765. doi:10.1093/

[137] Geisen S, Barturen G, Alganza ÁM, Hackenberg M, Oliver JL. NGSmethDB: an updated genome resource for high quality, single-cytosine resolution methylomes. Nucleic

[138] Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell.

[139] Hetzl J, Foerster AM, Raidl G, Scheid OM. CyMATE: a new tool for methylation analysis of plant genomic DNA after bisulphite sequencing. The Plant Journal. 2007;51:526‐536.

[140] Bewick AJ, Hofmeister BT, Lee K, Zhang X, Hall DW, Schmitz RJ. FASTmC: a suite of predictive models for non-reference-based estimations of DNA methylation. G3:

Genes, Genomes, Genetics. 2015:g3‐115. doi:10.1534/g3.115.025668

Acids Research. 2013:gkt1202. doi:10.1093/nar/gkt1202

2010;38:576‐589. doi:10.1016/j.molcel.2010.05.004

doi:10.1111/j.1365-313X.2007.03152.x

Acids Research, 2011;39:W155‐W159. doi:10.1093/nar/gkr319

Research. 2007;35:D61‐D65. doi:10.1093/nar/gkl842

2005;33:D112‐D115. doi:10.1093/nar/gki041

10.1093/nar/gkq1138

256 Bioinformatics - Updated Features and Applications

et-120213-092443

nar/gkl887

10.1016/j.gene.2015.07.069


[153] Page JT, Gingle AR, Udall JA. PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3: Genes| Genomes| Genetics. 2013;3:517‐525. doi:10.1534/g3.112.005298

[153] Page JT, Gingle AR, Udall JA. PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3: Genes| Genomes| Genetics.

2013;3:517‐525. doi:10.1534/g3.112.005298

258 Bioinformatics - Updated Features and Applications

### *Edited by Ibrokhim Y. Abdurakhmonov*

Bioinformatics is an interdisciplinary science that aims to develop methodologies and analysis tools for exploring large volumes of biological data using conventional and modern computer science, statistics, and mathematics, as well as pattern recognition, reconstruction, machine learning, simulation and iterative approaches, molecular modeling, folding, networking, and artificial intelligence. Written by an international team of life scientists, this book provides updates on bioinformatics methods, resources, approaches, and genome analysis tools useful for the molecular sciences, medicine, and drug design, as well as plant sciences and agriculture. I trust that the chapters of this book will give university students, life science researchers, and interested readers advanced knowledge of some of the latest developments in the bioinformatics field.

Photo by kentoh / CanStock

Bioinformatics - Updated Features and Applications
