Bioinformatics Tools and Genomic Resources Available in Understanding the Structure and Function of *Gossypium*

*Gugulothu Baloji, Lali Lingfa and Shivaji Banoth*

## **Abstract**

*Gossypium* spp. (cotton) is the world's most valuable natural fiber crop. The diversity of *Gossypium* species makes them a good model for studying polyploid evolution and domestication. Over the past decade, functional genomics has shifted from a theoretical idea to a well-established scientific discipline. Cotton functional genomics has the potential to expand our understanding of fundamental plant biology, allowing us to use genetic resources more effectively to enhance cotton fiber quality and yield, along with using genetic data to enhance germplasm. This chapter provides a complete review of the latest techniques and resources that have become available for cotton functional genomics, for developing elite cotton genotypes and for determining structure. To use them, bioinformatics resources, including databases, software solutions and analytical tools, must be functionally understood. Aside from GenBank and cotton-specific databases such as CottonGen, a wide range of tools for accessing and analyzing genetic and genomic information is also addressed. The chapter covers the many forms of genetic and genomic data now accessible to the cotton community and the fundamental bioinformatics sources related to cotton species; with these techniques, cotton researchers and scientists may use this information to better understand cotton's functions and structures.

**Keywords:** *Gossypium*, genomics, bioinformatics, structure, gene sequencing

## **1. Introduction**

Cotton is the world's major natural fiber crop, providing fiber for textiles, one of the biggest and most significant sectors. It has a global economic effect of around \$500 billion every year. Cotton genetic resources, which comprise germplasm from more than 50 distinct cotton species, have allowed researchers to investigate the development of lint fibers and the impact of polyploidy in enhancing lint output. They also allow researchers to gain a better understanding of how domesticated cotton evolved from its wild counterpart. Ways of strengthening cotton as a bio-based alternative to petroleum-based products, such as assessing the degree of genetic diversity and exploiting it to increase cotton yield and lint quality, are also being carefully researched [1]. Cotton (*Gossypium hirsutum*) is valued as a renewable fiber resource and is a cornerstone of the global economy. It can also serve as a model for various biological investigations, including polyploidization [2], single-celled biological processes and genome evolution [3]. Polyploidy and genome size disparities within the *Gossypium* genus, as well as cotton's evolution, may be better understood by decoding cotton's genome [4].

This chapter summarizes the technological advances achieved during the previous two decades. For example, progress has been made in understanding differences in a variety of physiological, biochemical, morphological and genetic features, all of which are explored in this volume. As with other significant crop species, many cutting-edge genomic methods have been used to study the genomes of the genus *Gossypium*. This work has contributed significantly to laying the groundwork for maintaining lint production around the globe [5], but much remains to be investigated [6]. Cotton's genetic base has narrowed as a consequence of extensive domestication [7]. Earlier generations of cotton production relied on conventional genetic resources; nevertheless, cotton productivity has been declining over the last several years [8]. It is therefore imperative to develop new genetic resources to confront the challenges of the new millennium.

Genomic methods have been used extensively to enhance fiber characteristics, to produce cotton cultivars that are resistant to insect pests and diseases, and to develop cotton varieties that tolerate abiotic stresses. Many consider that genetic modification has led to improved genomes [9]. This chapter covers cotton's genetic resources broadly, both traditional and modern, as well as their potential for future use. Better use of existing genetic resources has the potential to alleviate the threats that cotton production now faces.

## **2. Genomics and genetic diversity of** *Gossypium* **spp**

*Gossypium* spp. are members of the Malvaceae family. The genus *Gossypium* contains 50 species (45 diploids, 2n = 2x = 26, and 5 tetraploids, 2n = 4x = 52) [10]. The tetraploid species were formed after interbreeding of the A- and D-genome species around 4–11 million years ago (MYA), and they diverged from their common ancestor about 100,000–200,000 years ago. Because each of the two subgenomes carries a copy of practically all genes, tetraploid cotton contains at least two copies of each gene [11]. Furthermore, these genes are organized in the same order as in the diploid progenitors [12]. Two tetraploid species, *Gossypium barbadense* and *G. hirsutum*, are among the most widely farmed cottons in the world.
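The chromosome counts quoted above follow from a base chromosome number of x = 13. The short sketch below only illustrates that arithmetic; the function and constant names are illustrative and do not come from any cotton software package.

```python
# Base chromosome number for Gossypium, implied by 2n = 2x = 26 in the text.
BASE_CHROMOSOME_NUMBER = 13


def somatic_chromosome_count(ploidy: int, base: int = BASE_CHROMOSOME_NUMBER) -> int:
    """Return the somatic chromosome number (2n) for a given ploidy level.

    ploidy is the number of chromosome sets: 2 for diploids, 4 for tetraploids.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return ploidy * base


diploid_2n = somatic_chromosome_count(2)     # diploid species: 2x
tetraploid_2n = somatic_chromosome_count(4)  # allotetraploid species: 4x

# Species counts reported for the genus in the text.
n_diploid_species, n_tetraploid_species = 45, 5
total_species = n_diploid_species + n_tetraploid_species

print(diploid_2n, tetraploid_2n, total_species)
```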

The eight diploid genome groups, designated A through G and K, are based on chromosomal pairing affinities [13]. The five tetraploid species are designated (AD)1 through (AD)5 according to their genomic constitutions. Phylogenetic analyses place the diploid *Gossypium* species in two lineages: a lineage of 13 D-genome species and a lineage of 30–32 A-, B-, E-, F-, C-, G- and K-genome species. The polyploid species form a single lineage comprising the 5 AD-genome species (**Figure 1**). Four *Gossypium* species are grown in agriculture: two allotetraploids (*G. hirsutum* and *G. barbadense*) and two diploids (*Gossypium herbaceum* and *G. arboreum*). *G. hirsutum*, or "Upland cotton," is estimated to account for 90% of the world's cotton production. *G. herbaceum* (Levant cotton) and *G. arboreum* (tree cotton) account for about 2%, while *G. barbadense* (known as Egyptian, American Pima, Sea Island or extra-long-staple cotton) provides 8% [13].

*Bioinformatics Tools and Genomic Resources Available in Understanding the Structure… DOI: http://dx.doi.org/10.5772/intechopen.102355*

**Figure 1.** *Phylogeny and evolution of* Gossypium *spp. [14].*

## **3. Genome sequencing of** *Gossypium* **spp**

Advances made possible by genome sequencing are making functional genomics research more efficient. Insect- and herbicide-resistant cotton cultivars have advanced at a breakneck pace during the previous two decades [15]. Progress has been slower, however, in the genetic modification of cotton for plant morphology and flowering, for fiber quality and yield, and for resistance to biotic and abiotic stresses. Cotton genome research has been able to build on the well-characterized whole-genome sequences of Arabidopsis and rice. In setting up its approach to cotton genome sequencing, the cotton genome consortium [6] decided to focus first on the simpler diploid genomes, with the results then applied to tetraploid cotton. Among the D-genome species, *G. raimondii* was prioritized for full sequencing as the first milestone toward a complete cotton genome sequence. Both Paterson [16] and Wang [17] published draft genome sequences of *G. raimondii* in 2012, an obvious first step toward characterizing the bigger "A" diploid and "AD" tetraploid cotton genomes. These 2012 drafts were the starting point for cotton genome sequencing.

It was not long before the same study team published the 1694-Mb genome of *G. arboreum*, which is assumed to be the donor of the A-subgenome chromosomes in tetraploid cotton [18]. Although the genomes of both putative progenitors, *G. raimondii* and *G. arboreum*, have been sequenced, it is still uncertain which species gave rise to the tetraploid cotton species hundreds of thousands of years ago [19]. Furthermore, compared with diploid cotton species, *G. hirsutum* shows significant changes in economic traits and plant structure, which indicates that both natural and artificial selection took place during its development. The allotetraploid cotton species must therefore be sequenced in order to learn more about the plant's evolutionary history and fiber biology. As a preparatory step, Li [20] and Zhang [21] sequenced the allotetraploid *G. hirsutum* genome using the genomes of the A and D progenitor species. Also of note is Sea Island cotton, which is renowned for its great durability and excellent fiber and is well suited to the manufacture of high-quality textiles. Both Liu et al. [22] and Yuan et al. [23] have sequenced the genome of *G. barbadense* and found that it spans 2470 Mb [23].

Experts feel that, because of the way they were constructed, a number of the recently released reference genome sequences for tetraploid and diploid cotton species are flawed. The draft genomes sequenced and assembled for *G. raimondii* [16, 17] and for *G. hirsutum* [19, 20] differ between assemblies in chromosome lengths and in the number of annotated genes. At least on a large scale, it is plausible that such differences are the consequence of assembly errors. In the meantime, a huge amount of genetic study has been done on diverse cotton species. For now, more effort is needed to examine these genome assemblies with a more skeptical eye, so that they can be thoroughly compared and assessed and their misassemblies repaired.

When a reference genome is available, re-sequencing makes it possible to investigate the link between sequence variation and other traits. Recent genome-wide studies of 34 [24], 318 [25], 147 [26] and 352 [27] cotton accessions constitute a considerable collection for finding genomic regions that carry signatures of selection in cotton. Cotton molecular breeding has benefited tremendously from the discoveries made in these studies, which have yielded valuable new genetic resources. Guided by sequencing information, favorable genes associated with high yield, wide adaptability and high fiber grade can be transferred across gene pools to considerably enhance cotton output.

## **4. Genome database/bioinformatic tools available for** *Gossypium* **spp**

Genome sequencing and resequencing are only the first step toward identifying the genetic features that are important for the biological behavior of cotton. As in other model crop plants, cotton genomics has revealed the biologically active states of DNA through high-density genetic maps, fine maps, SNP array platforms, studies of epigenetic modification, and measurements of transcript abundance across species and tissues. Before the complete genome sequences of four *Gossypium* species were published in 2013, genetic research and breeding in the cotton industry were hindered by the limited availability of high-precision genetic maps. Access to fairly large cotton genome linkage maps may enable gene mapping, high-throughput marker development, gene cloning and gene isolation [28, 29]. In the previous 10 years, approximately 1075 QTLs from 58 *G. hirsutum* studies and 1059 QTLs from 9 interspecific *G. hirsutum* × *G. barbadense* populations were reported for yield, fiber quality, seed quality, and biotic and abiotic stress tolerance. For marker-aided selection, these QTLs provide only coarse resolution because they lie in large genomic regions that may contain several genes. A large number of markers to choose from is therefore crucial, so that the genes present at the target locus can be cloned more effectively. Many genes and quantitative trait loci (QTLs) have been mapped in cotton, including the glandless gene [30], leaf shape [31] and fiber-quality-related QTLs [32, 33].

In recent years, improved in silico techniques and next-generation sequencing (NGS) have allowed single nucleotide polymorphisms (SNPs) to be discovered at the whole-genome level in the 2.5-Gb allotetraploid cotton genome. The SNP63K cotton array, created by Ashrafi et al. [34], includes assays for 45,104 putative intraspecific and 17,954 putative interspecific SNP markers [34]. It is a foundational high-throughput genotyping technique and a platform for genetic research on commercially and agronomically relevant traits. Copy number variants (CNVs), which account for a larger proportion of the genome than SNPs, may be useful for detecting phenotypic variation that SNPs do not capture. Many studies have shown that plant genomes are rich in CNVs, which may affect gene regulation, dosage and gene structure [35]. The vast majority of genes affected by CNVs are linked to important traits. In a recent study, researchers found that cotton contains 989 CNV-affected genes that influence plant type, cell wall structure and translational control [26].
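As a quick check on the figures above, the two marker classes reported for the array add up to roughly 63,000 assays, which is consistent with the "63K" in its name. The sketch below only restates that arithmetic; the variable and function names are illustrative.

```python
# Marker counts reported in the text for the SNP63K cotton array [34].
INTRASPECIFIC_SNPS = 45_104
INTERSPECIFIC_SNPS = 17_954


def array_composition(intraspecific: int, interspecific: int) -> dict:
    """Return the total number of assays and the share of each marker class."""
    total = intraspecific + interspecific
    return {
        "total": total,
        "intraspecific_fraction": intraspecific / total,
        "interspecific_fraction": interspecific / total,
    }


composition = array_composition(INTRASPECIFIC_SNPS, INTERSPECIFIC_SNPS)
print(composition["total"])
print(f"{composition['intraspecific_fraction']:.1%} intraspecific")
```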

A decade ago, transcriptome analysis was identified as the most essential tool for determining how sequencing data might be used to gain insight into the activities of individual genes. Whole-genome transcriptome profiling can be achieved with RNA-Seq, which uses high-throughput sequencing to sequence transcripts directly. The recently published transcriptome assembly for the *G. hirsutum* TM-1 inbred line, together with an assembly of all publicly available expressed sequence tags (ESTs), was used as a reference for SNP detection in cotton [34]. Diploid and tetraploid genome sequences, along with next-generation sequencing (NGS) technologies, have also been used in RNA-Seq analyses of large-scale gene expression in the cotton plant. Many plant processes have been studied using transcriptome analysis, including leaf senescence [36], fiber growth [37], biotic stress [38] and abiotic stress [39]. However, the RNA-Seq approach still faces obstacles, such as library construction and the need for efficient techniques for storing and processing vast volumes of data [40]. Once these limitations to the widespread use of RNA-Seq are removed, it is envisaged that this approach will become the primary tool for transcriptome evaluation [41].
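To make the idea of transcript abundance concrete, the sketch below computes transcripts per million (TPM), one commonly used way of normalizing RNA-Seq read counts for transcript length and sequencing depth. TPM is not described in this chapter and is shown here only as an assumption about how such data are typically summarized; the counts and lengths are invented.

```python
def tpm(read_counts, transcript_lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    Each count is first divided by the transcript length (reads per base),
    and the resulting rates are rescaled so that they sum to one million.
    """
    if len(read_counts) != len(transcript_lengths_bp):
        raise ValueError("counts and lengths must have the same length")
    if any(length <= 0 for length in transcript_lengths_bp):
        raise ValueError("transcript lengths must be positive")

    rates = [count / length for count, length in zip(read_counts, transcript_lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        return [0.0] * len(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


# Hypothetical example: the second transcript has twice the reads but is
# also twice as long, so both end up with the same abundance.
example_tpm = tpm([100, 200], [1000, 2000])
print(example_tpm)
```

Because the values are rescaled to a fixed total, TPM values are comparable between transcripts within a sample; comparing across samples usually needs further care.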

In addition to genetic variation, many characteristics of living organisms are influenced by epigenetic modifications. These modifications affect gene expression by altering which genes are expressed, when, and at what level. Among the epigenetic marks, DNA methylation [42] has been shown to play a crucial role in crop plant growth and morphological variety [43]. In cotton, changes in DNA methylation are associated with seasonal fluctuations in fiber production [44] and with differences between tissues [45]. CHH methylation mediated by RNA-directed DNA methylation (RdDM) has been associated with gene activation in ovules, while CHH methylation mediated by chromomethylase2 (CMT2) has been connected to gene repression in fiber development [46].
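Plant cytosine methylation is usually reported in three sequence contexts, CG, CHG and CHH, where H stands for A, C or T. The sketch below shows how a cytosine can be assigned to one of these contexts from the two bases that follow it. It is a simplified illustration that looks at the forward strand only; the function names and the example sequence are hypothetical.

```python
from collections import Counter


def cytosine_context(sequence: str, position: int):
    """Classify the cytosine at `position` as 'CG', 'CHG' or 'CHH'.

    Returns None when the context cannot be determined, for example near
    the end of the sequence or when a downstream base is ambiguous (N).
    """
    sequence = sequence.upper()
    if sequence[position] != "C":
        raise ValueError("the base at the given position is not a cytosine")

    downstream = sequence[position + 1:position + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or any(base not in "ACGT" for base in downstream):
        return None
    # At this point the first downstream base is H (A, C or T).
    return "CHG" if downstream[1] == "G" else "CHH"


def count_contexts(sequence: str) -> Counter:
    """Count forward-strand cytosines in each methylation context."""
    counts = Counter()
    for index, base in enumerate(sequence.upper()):
        if base == "C":
            context = cytosine_context(sequence, index)
            if context is not None:
                counts[context] += 1
    return counts


context_counts = count_contexts("ACGTCAGTCATT")
print(dict(context_counts))
```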

It has been shown that 519 cotton genes are epigenetically modified between wild and domesticated cotton varieties, some of which have been linked to domestication and agronomic traits [47]. This research has given us a better understanding of how epigenetic regulation affects many aspects of cotton's development and its polyploid evolution. To put this knowledge into practice, we need to understand how the methylome has changed during evolution and domestication.

#### **4.1 Functional genomics databases for cotton**

Examining a genome requires simultaneous access to data on map location, genome sequence, allelic variation, mRNA and protein expression, and metabolism. With the growth of omics data sets, it is more important than ever to have functional genomics databases that help users easily access and display genetic data. Available resources include CottonGen (https://www.cottongen.org) [48], Cotton Genome Resource Database (CGRD; http://cgrd.hzau.edu.cn/index.php) [49], Database for Co-expression Networks with Function Modules (ccNET; http://structuralbiology.cau.edu.cn/*Gossypium*/) [50], Joint Genome Institute (JGI; http://jgi.doe.gov) [51], Cotton Genome Database (CottonDB; http://www.cottondb.org) [52], Evolution of Cotton (https://learn.genetics.utah.edu/content/cotton/evolution/) [53], Platform of Functional Genomics Analysis in *Gossypium raimondii* (GraP; http://structuralbiology.cau.edu.cn/GraP/about.html) [54], Cotton Functional Genomic Database (CottonFGD; https://cottonfgd.org) [55] and Cotton Genome Project (CGP; http://cgp.genomics.org.cn/page/species/index.jsp) [56], as well as https://www.cottongen.org/data/markers [57], https://bacpacresources.org/ [58] and https://scienceweb.clemson.edu/cugbf/clemson-genomics-and-bioinformatics-courses/ [59]. CottonFGD provides access to most of the sequenced *Gossypium* genomes and to other plant genomes, along with transcriptome and re-sequencing data. The ccNET database contains 1155 and 1884 functional modules for the diploid *G. arboreum* and for *G. hirsutum*, respectively, organized around co-expression patterns and functional modules.

## **5. Advances in cotton genomics research**

Genome research has been shown to help maintain and improve the genetics of agricultural plants. Consequently, efforts have been made in cotton genetic studies, notably the creation of genetic tools and the establishment of breeding stocks for genetic and genomics research. Available tools include genomic markers such as simple sequence repeats (SSRs) or microsatellites, random amplified polymorphic DNA (RAPD), restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs), resistance gene analogues (RGAs) and sequence-related amplified polymorphisms (SRAPs). Alongside cotton genome sequencing and genetic mapping, genome-wide bacterial artificial chromosome (BAC) libraries and plant-transformation-competent binary BAC (BIBAC) libraries are being developed, and BAC- and BIBAC-based integrated physical maps are being created. Research on cotton's genome lags behind that on soybean, rice and maize, mostly because less funding has been provided for cotton than for these other important crops. The following sections give an overview of recent significant advances in cotton genomics research [14].


#### **5.1 DNA markers and molecular linkage maps**

RFLPs were the first DNA markers to be utilized in cotton genomic research, as they were in most plant species at the time. It is therefore no surprise that the first genetic linkage map of *Gossypium* [60], built from an F2 population of an interspecific cross between *G. hirsutum* and *G. barbadense*, was based on RFLPs. The map comprised 705 loci organized into 41 linkage groups, with a total length of 4675 cM. Rong et al. [61] later produced a more comprehensive map containing 2584 loci at 1.74-cM intervals and covering all 13 homeologous chromosomes of cotton, making it the most comprehensive genetic map for the genus to date. A large number of DNA probes from the map were also mapped in crosses between the D-genome diploid species *G. trilobum* × *G. raimondii* [61] and between the A-genome diploid species *G. arboreum* × *G. herbaceum* [62]. In-depth research on the relationship between the tetraploid AD subgenomes and the diploid A and D genome maps, as well as the cross-species transfer of these insights, produced important results.
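The map statistics quoted above can be compared on a common scale by expressing them as average spacing between loci. The sketch below does this for the two maps, taking the 4675 figure as map length in centimorgans. It is a rough illustration: the helper names are invented, and dividing map length by the number of loci ignores the fact that each linkage group has one fewer interval than loci.

```python
def mean_spacing_cm(map_length_cm: float, n_loci: int) -> float:
    """Approximate average distance between loci, in cM per locus."""
    if n_loci <= 0:
        raise ValueError("the number of loci must be positive")
    return map_length_cm / n_loci


def implied_map_length_cm(n_loci: int, spacing_cm: float) -> float:
    """Approximate total map length implied by a locus count and mean spacing."""
    return n_loci * spacing_cm


# First RFLP map [60]: 705 loci, 41 linkage groups, 4675 cM.
first_map_spacing = mean_spacing_cm(4675, 705)

# Map of Rong et al. [61]: 2584 loci at 1.74-cM intervals.
rong_map_length = implied_map_length_cm(2584, 1.74)

print(f"First map: about {first_map_spacing:.2f} cM per locus")
print(f"Rong et al. map: about {rong_map_length:.0f} cM in total")
```

On these figures the later map is roughly four times denser than the first one while spanning a similar genetic distance.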

RFLP analysis is time-consuming and requires large quantities of DNA, labor-intensive blot hybridization and autoradiography, so RFLPs are now being superseded by DNA marker systems based on the polymerase chain reaction (PCR). The use of PCR-based DNA markers in genetic investigations of cotton has led to the development of a broad range of markers for diverse applications. Techniques such as AFLP, RAPD, SRAP and RGA make it possible to scan a large number of DNA loci in a short period of time, focusing on rapidly evolving DNA elements that are more likely to include loci that vary between genotypes [63]. Using a population obtained from an interspecific cross between Texas Marker-1 (TM-1) and 3–79, Kohel [63] assembled 355 DNA markers into 50 linkage groups covering a total of 4766 cM to construct a genetic map. Brown and Brubaker [64] published an AFLP-based genetic linkage map built from an interspecific *Gossypium nelsonii* × *G. australe* population; this was the first AFLP genetic linkage map for the *Gossypium* G genome. In a *G. australe* hexaploid bridging family, AFLPs could be used to identify chromosome-specific molecular markers unique to the G genome and to monitor the frequency of transmission of *G. australe* chromosomes.

The introduction of SSR, or microsatellite, markers gave cotton a novel class of genetic markers that is easier to use and highly polymorphic. This is especially helpful for Upland cotton, which has a low level of intraspecific polymorphism. SSRs are PCR-based markers that are commonly co-dominant and widely distributed throughout the genome; because they are defined by flanking primer sequences, they are easily exchanged between laboratories and readily transferable across populations [65]. According to http://www.cottonmarker.org [60], about 5484 SSRs have been developed in cotton [66].

#### **5.2 Gene and QTL mapping**

Although molecular linkage maps have significantly advanced our knowledge of the evolution and organization of cotton genomes, a main motive for building them was to locate the genes that affect qualitative and quantitative traits. If DNA markers are linked to genes for critical agronomic traits that are costly or time-consuming to evaluate, selecting desirable progenies in breeding programs by means of the markers is less expensive and more reliable.

#### **5.3 Mapping qualitative traits**

Qualitative traits, or simply Mendelian traits, are heritable characteristics that differ in kind rather than degree. They are generally controlled by a single gene, and the phenotypic variation in a segregating progeny can be divided into discrete classes. Qualitative traits have been described in both diploid (*G. arboreum* and *G. herbaceum*) and tetraploid (mostly *G. hirsutum* and *G. barbadense*) species [1]. Examples include pollen color, leaf shape, lint color, leaf color, pubescence and bract morphology. Many qualitative traits in crops derive from morphological mutants that arose through natural variation among species and interspecific hybrids, or through irradiation or spontaneous mutation. Only a few attempts have been made to map qualitative traits onto the molecular genetic map. A recent publication [67] gave an overview of the qualitative traits that have been mapped with molecular markers. Many of these traits were placed on the map mainly as a tool for linking the linkage groups to the chromosomes assigned on the classical map. Genes associated with cotton quality and productivity include those for leaf shape and development, fiber production (including fiber strength), disease and insect pest resistance, and fertility restoration [67].
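Because a single-gene trait falls into discrete classes, a common first check before mapping is whether the observed classes fit the expected Mendelian ratio, for example 3:1 in an F2 population. The sketch below runs a chi-square goodness-of-fit test on invented counts. The counts and function names are hypothetical, and the test itself is a standard genetics technique rather than a method taken from the studies cited in this chapter.

```python
# Critical chi-square value for 1 degree of freedom at alpha = 0.05.
CHI_SQUARE_CRITICAL_DF1 = 3.841


def chi_square_statistic(observed, expected_ratio):
    """Chi-square goodness-of-fit statistic for observed class counts.

    expected_ratio gives the relative size of each class, e.g. (3, 1).
    """
    if len(observed) != len(expected_ratio):
        raise ValueError("observed counts and ratio must have the same length")
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((obs - exp) ** 2 / exp for obs, exp in zip(observed, expected))


def fits_two_class_ratio(observed, expected_ratio):
    """Return True if a two-class segregation is consistent with the ratio."""
    if len(observed) != 2:
        raise ValueError("this helper assumes exactly two phenotypic classes")
    return chi_square_statistic(observed, expected_ratio) < CHI_SQUARE_CRITICAL_DF1


# Hypothetical F2 population: 152 normal-leaf and 48 mutant-leaf plants.
f2_counts = [152, 48]
chi_square = chi_square_statistic(f2_counts, (3, 1))
consistent = fits_two_class_ratio(f2_counts, (3, 1))

print(f"chi-square = {chi_square:.3f}, consistent with 3:1: {consistent}")
```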

#### **5.4 Mapping quantitative traits**

Quantitative traits are characteristics of individuals that vary in degree rather than kind. They are usually assumed to result from interactions between several loci; they show continuous variation in a segregating population and are readily influenced by the environment. Over the past decade, the number of DNA markers available for cotton genetic mapping has grown, and there has been an explosion of activity in the discovery and detection of quantitative trait loci (QTLs). QTLs found in cotton include those that affect plant architecture, disease resistance, insect resistance and flowering date, to name a few [14].
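One simple way to ask whether a marker is associated with a quantitative trait is single-marker analysis: group the individuals by marker genotype and measure how much of the phenotypic variation lies between the groups. The sketch below computes that proportion (R²) from invented data. It is an illustration of the general idea only, not the procedure used in the QTL studies cited here, which typically rely on interval mapping software.

```python
from collections import defaultdict
from statistics import mean


def variance_explained(genotypes, phenotypes):
    """Proportion of phenotypic variation explained by marker genotype (R^2).

    Computed as the between-genotype sum of squares divided by the total
    sum of squares, as in a one-way analysis of variance.
    """
    if len(genotypes) != len(phenotypes) or not phenotypes:
        raise ValueError("genotypes and phenotypes must be non-empty and equal in length")

    grand_mean = mean(phenotypes)
    ss_total = sum((value - grand_mean) ** 2 for value in phenotypes)
    if ss_total == 0:
        return 0.0

    groups = defaultdict(list)
    for genotype, value in zip(genotypes, phenotypes):
        groups[genotype].append(value)

    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical marker genotypes and fiber strength values for four lines.
marker_genotypes = ["AA", "AA", "BB", "BB"]
fiber_strength = [28.0, 30.0, 32.0, 34.0]

r_squared = variance_explained(marker_genotypes, fiber_strength)
print(f"Variation explained by the marker: {r_squared:.0%}")
```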

#### **5.5 BAC and BIBAC resources**

Large-insert BAC and BIBAC libraries are necessary and sought-after resources for advanced genetics and genomics research, according to a large number of publications [68–70]. Owing to the ease of purifying cloned insert DNA, low levels of chimerism and high stability in the host cell, bacterial artificial chromosomes (BACs) have swiftly established themselves as a significant component of genome research [71, 72]. Their applications in genomics include gene and QTL mapping [73], whole-genome or chromosome physical mapping [74, 75], large genome sequencing [76, 77], isolation and characterization of structural and regulatory genes [78, 79] and cytologically based gene discovery. BAC libraries have been produced for a range of taxa, including plants, animals, insects and bacteria. These libraries are publicly accessible via the following websites: (i) https://bacpacresources.org/ [58] and (ii) https://scienceweb.clemson.edu/cugbf/clemson-genomics-and-bioinformatics-courses/ [59]. To make cotton genome research more efficient, a number of genotypes of *G. hirsutum* (Upland cotton) have been screened and BAC and BIBAC libraries have been developed from them. As of May 1, 2007, at least six such libraries had been constructed and made available to the public. They were built from five Upland cotton genotypes, Auburn 623, Tamcot HQ95, 0-613-2R, TM-1 and Maxxa, in four different BAC vectors and one *Agrobacterium*-mediated, plant-transformation-competent BIBAC vector, using three restriction enzymes. The average insert size of the individual libraries ranges from 93 to 175 kb and their genome coverage from 2.3 to 8.3× genome equivalents; combined, the libraries represent more than 21× the haploid genome of polyploid cotton. Other *Gossypium* species that have been studied include *G. raimondii*, *G. barbadense* (Pima S6), *G. longicalyx* and *G. arboreum* (AKA8401), among others. All of these BAC and BIBAC libraries are valuable additions to the field, providing crucial resources for advanced genetics and genomics research on cotton.
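Genome coverage of a clone library is the number of clones multiplied by the average insert size and divided by the haploid genome size. The sketch below applies that relationship, together with the standard Clarke and Carbon estimate of how many clones are needed to recover a given sequence with a chosen probability. The clone number, the 125-kb insert size and the 2500-Mb genome size are assumptions chosen for illustration (the genome size is based on the 2.5-Gb figure mentioned earlier for allotetraploid cotton); they are not taken from the library descriptions.

```python
import math


def genome_coverage(n_clones: int, insert_kb: float, genome_mb: float) -> float:
    """Genome equivalents represented by a library (clones x insert / genome)."""
    return n_clones * insert_kb / (genome_mb * 1000)


def clones_needed(probability: float, insert_kb: float, genome_mb: float) -> int:
    """Clones needed to contain a given sequence with the given probability.

    Uses the Clarke and Carbon formula N = ln(1 - P) / ln(1 - I / G).
    """
    if not 0 < probability < 1:
        raise ValueError("probability must be between 0 and 1 (exclusive)")
    insert_fraction = insert_kb / (genome_mb * 1000)
    return math.ceil(math.log(1 - probability) / math.log(1 - insert_fraction))


def recovery_probability(coverage: float) -> float:
    """Approximate probability that a sequence is present at a given coverage."""
    return 1 - math.exp(-coverage)


# Assumed values for illustration only.
GENOME_MB = 2500   # approximate allotetraploid cotton genome size
INSERT_KB = 125    # within the 93 to 175 kb range reported in the text

example_coverage = genome_coverage(100_000, INSERT_KB, GENOME_MB)
clones_for_99_percent = clones_needed(0.99, INSERT_KB, GENOME_MB)
combined_probability = recovery_probability(21)  # combined >21x coverage

print(example_coverage, clones_for_99_percent, combined_probability)
```

The last value shows why a combined coverage above 21× is useful: at that depth, almost any single-copy region is expected to be present in at least one clone.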

#### **5.6 Microarray**

Microarrays have been widely used in genomics research in recent years for gene identification, mutation testing, gene expression profiling, expression QTL (eQTL) mapping, high-throughput genetic mapping and comparative genome analysis, among other applications. To make an array, long (70-mer) gene-specific oligonucleotides are printed as array elements on chemically coated glass slides; the slide is then hybridized with one or more fluorescently labeled cDNA or mRNA targets prepared from specific tissues, organs or cells. Researchers can thus save time and money by observing the expression and activity of all the genes represented on the microarray in a single hybridization experiment. To further the progress of cotton genomics research, microarrays built from cotton ESTs have been developed in various laboratories across the world. The first cotton microarrays [80] were produced from 70-mer oligos designed from unigene ESTs of *G. arboreum*. Each microarray contains 12,227 elements, each corresponding to a non-redundant (NR) fiber EST and replicated twice. Using these microarrays, Arpat et al. [80] found a statistically significant difference in gene expression between 10-dpa fibers, at the stage of primary cell wall synthesis or elongation, and 24-dpa fibers, at the stage of secondary cell wall deposition (**Figure 2**). According to the findings, fiber gene expression changes in the transition from primary cell wall biogenesis or elongation to secondary cell wall biogenesis, with 2553 fiber genes possibly down-regulated and 81 strongly up-regulated in this phase.

#### **Figure 2.**

*Cotton fiber development and corresponding morphogenesis stages. The initiation stage is characterized by the enlargement and protrusion of epidermal cells from the ovular surface; during the elongation stage the cells expand in polar directions at a rate of 2 mm/day; during the secondary cell wall deposition stage cellulose is synthesized rapidly until the fibers contain 90% cellulose; and at the maturation stage minerals accumulate in the fibers and the fibers dehydrate [14].*

According to the findings of this research, the expression of fiber genes seems to be stage-specific or cell-expansion-dependent rather than continuous. Most of the genes that were up-regulated in secondary cell wall synthesis compared with primary cell wall biogenesis belonged to three main functional categories: energy and metabolism; cellular organization and biogenesis; and cytoskeleton, which was the most frequently observed. This is consistent with the large amount of cellulose synthesis and cell wall biogenesis taking place at this stage. The fiber gene microarrays were later expanded with about 10,000 gene elements derived from fiber and ovary ESTs of the cultivated tetraploid cotton, *G. hirsutum* [14].
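Comparisons such as the one between 10-dpa and 24-dpa fibers are often summarized as a log2 fold change per gene, with a cutoff used to call genes up- or down-regulated. The sketch below shows that calculation on invented intensities. The twofold threshold, the pseudocount and the function names are assumptions made for illustration and are not the criteria used in the study cited above.

```python
import math


def log2_fold_change(reference: float, treatment: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of treatment to reference signal.

    A small pseudocount is added to both values to avoid division by zero.
    """
    return math.log2((treatment + pseudocount) / (reference + pseudocount))


def classify_regulation(log2_fc: float, threshold: float = 1.0) -> str:
    """Label a gene as 'up', 'down' or 'unchanged' using a log2 threshold.

    A threshold of 1.0 corresponds to a twofold change (an assumption here).
    """
    if log2_fc >= threshold:
        return "up"
    if log2_fc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical signal for one array element at 10 dpa and at 24 dpa.
signal_10dpa, signal_24dpa = 99.0, 399.0
change = log2_fold_change(signal_10dpa, signal_24dpa)
label = classify_regulation(change)

print(f"log2 fold change = {change:.1f} ({label} at 24 dpa)")
```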

## **6. Application of bioinformatics-genomic tools**

One of the main aims of genome research is undoubtedly to apply the genomic resources and tools that have been developed to promote or assist the continued genetic improvement of crops. Advances in genomic resources and tools now make it possible to address many important scientific questions in cotton. Genomic resources and tools can promote or assist cotton genetic improvement in a number of ways. Marker-assisted selection (MAS) is one of the most significant and practical of these applications at present and for the near future. MAS can bring various benefits to a breeding program. For example, DNA markers linked to a gene of interest can be used in the early generations of a breeding cycle to improve the efficiency of selection in subsequent generations.

This approach offers substantial benefits when phenotypic screening is costly or difficult to perform, for instance when multiple recessive genes are involved, when evaluation is constrained by season or location, or when the trait is expressed late [81]. Because most cotton genome research over the last decade has been devoted to developing genomic resources and tools, with cotton genetic improvement as the ultimate goal, cotton breeding programs have only recently begun to use MAS.
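As a rough illustration of the selection step in MAS, the sketch below keeps only the progeny that carry a favorable allele at a marker linked to the gene of interest. The data layout, the marker name `M1`, the allele labels and the function name are all hypothetical, and the sketch assumes the marker is linked tightly enough that its genotype predicts the target gene.

```python
def select_by_marker(progeny, marker, favorable_allele, require_homozygous=True):
    """Return IDs of plants carrying the favorable allele at `marker`.

    `progeny` maps plant IDs to {marker name: (allele, allele)}.
    With require_homozygous=False, heterozygous carriers are kept as well.
    """
    selected = []
    for plant_id, genotypes in progeny.items():
        alleles = genotypes[marker]
        copies = alleles.count(favorable_allele)
        if copies == 2 or (copies == 1 and not require_homozygous):
            selected.append(plant_id)
    return selected

# Hypothetical genotypes at one marker, invented for this example.
progeny = {
    "plant1": {"M1": ("A", "A")},
    "plant2": {"M1": ("A", "B")},
    "plant3": {"M1": ("B", "B")},
}
homozygous = select_by_marker(progeny, "M1", "A")
carriers = select_by_marker(progeny, "M1", "A", require_homozygous=False)
print(homozygous, carriers)
```

Selecting on genotype in this way does not require the trait to be expressed, which is why the approach helps with late-expressed or recessive traits.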

#### **6.1 Fiber quality**

Zhang [82] used the *Gossypium anomalum* introgression line 7235, which has excellent fiber quality attributes, to identify molecular markers linked to fiber strength QTLs. A major QTL, QTLFS1, was detected at the Nanjing and Hainan field sites in China and at the College Station field site in Texas, USA. This QTL is associated with eight markers and accounts for more than 30% of the phenotypic variation in the study population. QTLFS1 was originally thought to lie on chromosome 10, but further study placed it on LGD03 [81]. Guo et al. [83] used a specific marker, SCAR4311920, for large-scale screening of breeding populations for the presence or absence of this major fiber strength QTL [83–85]. This QTL and the DNA markers closely linked to it may be important in developing commercial cultivars with superior fiber length attributes.
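A statement such as "more than 30% of the phenotypic variation" is commonly obtained as the coefficient of determination (R²) from a regression of the phenotype on marker genotype. The sketch below shows that calculation for a single marker; it is an illustration under stated assumptions, not the method reported for QTLFS1, and the genotype scores and fiber strength values are invented.

```python
def variance_explained(genotype_scores, phenotypes):
    """R^2 of a simple linear regression of phenotype on marker genotype.

    `genotype_scores` codes each plant as 0, 1 or 2 copies of one allele.
    Both inputs must vary across plants, otherwise R^2 is undefined.
    """
    n = len(genotype_scores)
    mean_g = sum(genotype_scores) / n
    mean_p = sum(phenotypes) / n
    cov = sum((g - mean_g) * (p - mean_p)
              for g, p in zip(genotype_scores, phenotypes))
    var_g = sum((g - mean_g) ** 2 for g in genotype_scores)
    var_p = sum((p - mean_p) ** 2 for p in phenotypes)
    return cov * cov / (var_g * var_p)

# Hypothetical allele counts and fiber strength values for six plants.
scores = [0, 0, 1, 1, 2, 2]
strength = [28.0, 30.0, 30.0, 32.0, 32.0, 34.0]
r2 = variance_explained(scores, strength)
print(round(r2, 3))
```

Interval mapping methods estimate the same quantity at positions between markers, but the single-marker case is enough to show what the percentage means.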

Wang et al. [86] detected a stable fiber length QTL, qFLD2-1, in the Xiangzamian 2 population by evaluating it in four environments. Because of its high stability, this QTL may be particularly useful for MAS. Using a detailed RFLP map, Chee and coauthors [87] dissected the molecular basis of the genetic variation governing 15 parameters that reflect fiber length in 3662 BC3F2 plants from 24 independently derived BC3 families, with *Gossypium barbadense* as the donor parent. The discovery of many QTLs unique to each trait indicates that, to obtain the largest genetic gain, breeding efforts must target each trait individually. Lacape et al. [88] performed a QTL analysis of 11 fiber characteristics in BC1, BC2, and BC2S1 backcross generations derived from a cross between *G. hirsutum* "Guazuncho 2" and *G. barbadense* "VH8". They found 15, 12, 21, and 16 QTLs for strength, length, color and fineness, respectively, in at least one population, with the number of QTLs varying from population to population.

The data indicated that the vast majority of QTLs had favorable alleles from the *G. barbadense* parent, and that colocalization of QTLs for different traits was more common than isolated positioning of QTLs for single traits. By considering these QTL-rich chromosomal regions, the authors identified 19 regions on 15 different chromosomes as prospective targets for marker-assisted introgression. DNA markers linked to *G. barbadense* QTLs may allow breeders to transfer and maintain favorable traits from exotic sources more effectively during cultivar development.

#### **6.2 Cytoplasmic male sterility**

Cytoplasmic male sterility conditioned by the D8 alloplasm (CMS-D8) in cotton can be restored to fertility both by its specific D8 restorer (D8R) and by the D2 restorer (D2R), which was developed for the D2 cytoplasmic male sterile alloplasm (CMS-D2). Zhang and Stewart [89] found that the two restorer loci are nonallelic but tightly linked, with an average genetic distance of approximately 0.93 cM between them. The D2 restorer gene has been redesignated Rf1, and Rf2 has been assigned to the D8 restorer gene. Identifying molecular markers closely linked to the restorer genes for cytoplasmic male sterility could facilitate the development of parental lines for hybrid cotton.
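To give a sense of what a distance of 0.93 cM means, a map distance can be converted to an expected recombination fraction with a mapping function. The sketch below uses the Kosambi function; this is an assumption, since the text does not say which mapping function was used in [89]. At such short distances the common mapping functions give nearly the same value, so the choice matters little here.

```python
import math

def kosambi_cm_to_r(distance_cm):
    """Recombination fraction for a map distance in cM (Kosambi function)."""
    return 0.5 * math.tanh(distance_cm / 50.0)

def kosambi_r_to_cm(r):
    """Map distance in cM for a recombination fraction r < 0.5 (Kosambi)."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# Distance reported between the two restorer loci.
r = kosambi_cm_to_r(0.93)
print(f"recombination fraction: {r:.5f}")
print(f"back-converted distance: {kosambi_r_to_cm(r):.2f} cM")
```

A recombination fraction below 1% means that fewer than one gamete in a hundred is expected to separate the two loci, which is why they behave as tightly linked.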

Guo et al. [90] found that one of the DNA markers used in their investigation, OPV-15(300), was closely linked to the fertility-restoring gene Rf1. Three RAPD markers linked to the restorer gene were identified and, more importantly, converted into genome-specific sequence-tagged site (STS) markers. Liu et al. [22] located the previously unmapped Rf1 locus on the long arm of chromosome 4. Two RAPD and three SSR markers were found to be closely linked to the Rf1 gene. Because they are restorer-specific, these markers should be useful in MAS for developing restorer parental lines. Later, Yin [91] developed a high-resolution genetic map of Rf1 containing 13 markers within a genetic distance of 0.9 cM. This map was used to delimit the position of Rf1. Using a physical map of the Rf1 locus, the researchers narrowed the likely location of the gene to a minimum of two bacterial artificial chromosome (BAC) clones, designated 081-05 K and 052-01 N, spanning an interval of approximately 100 kb. Work to isolate the Rf1 gene from cotton is in progress.

#### **6.3 Resistance to diseases and insect pests**

Disease resistance is an important consideration in cotton breeding programs. To enable the investigation, manipulation and cloning of genes conferring resistance to diverse pathogens, including fungi, viruses, bacteria and nematodes, researchers identified and characterized the family of NBS-LRR-encoding genes, or resistance gene analogs (RGAs), in the Upland cotton cultivar Auburn 634. Members of the RGA gene family were found on only a small proportion of the AD-genome chromosomes of cotton, and members of a subfamily tend to cluster together on the cotton genetic map, with more RGAs found in subgenome A than in subgenome D. Two RGAs comapped with QTLs for cotton bacterial blight resistance previously identified by Wright et al. [92]. Because the NBS-LRR gene family accounts for approximately 80% of the genes cloned to date (>40 genes) that confer resistance to fungi, viruses, bacteria and nematodes, cotton RGAs from this family are important for the cloning, characterization and manipulation of genes for resistance to a variety of pests and pathogens.

The root-knot nematode (RKN) *Meloidogyne incognita* can significantly reduce cotton yields. Using the resistant *G. hirsutum* cultivar "Acala NemX", Wang et al. [93] found an SSR marker on linkage group A03, CIR316, that is tightly linked to a major RKN resistance gene (rkn1). A parallel study used bulked segregant analysis combined with AFLP to find additional molecular markers associated with rkn1 [94]. One AFLP marker linked to rkn1, GHACC1, was converted into a CAPS marker. These two markers may be useful for MAS. Shen et al. [95] found that RFLP markers on chromosomes 7 and 11 are associated with RKN resistance from the Auburn 634 source, a source of resistant germplasm different from Acala NemX [96].
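A CAPS (cleaved amplified polymorphic sequence) marker distinguishes alleles by whether a restriction enzyme cuts the amplified fragment. The sketch below checks this for two short sequences using the EcoRI recognition site GAATTC, which is cut after the first base. The sequences are invented, and EcoRI is chosen only for illustration; the text does not state which enzyme or amplicon was used for GHACC1.

```python
def digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting `sequence` at each `site`.

    `cut_offset` is the position of the cut within the recognition site
    (1 for EcoRI, which cuts between G and AATTC).
    """
    seq = sequence.upper()
    site = site.upper()
    lengths = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        lengths.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    lengths.append(len(seq) - start)
    return lengths

def is_caps_informative(allele_1, allele_2, site):
    """True if the enzyme cuts exactly one of the two amplicon alleles."""
    return (site.upper() in allele_1.upper()) != (site.upper() in allele_2.upper())

# Hypothetical amplicons differing by one base inside the EcoRI site.
allele_1 = "ATGCGAATTCGGTA"
allele_2 = "ATGCGAGTTCGGTA"
print(digest(allele_1, "GAATTC", 1), digest(allele_2, "GAATTC", 1))
print(is_caps_informative(allele_1, allele_2, "GAATTC"))
```

On a gel, the cut allele would appear as two shorter bands and the uncut allele as one band, which is what makes such a marker easy to score in breeding populations.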

An SSR marker-based analysis further confirmed this association by detecting a minor and a major dominant QTL on chromosomes 7 and 11. Two SSR markers together accounted for 31% of the variation in galling index: BNL 3661 maps to the short arm of chromosome 14, and BNL 1231 maps to the long arm of chromosome 11. Because RKN resistance is associated with two different chromosomes, it is reasonable to conclude that at least two genes are involved in RKN resistance.

Bacterial blight, caused by *Xanthomonas campestris* pv. *malvacearum* (Xcm), is another commercially important disease of cotton. Two studies, by Wright et al. and Rungis et al. [92, 97], examined the chromosomal locations of genes that confer resistance to the bacteria that cause blight. Both used RFLP markers linked to specific chromosomal locations to look for genes that confer resistance to the pathogen. Mapping data indicate an association between the resistance locus and a marker on chromosome 14 linked to the B12 resistance gene, which was originally discovered in African cotton varieties. In addition, AFLP and SSR markers were used to identify novel markers that may be used to introgress the Xcm resistance gene into *G. barbadense* through MAS.

## **7. Conclusion and future prospects**

Although cotton genomics research has lagged behind that of rice, maize, wheat, and soybean, a vast amount of genetic information about cotton has been made available. Using these resources and methodologies, numerous genes and quantitative trait loci (QTLs) associated with cotton fiber quality, fiber yield, and biotic and abiotic stresses have been identified and mapped. Cotton fiber microarrays, with four arrays printed on a single slide, are available for research and development from the laboratory of T. A. Wilkins at Texas Tech University. These resources have also been used to explore the biology of cotton and plant biology in general. However, much more effort is needed to use these tools and methodologies fully in cotton genetic improvement and biological research, and to make them more accessible for applications. Cotton genome research should emphasize the following areas, among others.

Development of whole-genome, BAC/BIBAC-based integrated physical and genetic maps of cottons. To date, no accurate and robust whole-genome BAC/BIBAC-based integrated physical/genetic map has been developed for cotton. The maps should cover at least two *Gossypium* species: Upland cotton, which accounts for about 90% of world cotton output, and *Gossypium raimondii*, which has the smallest genome of all *Gossypium* species and therefore probably the highest gene density. Whole-genome integrated physical and genetic maps have proven to be powerful platforms and "freeways" for modern genetics and genomics research in model and other species, such as the fruit fly, human and mouse [74, 75]. Developing integrated physical maps would also allow all existing genetic maps, mapped genes and QTLs to be integrated rapidly and efficiently with other genomic resources, improving research efficiency and reducing costs.

Fine mapping of QTLs. Although many genes and QTLs related to cotton fiber yield and fiber quality, as well as to biotic and abiotic stresses, have been genetically mapped, two issues must be addressed. First, virtually all QTLs were detected using F2, BC1, or other early generations in only one or a few environments. Because quantitative traits are highly sensitive to environmental variation, results obtained from early generations in one or a few environments may differ from one study to the next. Second, the genetic distances between DNA markers and most QTLs are too large for MAS to be effective. Fine mapping of QTLs requires large, advanced populations such as recombinant inbred lines (RILs) or doubled haploids (DHs) evaluated in multiple environments, closely linked DNA markers, and comprehensive physical maps. Accurate QTL mapping and the development of DNA markers well suited to MAS (that is, tightly linked and user-friendly) are also necessary for the eventual isolation of QTL genes by map-based cloning. Isolated genes are the best candidates for developing MAS markers, because there is no recombination between a gene and a marker derived from it.

Sequencing of one or more key cotton genomes. Whole-genome sequencing is the most effective way to identify and decode all cotton genes, despite its high cost with current sequencing technology. It also yields the most desirable and detailed integrated physical and genetic map of the cotton genome. Although genome sizes vary widely among *Gossypium* species, comparative genomics studies show that gene content and gene order are highly conserved across these species [60, 61]. *Gossypium raimondii* has the smallest genome of all *Gossypium* species, although it is not cultivated, which makes it an excellent candidate for genome sequencing. Sequence data from *G. raimondii* could then be transferred to the most important cultivated cotton, *G. hirsutum*: if a physical map of this larger genome is available, the BAC end sequences of the integrated physical map can serve as anchors.
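The concern that marker-to-QTL distances are too large for MAS can be made concrete with a small calculation. The sketch below estimates, per gamete, how often selection on a marker would pick the wrong QTL allele, first for a single marker and then for two markers flanking the QTL. It assumes the Haldane mapping function (no crossover interference) and known map distances; the function names and the 10 cM example are illustrative assumptions, not values from the text.

```python
import math

def haldane_r(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def single_marker_error(distance_cm):
    """Chance that a gamete selected on one marker carries the wrong QTL allele."""
    return haldane_r(distance_cm)

def flanking_error(left_cm, right_cm):
    """Chance of the wrong QTL allele when both flanking markers are parental.

    This requires a crossover on each side of the QTL (a double crossover).
    """
    r1 = haldane_r(left_cm)
    r2 = haldane_r(right_cm)
    both_parental = (1.0 - r1) * (1.0 - r2) + r1 * r2
    return r1 * r2 / both_parental

for d in (1.0, 5.0, 10.0):
    print(f"{d:>4} cM: single {single_marker_error(d):.4f}, "
          f"flanking {flanking_error(d, d):.4f}")
```

Under these assumptions the error rate with a single marker grows almost linearly with distance, whereas tightly linked or flanking markers keep it low, which is the practical argument for fine mapping before MAS.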

Generation of ESTs from fibers at the secondary cell wall deposition stage and from nonfiber and nonovary tissues. Cotton ESTs are now more plentiful than ever before, but their distribution across tissue types is still rather uneven. There are relatively few ESTs from nonfiber and nonovary tissues, or from fibers after 20 dpa, when secondary cell wall deposition takes place; this is especially true of the 15–45-dpa stage. Although genes expressed in nonfiber and nonovary tissues may not contribute directly to fiber yield and quality, genes expressed in fibers during secondary cell wall deposition have a major influence on fiber yield and quality, including fiber strength.

Profiling and identification of genes involved in particular biological processes, with emphasis on genes involved in fiber development. The creation and widespread availability of microarrays based on cDNA or unigene ESTs has made many advances in molecular biology possible. Unfortunately, cotton research has made little progress in these areas. Identifying and characterizing the genes involved in fiber development, in plant growth and development, and in the responses of cotton plants to biotic and abiotic stresses would considerably enhance the ability of cotton breeders to improve cotton genetically.

Cotton breeders would benefit greatly from the ability to translate changes in gene activity or expression in different tissues and developmental stages into changes in fiber quality and yield. Cotton genotypes have been used to discover genes involved in fiber initiation [98, 99], elongation [80, 98] and secondary cell wall deposition [80], but it is not clear what the up-regulation or down-regulation of fiber gene expression in particular developmental stages and organs means for final fiber yield and quality. Does active expression of a gene during the elongation stage imply longer fibers? More research is needed on using gene expression data in cotton germplasm analysis and breeding programs.

## **Acknowledgements**

The authors are grateful to the Department of Genetics & Biotechnology, Osmania University, and Averinbiotech for their support.

## **Conflict of interest**

The authors declare no conflict of interest.

## **Acronyms and abbreviations**



## **Author details**

Gugulothu Baloji\*, Lali Lingfa and Shivaji Banoth

Department of Genetics and Biotechnology, Osmania University, Hyderabad, Telangana, India

\*Address all correspondence to: baloji2020@gmail.com

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*Bioinformatics Tools and Genomic Resources Available in Understanding the Structure… DOI: http://dx.doi.org/10.5772/intechopen.102355*

## **References**

[1] Shaheen T, Tabbasam N, Iqbal MA, Ashraf A, Zafar Y, Paterson AH. Cotton genetic resources. A review. Agronomy for Sustainable Development. 2012;**32**:419-432. DOI: 10.1007/s13593-011-0051-z

[2] Qin YM, Zhu YX. How cotton fibers elongate: A tale of linear cell growth mode. Current Opinion in Plant Biology. 2011;**14**:106-111. DOI: 10.1016/j.pbi.2010.09.010

[3] Shan CM, Shangguan XX, Zhao B, Zhang XF, Chao LM, Yang CQ, et al. Control of cotton fibre elongation by a homeodomain transcription factor GhHOX3. Nature Communications. 2014;**5**:5519. DOI: 10.1038/ncomms6519

[4] Chen ZJ, Scheffler BE, Dennis E, Triplett BA, Zhang T, Guo W, et al. Toward sequencing cotton (Gossypium) genomes. Plant Physiology. 2007;**145**:1303-1310. DOI: 10.1104/pp.107.107672

[5] Rahman M, Zafar Y, Paterson AH. Gossypium DNA markers types, number and uses. In: Paterson AH, editor. Genomics of Cotton. Berlin: Springer; 2009. pp. 101-139. DOI: 10.1007/978-0-387-70810-2\_5

[6] Chen H, Qian N, Guo WZ, Song QP, Li BC, Deng FJ, et al. Using three overlapped RILs to dissect genetically clustered QTL for fiber strength on chro.24 in Upland cotton. Theoretical and Applied Genetics. 2009;**119**:605-612. DOI: 10.1007/s00122-009-1070-x

[7] Rahman M, Yasmin T, Tabassum N, Ullah I, Asif M, Zafar Y. Studying the extent of genetic diversity among Gossypium arboreum L. genotypes/cultivars using DNA fingerprinting. Genetic Resources and Crop Evolution. 2008;**55**:331-339. DOI: 10.1007/s10722-007-9238-1

[8] Helms AB. Yield study report. In: Duggar P, Richder D, editors. Proceedings of Beltwide Cotton Production Conference. San Antonio TX: National Cotton Council; 2000. pp. 4-9

[9] Abelson PH. A third technological revolution. Science. 1998;**279**(5359):2019. DOI: 10.1126/science.279.5359.2019a

[10] Fryxell PA, Craven LA, Stewart JM. A revision of Gossypium sect. grandicalyx (malvaceae), including the description of six new species. Systematic Botany. 1992;**17**:91-114. DOI: 10.2307/2419068

[11] Rong J, Abbey C, Bowers JE, Brubaker CL, Chang C, Chee PW, et al. A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (Gossypium). Genetics. 2004;**166**:389-417. DOI: 10.1534/genetics.166.1.389

[12] Brubaker CL, Paterson AH, Wendel JF. Comparative genetic mapping of allotetraploid cotton and its diploid progenitors. Genome. 1999;**42**:184-203. DOI: 10.1139/g98-118

[13] Endrizzi JE, Turcotte EL, Kohel RJ. Qualitative genetics, cytology, and cytogenetics. In: Kohel RJ, Lewis CF, editors. Cotton. Madison, Wisconsin: American Society of Agronomy; 1984. pp. 81-129. DOI: 10.2134/agronmonogr24.c4

[14] Zhang HB, Li Y, Wang B, Chee PW. Recent advances in cotton genomics. International Journal of Plant Genomics. 2008;**742304**:14. DOI: 10.1155/2008/742304

[15] Yu LH, Wu SJ, Peng YS, Liu RN, Chen X, Zhao P, et al. Arabidopsis EDT1/HDG11 improves drought and salt tolerance in cotton and poplar and increases cotton yield in the field. Plant Biotechnology Journal. 2016;**14**:72-84. DOI: 10.1111/pbi.12358

[16] Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin D, et al. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature. 2012;**492**:423-427. DOI: 10.1038/nature11798

[17] Wang K, Wang Z, Li F, Ye W, Wang J, Song G, et al. The draft genome of a diploid cotton Gossypium raimondii. Nature Genetics. 2012;**44**:1098-1103. DOI: 10.1038/ng.2371

[18] Li F, Fan G, Wang K, Sun F, Yuan Y, Song G, et al. Genome sequence of the cultivated cotton Gossypium arboreum. Nature Genetics. 2014;**46**:567-572. DOI: 10.1038/ng.2987

[19] Wendel JF. New world tetraploid cottons contain old world cytoplasm. Proceedings of the National Academy of Sciences. 1989;**86**:4132-4136. DOI: 10.1073/pnas.86.11.4132

[20] Li F, Fan G, Lu C, Xiao G, Zou C, Kohel RJ, et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nature Biotechnology. 2015;**33**:524-530. DOI: 10.1038/nbt.3208

[21] Zhang T, Hu Y, Jiang W, Fang L, Guan X, Chen J, et al. Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nature Biotechnology. 2015;**33**:531-537. DOI: 10.1038/nbt.3207

[22] Liu L, Guo W, Zhu X, Zhang T. Inheritance and fine mapping of fertility restoration for cytoplasmic male sterility in Gossypium hirsutum L. Theoretical and Applied Genetics. 2003;**106**(3):461-469. DOI: 10.1007/s00122-002-1084-0

[23] Yuan D, Tang Z, Wang M, Gao W, Tu L, Jin X, et al. The genome sequence of Sea-Island cotton (Gossypium barbadense) provides insights into the allopolyploidization and development of superior spinnable fibres. Scientific Reports. 2015;**5**(1):1-16. DOI: 10.1038/srep17662

[24] Page JT, Liechty ZS, Alexander RH, Clemons K, Hulse-Kemp AM, Ashrafi H, et al. DNA sequence evolution and rare homoeologous conversion in tetraploid cotton. PLoS Genetics. 2016;**12**:e1006012. DOI: 10.1371/journal.pgen.1006012

[25] Fang L, Wang Q, Hu Y, Jia Y, Chen J, Liu B, et al. Genomic analyses in cotton identify signatures of selection and loci associated with fiber quality and yield traits. Nature Genetics. 2017;**49**:1089-1098. DOI: 10.1038/ng.3887

[26] Fang L, Gong H, Hu Y, Liu C, Zhou B, Huang T, et al. Genomic insights into divergence and dual domestication of cultivated allotetraploid cottons. Genome Biology. 2017;**18**:33. DOI: 10.1186/s13059-017-1167-5

[27] Wang M, Tu L, Lin M, Lin Z, Wang P, Yang Q, et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nature Genetics. 2017;**49**:579-587. DOI: 10.1038/ng.3807

[28] Li X, Jin X, Wang H, Zhang X, Lin Z. Structure, evolution, and comparative genomics of tetraploid cotton based on a high-density genetic linkage map. DNA Research. 2016;**23**:283-293. DOI: 10.1093/dnares/dsw016

[29] Wang S, Chen J, Zhang W, Hu Y, Chang L, Fang L, et al. Sequence-based ultra-dense genetic and physical maps reveal structural variations of allopolyploid cotton genomes. Genome Biology. 2015;**16**:108. DOI: 10.1186/s13059-015-0678-1


[30] Cheng H, Lu C, Yu JZ, Zou C, Zhang Y, Wang Q, et al. Fine mapping and candidate gene analysis of the dominant glandless gene Gl2e in cotton (Gossypium spp.). Theoretical and Applied Genetics. 2016;**129**:1347-1355. DOI: 10.1007/s00122-016-2707-1

[31] Andres RJ, Bowman DT, Kaur B, Kuraparthy V. Mapping and genomic targeting of the major leaf shape gene (L) in Upland cotton (Gossypium hirsutum L.). Theoretical and Applied Genetics. 2014;**127**:167-177. DOI: 10.1007/s00122-013-2208-4

[32] Fang X, Liu X, Wang X, Wang W, Liu D, Zhang J, et al. Fine-mapping qFS07.1 controlling fiber strength in upland cotton (Gossypium hirsutum L.). Theoretical and Applied Genetics. 2017;**130**:795-806. DOI: 10.1007/s00122-017-2852-1

[33] Xu P, Gao J, Cao Z, Chee PW, Guo Q, Xu Z, et al. Fine mapping and candidate gene analysis of qFL-chr1, a fiber length QTL in cotton. Theoretical and Applied Genetics. 2017;**130**:1309-1319. DOI: 10.1007/s00122-017-2890-8

[34] Ashrafi H, Hulse-Kemp AM, Wang F, Yang SS, Guan X, Jones DC, et al. A long-read transcriptome assembly of cotton (Gossypium hirsutum L.) and intraspecific single nucleotide polymorphism discovery. The Plant Genome. 2015;**8**:1-14. DOI: 10.3835/plantgenome2014.10.0068

[35] Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, et al. 1000 genomes project. Mapping copy number variation by population scale genome sequencing. Nature. 2011;**470**:59-65. DOI: 10.1038/nature09708

[36] Lin M, Pang C, Fan S, Song M, Wei H, Yu S. Global analysis of the Gossypium hirsutum L. transcriptome during leaf senescence by RNA-Seq. BMC Plant Biology. 2015;**15**:43. DOI: 10.1186/s12870-015-0433-5

[37] Islam MS, Thyssen GN, Jenkins JN, Zeng L, Delhom CD, McCarty JC, et al. A MAGIC population-based genome-wide association study reveals functional association of GhRBB1\_A07 gene with superior fiber quality in cotton. BMC Genomics. 2016;**17**:903. DOI: 10.1186/s12864-016-3249-2

[38] Artico S, Ribeiro-Alves M, Oliveira-Neto OB, de Macedo LL, Silveira S, Grossi-de-Sa MF, et al. Transcriptome analysis of Gossypium hirsutum flower buds infested by cotton boll weevil (Anthonomus grandis) larvae. BMC Genomics. 2014;**15**:854. DOI: 10.1186/1471-2164-15-854

[39] Bowman MJ, Park W, Bauer PJ, Udall JA, Page JT, Raney J, et al. RNA-Seq transcriptome profiling of upland cotton (Gossypium hirsutum L.) root tissue under water-deficit stress. PLoS ONE. 2013;**8**(12):e82634. DOI: 10.1371/journal.pone.0082634

[40] Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;**10**:57-63. DOI: 10.1038/nrg2484

[41] Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One. 2014;**9**:e78644. DOI: 10.1371/journal.pone.0078644

[42] Phillips T. The role of methylation in gene expression. Nature Education. 2008;**1**:116

[43] Cubas P, Vincent C, Coen E. An epigenetic mutation responsible for natural variation in floral symmetry. Nature. 1999;**401**:157-161. DOI: 10.1038/43657

[44] Jin X, Pang Y, Jia F, Xiao G, Li Q, Zhu Y. A potential role for CHH DNA methylation in cotton fiber growth patterns. PLoS One. 2013;**8**:e60547. DOI: 10.1371/journal.pone.0060547

[45] Osabe K, Clement JD, Bedon F, Pettolino FA, Ziolkowski L, Llewellyn DJ, et al. Genetic and DNA methylation changes in cotton (Gossypium) genotypes and tissues. PLoS One. 2014;**9**:e86049. DOI: 10.1371/journal.pone.0086049

[46] Song Q, Guan X, Chen ZJ. Dynamic roles for small RNAs and DNA methylation during ovule and fiber development in allotetraploid cotton. PLoS Genetics. 2015;**11**:e1005724. DOI: 10.1371/journal.pgen.1005724

[47] Song Q, Zhang T, Stelly DM, Chen ZJ. Epigenomic and functional analyses reveal roles of epialleles in the loss of photoperiod sensitivity during domestication of allotetraploid cottons. Genome Biology. 2017;**18**(1):99. DOI: 10.1186/s13059-017-1229-8

[48] CottonGen. Available from: https://www.cottongen.org [Accessed: December 4, 2021]

[49] Cotton Genome Resource Database. CGRD. Available from: http://cgrd.hzau.edu.cn/index.php [Accessed: December 4, 2021]

[50] Database for Co-expression Networks with Function Modules. Available from: http://structuralbiology.cau.edu.cn/Gossypium/ [Accessed: December 4, 2021]

[51] Joint Genome Institute. Available from: http://jgi.doe.gov [Accessed: December 4, 2021]

[52] Cotton Genome Database. Available from: http://www.cottondb.org [Accessed: December 4, 2021]

[53] Evolution of Cotton. Available from: https://learn.genetics.utah.edu/content/cotton/evolution/ [Accessed: December 4, 2021]

[54] Platform of Functional Genomics Analysis in Gossypium raimondii. Available from: http://structuralbiology.cau.edu.cn/GraP/about.html [Accessed: December 4, 2021]

[55] Cotton Functional Genomic Database. Available from: https://cottonfgd.org [Accessed: December 4, 2021]

[56] Cotton Genome Project. Available from: http://cgp.genomics.org.cn/page/ species/index.jsp [Accessed: December 4, 2021]

[57] Abd El-Moghny AM, Santosh HB, Raghavendra KP, Sheeba JA, Singh SB, Kranthi KR. Microsatellite marker based genetic diversity analysis among cotton (Gossypium hirsutum) accessions differing for their response to drought stress. Journal of Plant Biochemistry and Biotechnology. 2017;**26**(3):366-370

[58] Zeng C, Kouprina N, Zhu B, Cairo A, Hoek M, Cross G, et al. Large-insert BAC/YAC libraries for selective re-isolation of genomic regions by homologous recombination in yeast. Genomics. 2001;**77**(1-2):27-34. DOI: 10.1006/geno.2001.6616

[59] Decker CJ, Steiner HR, Hoon-Hanks LL, Morrison JH, Haist KC, Stabell AC, et al. dsRNA-Seq: Identification of viral infection by purifying and sequencing dsRNA. Viruses. 2019;**11**(10):943

[60] Reinisch AJ, Dong JM, Brubaker CL, Stelly DM, Wendel JF, Paterson AH. A detailed RFLP map of cotton, Gossypium hirsutum x Gossypium barbadense: Chromosome organization and evolution in a disomic polyploid genome. Genetics. 1994;**138**(3):829-847. DOI: 10.1093/genetics/138.3.829

[61] Rong J, Abbey C, Bowers JE, Brubaker CL, Chang C, Chee PW, et al. A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (Gossypium). Genetics. 2004;**166**:389-417. DOI: 10.1534/genetics.166.1.389

[62] Desai A, Chee PW, Rong J, May OL, Paterson AH. Chromosome structural changes in diploid and tetraploid A genomes of Gossypium. Genome. 2006;**49**(4):336-345. DOI: 10.1139/g05-116

[63] Kohel RJ, Yu J, Park YH, Lazo GR. Molecular mapping and characterization of traits controlling fiber quality in cotton. Euphytica. 2001;**121**(2):163-172. DOI: 10.1023/A:1012263413418

[64] Brubaker CL, Brown AHD. The use of multiple alien chromosome addition aneuploids facilitates genetic linkage mapping of the Gossypium G genome. Genome. 2003;**46**:774-791. DOI: 10.1139/ g03-063

[65] SaghaiMaroof MA, Biyashev RM, Yang GP, Zhang Q, Allard RW. Extraordinarily polymorphic microsatellite DNA in barley: Species diversity, chromosomal locations, and population dynamics. Proceedings of the National Academy of Sciences of the United States of America. 1994;**91**(12):5466-5470. DOI: 10.1073/ pnas.91.12.5466

[66] Blenda A, Scheffler J, Scheffler B, Palmer M, Lacape JM, Yu JZ, et al. CMD: A cotton microsatellite database resource for Gossypium genomics. BMC Genomics. 2006;**7**:132. DOI: 10.1186/1471-2164-7-132

[67] Ulloa M, Brubaker C, Chee P. Cotton. In: Kole C, editor. Technical Crops. Genome Mapping and Molecular Breeding in Plants. Vol. 6. Berlin, Heidelberg: Springer; 2000. DOI: 10.1007/978-3-540-34538-1\_1

[68] Lichtenzveig J, Scheuring C, Dodge J, Abbo S, Zhang H-B. Construction of BAC and BIBAC libraries and their applications for generation of SSR markers for genome analysis of chickpea, Cicer arietinum L. Theoretical and Applied Genetics. 2005;**110**(3):492-510

[69] He L, Du C, Li Y, Scheuring C, Zhang HB. Large insert bacterial clone libraries and their applications. In: Liu Z, editor. Aquaculture Genome Technologies. Ames, Iowa, USA: Blackwell; 2007. pp. 215-244

[70] Ren C, Xu ZY, Sun S, Lee MK, Wu C, Scheuring C, Zhang HB. Genomic DNA libraries and physical mapping. In: Meksem K, Kahl G, editors. The Handbook of Plant Genome Mapping: Genetic and Physical Mapping. Weinheim, Germany: Wiley-VCH Verlag GmbH; 2005. pp. 173-213. DOI: 10.1002/3527603514.ch8

[71] Ioannou PA, Amemiya CT, Garnes J, Kroisel PM, Shizuya H, Chen C, et al. A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nature Genetics. 1994;**6**(1):84-89. DOI: 10.1038/ng0194-84

[72] Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proceedings of the National Academy of Sciences of the United States of America. 1992;**89**(18):8794-8797. DOI: 10.1073/pnas.89.18.8794

[73] Zhang H-B. Map-based cloning of genes and QTLs. In: Kole C, Abbott A, editors. Plant Molecular Mapping and Breeding. New York, NY, USA: Springer; 2007

[74] Wu C, Sun S, Lee MK, Xu ZY, Ren C, Zhang HB. Whole genome physical mapping: An overview on methods for DNA fingerprinting. In: Meksem K, Kahl G, editors. The Handbook of Plant Genome Mapping: Genetic and Physical Mapping. Weinheim, Germany: Wiley-VCH Verlag GmbH; 2007. pp. 257-283. DOI: 10.1002/3527603514.ch11

[75] Zhang HB, Wing RA. Physical mapping of the rice genome with BACs. Plant Molecular Biology. 1997;**35**(1-2):115-127

[76] International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005;**436**(7052):793-800. DOI: 10.1038/nature03895

[77] Tyler BM, Tripathy S, Zhang X, Dehal P, Jiang RH, Aerts A, et al. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science. 2006;**313**(5791):1261-1266. DOI: 10.1126/science.1128796

[78] Chen M, SanMiguel P, de Oliveira AC, Woo SS, Zhang H, Wing RA, et al. Microcolinearity in sh2-homologous regions of the maize, rice, and sorghum genomes. Proceedings of the National Academy of Sciences of the United States of America. 1997;**94**(7):3431-3435. DOI: 10.1073/pnas.94.7.3431

[79] Patocchi A, Vinatzer BA, Gianfranceschi L, Tartarini S, Zhang HB, Sansavini S, et al. Construction of a 550 kb BAC contig spanning the genomic region containing the apple scab resistance gene Vf. Molecular & General Genetics. 1999;**262**(4-5):884-891. DOI: 10.1007/s004380051154

[80] Arpat AB, Waugh M, Sullivan JP, Gonzales M, Frisch D, Main D, et al. Functional genomics of cell elongation in developing cotton fibers. Plant Molecular Biology. 2004;**54**(6):911-929. DOI: 10.1007/s11103-004-0392-y

[81] Dreher K, Morris M, Khairallah M, Ribaut JM, Pandey S, Srinivasan G. Is marker-assisted selection cost-effective compared to conventional plant breeding methods? The case of quality protein maize. In: Evenson RE, Santaniello V, Zilberman D, editors. Economic and Social Issues in Agricultural Biotechnology. Wallingford, UK: CABI Publishing; 2002. pp. 203-236

[82] Zhang T, Yuan Y, Yu J, Guo W, Kohel RJ. Molecular tagging of a major QTL for fiber strength in Upland cotton and its marker-assisted selection. Theoretical and Applied Genetics. 2003;**106**(2):262-268. DOI: 10.1007/s00122-002-1101-3

[83] Guo W, Zhang T, Shen X, Yu JZ, Kohel RJ. Development of SCAR marker linked to a major QTL for high fiber strength and its usage in molecular-marker assisted selection in upland cotton. Crop Science. 2003;**43**(6):2252-2256. DOI: 10.2135/cropsci2003.2252

[84] Shen X, Guo W, Zhu X, Yuan Y, Yu J, Kohel R, et al. Molecular mapping of QTLs for fiber qualities in three diverse lines in Upland cotton using SSR markers. Molecular Breeding: New Strategies in Plant Improvement. 2005;**15**(2):169-181

[85] He DH, Lin ZX, Zhang XL, Nie YC, Guo XP, Feng CD, et al. Mapping QTLs of traits contributing to yield and analysis of genetic effects in tetraploid cotton. Euphytica. 2005;**144**:141-149. DOI: 10.1007/s10681-005-5297-6

[86] Wang BH, Wu YT, Huang NT, Zhu XF, Guo WZ, Zhang TZ. QTL mapping for plant architecture traits in upland cotton using RILs and SSR markers. Yi Chuan Xue Bao. 2006;**33**(2):161-170. DOI: 10.1016/S0379-4172(06)60035-8

[87] Chee PW, Draye X, Jiang CX, Decanini L, Delmonte TA, Bredhauer R, et al. Molecular dissection of phenotypic variation between Gossypium hirsutum and Gossypium barbadense (cotton) by a backcross-self approach: III. Fiber length. Theoretical and Applied Genetics. 2005;**111**(4):772-781. DOI: 10.1007/s00122-005-2062-0

[88] Lacape JM, Llewellyn D, Jacobs J, Arioli T, Becker D, Calhoun S, et al. Meta-analysis of cotton fiber quality QTLs across diverse environments in a Gossypium hirsutum x G. barbadense RIL population. BMC Plant Biology. 2010;**10**:132. DOI: 10.1186/1471-2229-10-132

[89] Zhang JF, Stewart MDJ. Inheritance and genetic relationships of the D8 and D2-2 restorer genes for cotton cytoplasmic male sterility. Crop Science. 2001;**41**(2):289-294. DOI: 10.2135/cropsci2001.412289x

[90] Guo W, Zhang T, Pan J, Kohel RJ. Identification of RAPD marker linked with fertility-restoring gene of cytoplasmic male sterile lines in upland cotton. Chinese Science Bulletin. 1998;**43**(1):52-54

[91] Yin J, Guo W, Yang L, Liu L, Zhang T. Physical mapping of the Rf1 fertility-restoring gene to a 100 kb region in cotton. Theoretical and Applied Genetics. 2006;**112**(7):1318-1325. DOI: 10.1007/s00122-006-0234-1

[92] Wright RJ, Thaxton PM, El-Zik KM, Paterson AH. D-subgenome bias of Xcm resistance genes in tetraploid Gossypium (cotton) suggests that polyploid formation has created novel avenues for evolution. Genetics. 1998;**149**(4):1987-1996. DOI: 10.1093/genetics/149.4.1987

[93] Wang C, Ulloa M, Roberts PA. Identification and mapping of microsatellite markers linked to a root-knot nematode resistance gene (rkn1) in Acala NemX cotton (Gossypium hirsutum L.). Theoretical and Applied Genetics. 2006;**112**(4):770-777. DOI: 10.1007/s00122-005-0183-0

[94] Wang C, Roberts PA. Development of AFLP and derived CAPS markers for root-knot nematode resistance in cotton. Euphytica. 2006;**152**(2):185-196. DOI: 10.1007/s10681-006-9197-1

[95] Shen X, Van Becelaere G, Kumar P, Davis RF, May OL, Chee P. QTL mapping for resistance to root-knot nematodes in the M-120 RNR Upland cotton line (Gossypium hirsutum L.) of the Auburn 623 RNR source. Theoretical and Applied Genetics. 2006;**113**(8):1539-1549. DOI: 10.1007/s00122-006-0401-4

[96] Shen X, Zhang T, Guo W, Zhu X, Zhang X. Mapping fiber and yield QTLs with main, epistatic, and QTL x environment interaction effects in recombinant inbred lines of Upland cotton. Crop Science. 2006;**46**(1):61-66. DOI: 10.2135/cropsci2005.0056

[97] Rungis D, Llewellyn D, Dennis ES, Lyon BR. Investigation of the chromosomal location of the bacterial blight resistance gene present in an Australian cotton (Gossypium hirsutum L.) cultivar. Crop & Pasture Science. 2002;**53**(5):551-560. DOI: 10.1071/AR01121

[98] Shi YH, Zhu SW, Mao XZ, Feng JX, Qin YM, Zhang L, et al. Transcriptome profiling, molecular biological, and physiological studies reveal a major role for ethylene in cotton fiber cell elongation. The Plant Cell. 2006;**18**(3):651-664. DOI: 10.1105/tpc.105.040303

[99] Wu Y, Machado AC, White RG, Llewellyn DJ, Dennis ES. Expression profiling identifies genes expressed early during lint fibre initiation in cotton. Plant & Cell Physiology. 2006;**47**(1):107-127. DOI: 10.1093/pcp/pci228

## **Chapter 4**

## Transcriptome Analysis Using RNA Sequencing for Finding Genes Related to Fiber in Cotton: A Review

*Shalini P. Etukuri, Varsha C. Anche, Mirzakamol S. Ayubov, Lloyd T. Walker and Venkateswara R. Sripathi*

## **Abstract**

The cotton crop is economically important and primarily grown for its fiber. Although the genus *Gossypium* consists of over 50 species, only four domesticated species produce spinnable fiber. Genes determine the molecular phenotype of fiber, and variation in their expression primarily contributes to associated phenotypic changes. Transcriptome analyses can elucidate the similarity or variation in gene expression (GE) among organisms at a given time or under a given circumstance. Even though several algorithms are available for analyzing the high-throughput data generated from RNA Sequencing (RNA-Seq), a reliable pipeline is needed that combines tools such as an aligner for read mapping, an assembler for quantitating full-length transcripts, a differential gene expression (DGE) package for identifying differences in the transcripts across the samples, a gene ontology tool for assigning function, and enrichment and pathway mapping tools for finding interrelationships between genes based on their associated functions. Therefore, this chapter first introduces the cotton crop, fiber phenotype, and transcriptome, then discusses the basic RNA-Seq pipeline, and later emphasizes various transcriptome analysis studies focused on genes associated with fiber quality and its attributes.

**Keywords:** *Gossypium*, species, gene, expression, sequencing, fiber, transcriptome, RNA-Seq

## **1. Introduction**

Cotton is a globally and economically important fiber, oil, and protein source. It belongs to the family Malvaceae and genus *Gossypium* with more than 50 species, but only four domesticated species produce spinnable fiber. Of these, *G. hirsutum* and *G. barbadense* are allopolyploids that originated in the Americas (New World). The remaining two species are diploids (*G. herbaceum* and *G. arboreum*) from Africa and/or Asia [1]. Genus *Gossypium* consists of one tetraploid (AD) and eight diploid genome groups (A–G and K). It is believed that the allotetraploid upland cotton, *G. hirsutum* (AtAtDtDt; ~2.5 Gb), most likely evolved from a diploid A-genome ancestor, *G. arboreum* (A2A2; ~1.8 Gb) or *G. herbaceum* (A1A1; ~1.6 Gb), and a diploid D-genome progenitor, *G. raimondii* (D5D5; ~900 Mb) [1–3]. According to the United States Department of Agriculture (USDA) Foreign Agricultural Service (FAS), the acreage, yield, and production projections for the year 2021–2022 across the world and the United States are ~32 and ~4 million hectares, ~810 and ~950 kilograms per hectare, and ~120 and ~18 million 480-lb bales, respectively [4]. The United Nations (UN) has projected that the global population will surpass 10 billion by 2050 [5]. Therefore, to meet the clothing needs of a constantly expanding population, yield or production must be increased by developing or employing better cotton crop improvement strategies.

The percent composition of cotton fibers includes cellulose (94%), waxes (0.6%), pectin (0.9%), proteins (1.3%), minerals (1.2%), organic acids (0.8%), sugars (0.3%), and miscellaneous (0.9%) substances [6–8]. Cellulose is a homopolymer containing repeated units of β-(1→4)-D-anhydroglucopyranose. Polymerization and crystallinity of these units impart strength to the fiber [6–8]. Cotton fibers are structurally composed of five layers: (i) cuticle, (ii) primary wall, (iii) winding (transition) layer, (iv) secondary wall, and (v) lumen. The outermost layer of the cotton fiber is the cuticle, and it is composed of waxes (cutin and suberin), pectins, proteins, sugars, ash, and other substances. The primary wall and winding layer comprise amorphous cellulose, hemicelluloses, esterified and non-esterified pectins, proteins, and metal ions. The secondary cell wall (SCW) is made of crystalline cellulose. Finally, the innermost layer is the lumen, and it comprises proteins and malic, citric, and other organic acids [6–8]. Mature cotton fibers are cellulose-rich with a thicker secondary wall and smaller lumen, while immature fibers are low in cellulose content with a thin wall and a large lumen. Based on length, mature cotton fibers can be classified as lint (long) and fuzz (short) fiber [9]. The unicellular cotton fibers emerge from the ovule surface immediately after flowering, and their development is timed in days post-anthesis (DPA). The temporal progression of fiber development takes approximately 50 days, and it can be divided into four overlapping phases based on morphological features: (i) initiation (0–4 DPA), (ii) elongation (3–21 DPA), (iii) SCW formation and thickening (15–40 DPA), and (iv) maturation (38–52 DPA) [10]. However, genotype and environmental interactions affect the duration of the phases and the rate of progression. Moreover, the array of transcripts expressed at these phases varies substantially, which enables us to study these differences in gene expression (GE).

The term transcriptome refers to the repertoire of RNA types found in a cell, tissue, or organism at a given time, collection point, location, treatment, developmental phase, environmental condition, or physiological state [11, 12]. The transcriptome is dynamic, and it tends to respond to subtle changes in environment, experimental condition, or treatment [11, 12]. The earlier GE or transcript analysis methods such as *in situ* hybridization (ISH), northern blot, quantitative reverse transcription PCR (RT-qPCR), microarrays, serial analysis of gene expression (SAGE), and expressed sequence tag (EST) analysis primarily focused on a single gene or a group of genes, while the RNA-Seq or whole transcriptome sequencing (WTS) technique aims at studying a wide range of transcripts [11, 13]. WTS is a contemporary and more promising strategy that has been widely used in studying the similarity or variability of GE, depending on the objective of the study. Although microarrays are an inexpensive and user-friendly method with a decent throughput, their fundamental limitation is the requirement of prior sequence knowledge for immobilizing a probe set on the chip. RNA-Seq has largely replaced microarrays as the predominant method for quantifying GE due to its significant advantages in providing a more comprehensive overview of the transcriptome [14, 15]. RNA-Seq can be used to study differentially expressed genes (DEGs) and a dynamic range of transcripts, including novel transcripts, alternative splice variants (ASVs), novel isoforms, gene fusions, non-coding RNA (ncRNA), and single nucleotide variants (SNVs), from the complex landscape of transcripts collected from a sample [11, 15].

*Transcriptome Analysis Using RNA Sequencing for Finding Genes Related to Fiber in Cotton… DOI: http://dx.doi.org/10.5772/intechopen.104572*

RNA-Seq is also the method of choice because of its throughput, ease, and portability in using hypothesis-free experimental designs. RNA-Seq has been widely used in profiling the transcriptomes of diverse crop species, including cotton [3] and its fiber [16]. In cotton, RNA-Seq based differential gene expression (DGE) for fiber-related genes among diverse species [17–19], genotypes [20, 21], and under various biotic [22, 23] and abiotic [24, 25] stresses has been reported. The inherent variation in the biochemical composition of the cotton fiber, and the temporal and spatial GE in fiber development phases, offer broad scope for analyzing the fiber transcriptome using RNA-Seq. Combining biochemical, molecular, and microscopic approaches to analyze cotton fiber showed that the significantly enriched genes were associated with the cytoskeleton, cell wall, cellulose biosynthesis, carbohydrate, and energy metabolism during fiber development [26, 27]. Since the fiber transcriptome was first reported [28], various transcriptome-based studies in cotton have shown the differential expression (DE) of several genes among different phases of fiber development [26]. Further, RNA-Seq has enabled researchers to investigate a comprehensive set of genes involved in fiber yield and quality [20, 29, 30]. Section 2 of this chapter reviews the basic RNA-Seq analysis pipeline and the tools currently being used. Section 3 discusses a few RNA-Seq studies primarily aimed at fiber quality attributes such as fiber length, strength, development, initiation, elongation, and color. Fiber single-cell transcriptome analysis is beyond the scope of this chapter.

## **2. RNA-seq analysis pipeline**

RNA-Seq is a complex subject with several aspects to be considered. A few key factors include experimental design, sequencing, and data analysis. This section provides an overview of the experimental design and primarily focuses on the basic pipeline of RNA-Seq analysis.

#### **2.1 Experimental design**

The time, effort, and cost of RNA-Seq analysis primarily rely on the experimental design [31]. Sequencing of a single sample, including RNA-Seq, on a short-read (Illumina) or long-read (PacBio SMRT/Oxford Nanopore) platform is offered for as low as \$100 [32]. However, the sequencing costs vary with the experimental design (control vs. treatment; normal vs. condition; early vs. late collection time point; and the number of samples vs. the number of replicates), throughput (30 vs. 100 million reads), NGS platform (Illumina vs. PacBio), sequencing time (today vs. a year later), and setup (academic vs. commercial). In addition, there are several factors one needs to consider before RNA sequencing, including the size of the genome (e.g., ~2.5 Gb), the number of genes (e.g., ~50,000), gene density (e.g., 20 genes/1 Mb), the targeted coverage (1X vs. 10X), sequencing chemistry (short-read vs. long-read), the read type (single-end vs. paired-end), the library type (stranded vs. non-stranded), the number of reads (e.g., >30 million) per sample, and the number of samples per lane (1 vs. 12) [11].

It is ideal to include six biological/technical replicates per sample in any experiment [33]. However, in most cases, a minimum of three replicates is used to draw statistically significant conclusions [11].

Different variations of the RNA-Seq methodology are available, and they are primarily based on coding (e.g., mRNA-Seq) and non-coding regions (e.g., small RNA-Seq). Besides total RNA-Seq, targeted RNA-Seq, digital gene expression (DGE)-Seq, and single-cell RNA-Seq (scRNA-Seq) are widely used [11], but only mRNA-Seq is highlighted in this chapter as it represents the coding portion of the genome. A typical short-read-based RNA-Seq experiment generally starts by selecting a population of cells or a tissue, followed by extracting total RNA, enriching mRNA, fragmenting the mRNA, converting the fragments into cDNAs, adding adapters to the cDNAs, creating a library, sequencing (Illumina), and data analysis [11]. The detailed RNA-Seq methodology has been reviewed earlier [15, 34–36], and it is beyond the scope of this chapter.

#### **2.2 Read processing and quality check**

After collecting the raw read data from an NGS platform, we must consider several quality metrics. The most important is the Phred quality score, a numeric value assigned to each base within a read. The Phred score (Q) is logarithmically related to the probability (P) that the base call is incorrect, Q = −10 × log10(P), so a score of 30 corresponds to a 1 in 1,000 chance of an erroneous call. Quality-control tools usually summarize these scores as the mean value at each position across all reads in a sample. For instance, in a library of 100-base reads, a mean Phred score is reported for each position from 1 to 100. A high Phred score is therefore expected at each position in a read and across all reads in a sample as an indicator of quality data. The second quality metric to consider is the adapter content, the percent of sequences containing the adapter. It is often detected *in silico* by checking for adapter contamination at each read position and across all reads in a sample. Sometimes, adapter contamination is seen towards the end of the read, especially when the sequenced fragment (cDNA) is shorter than the read length, so that the DNA polymerase continues past the end of the fragment and into the opposite adapter. Another quality metric to consider is GC content, for which we expect to see a single smooth peak. The presence of multiple peaks in the GC content distribution indicates possible contamination during library preparation.
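The relationship between Phred scores and base-calling error can be sketched in a few lines of Python. This is a minimal illustration, assuming the Phred+33 ASCII encoding commonly used in FASTQ files; the function names and the three quality strings are hypothetical and do not come from any particular QC tool.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)

def decode_quality(qual_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in qual_string]

def mean_quality_per_position(qual_strings, offset=33):
    """Mean Phred score at each read position across all reads."""
    decoded = [decode_quality(q, offset) for q in qual_strings]
    longest = max(len(d) for d in decoded)
    means = []
    for pos in range(longest):
        scores = [d[pos] for d in decoded if len(d) > pos]
        means.append(sum(scores) / len(scores))
    return means

# Made-up quality strings for three 5-base reads ('I' = Q40, '5' = Q20, '+' = Q10).
toy_qualities = ["IIII5", "III5+", "II5++"]
print(phred_to_error_prob(30))
print(mean_quality_per_position(toy_qualities))
```

In this toy input the mean quality falls towards the 3' end of the reads, which is the pattern a per-position quality plot typically reveals.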

RNA-Seq is primarily performed in core laboratories or offered as a fee-for-service. It is common to find traces of ribosomal RNA from humans or other species if the sample was contaminated during handling and processing. Before analysis, we can eliminate such contamination by comparing query sequences against reference databases using tools such as DeconSeq [37] and PrinSeq [38]. Some tools, like FASTQC [39], will take a random subset of the reads from each sample and map them to various possible contaminant datasets to determine the quality. A clean RNA-Seq dataset should have the vast majority of reads (80–90%) from the organism being sequenced, and the remaining reads (10–20%) can be from related species owing to their homology. The genomic origin of the reads is also assessed by the proportion that map to exons. In mRNA sequencing, ~80% of the reads should map to exonic regions, and the remaining (~20%) can map to intronic or intergenic regions. Before performing the alignment, the raw reads are preprocessed to filter low-quality reads and remove non-biological sequences such as adapter sequences, barcodes, and indexes using trimming tools such as Cutadapt [40], TrimGalore [41], Scythe [42], Trimmomatic [43], HTStream [44], and BBduk [45]. The popular tools for quality checking of raw reads before and after trimming include iSeqQC [46], MultiQC [47], NGS QC [48], FASTQC [39], and FASTX-Toolkit [49].
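To make the preprocessing step concrete, the following sketch removes a 3' adapter by exact matching and computes the GC fraction of a read. It is a simplified illustration only: the trimming tools cited above also tolerate mismatches and use quality information, which this sketch does not. The function names are hypothetical, and the adapter string is the widely used Illumina adapter prefix, taken here as an assumption about the library.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter by exact matching.

    First look for the full adapter anywhere in the read, then for a
    prefix of the adapter (at least min_overlap bases) at the read end.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix for this toy example

toy_reads = [
    "ACGTACGTAGATCGGAAGAGCTT",  # full adapter inside the read
    "ACGTACGTAGATC",            # only the first 5 adapter bases at the 3' end
    "ACGTACGT",                 # no adapter
]
for read in toy_reads:
    trimmed = trim_adapter(read, ADAPTER)
    print(trimmed, round(gc_fraction(trimmed), 2))
```

All three toy reads reduce to the same 8-base insert, which mirrors the situation where a short cDNA fragment is read through into the adapter.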

#### **2.3 Read alignment**

Two approaches are available for aligning and quantifying both short and long reads: (i) traditional reference-based splice-aware alignment, and (ii) pseudo alignment. In reference-based alignment, the reads are mapped directly to the reference genome while taking the splice junctions between exons into account. For instance, pre-mRNA contains exons and introns, but in the processed mRNA the introns are spliced out and the exons are joined. So, it is inevitable to find some short reads that span these exon-exon junctions and do not map contiguously back to the reference genome. The selected aligner must be aware of such spliced products and junctions. It is important to realize that if our goal is to discover novel isoforms, we must consider exon-exon junctions. These methods typically run on clusters because they require a large amount of memory or central processing unit (CPU) time. There are several read-alignment tools available for both short and long reads. The popular tools used in reference-based read alignment are PuffAligner [50], STAR [51], HISAT2 [52], TopHat2 [54], and Bowtie2 [55], while HTSeq [53] is typically used downstream to count the aligned reads.

The other, relatively advanced approach in RNA-Seq is pseudo alignment, in which the mapper assigns reads based on their compatibility with the transcripts and not on precise base-to-base alignment, as its primary goal is to quantify transcript expression. The main idea of pseudo alignment is to ignore the exact location at which a read maps and to record only the transcripts with which the read is compatible. Computationally, pseudo alignment is more efficient, faster, and less memory intensive, and it is generally better suited to predicting transcript abundance than reference-based alignment. A transcriptome or a transcript repertoire of the organism of interest is used as the reference in pseudo alignment. A splice-aware aligner is not required here, as reads are mapped to a reference transcriptome and not to the genome. Popular pseudo alignment tools currently in use include Salmon [56], Kallisto [57], and Sailfish [58]. A few tools available for checking the alignment quality are QC3 [59], QoRTs [60], Qualimap [61], RSeQC [62], and RNA-SeQC [63]. Further, %GC, base quality, and mapping efficiency of aligned reads are assessed along with the distribution of read count, insert size, and depth to detect sample bias resulting from library preparation.
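The compatibility idea behind pseudo alignment can be illustrated with a toy k-mer index. This is a deliberately simplified sketch, not the algorithm implemented in Salmon, Kallisto, or Sailfish, which rely on more elaborate index structures; the transcript names, sequences, and function names are invented for illustration.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read.

    Only the set of compatible transcripts is reported; the position of
    the read within a transcript is never computed.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()

# Two hypothetical isoforms that share their first nine bases.
toy_transcripts = {
    "isoform_1": "ATGGCGTACGTTAGC",
    "isoform_2": "ATGGCGTACCCGATA",
}
K = 5
toy_index = build_kmer_index(toy_transcripts, K)

print(pseudoalign("GGCGTAC", toy_index, K))  # read from the shared region
print(pseudoalign("TACGTTA", toy_index, K))  # read specific to isoform_1
print(pseudoalign("TTTTTTT", toy_index, K))  # read compatible with nothing
```

Reads from the shared region remain ambiguous between isoforms; quantification tools resolve such ambiguity statistically during abundance estimation, which this sketch does not attempt.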

#### **2.4 Transcript abundance estimation and quantification**

In a standard RNA-Seq pipeline, read quantification is performed after filtering low-quality reads and aligning the remaining reads to an annotated genome or transcriptome to identify their genomic origin. The standard alignment and counting methods mainly rely on base-to-base alignment to an annotated or unannotated genome. The major limitation of standard tools is that genes often have multi-mapped reads. As a result, pipelines built on STAR [51], HTSeq [53], and TopHat2 [54] tend to underestimate gene expression (GE), resulting in false negatives. In contrast, Cufflinks [64] tends to overestimate GE and results in many false positives (FPs). Most transcriptome-based tools avoid base-to-base alignment of the reads, thus reducing the computational time and costs. Also, transcriptome-based tools provide quantification estimates much faster and more accurately at the transcript level. Less memory intensive tools such as Salmon [56], Kallisto [57], and Sailfish [58] are used for transcript quantification and abundance estimation with minor differences. For instance, Salmon first quasi-maps the reads to the transcriptome and then estimates transcript abundance, reporting estimated counts (pseudo-counts) for each transcript. The pseudo-counts obtained can be used to find differential gene- or isoform-level expression. A few tools available for transcript abundance estimation and quantification are StringTie [65], STAR [51], tximport [66], DESeq2 [67], and Cufflinks [64].

The most straightforward approach for quantifying GE by RNA-Seq is to count the reads that align with each gene. Gene-level quantification approaches such as HTSeq commonly utilize annotation, where gene models correspond to the structure of transcripts. Raw read counts are affected by transcript length and the total number of reads. For instance, longer transcripts have higher read counts at the same expression level. Thus, the raw read counts are normalized to compare expression levels between samples. Reads per kilobase (kb) of exon per million mapped reads (RPKM) normalizes single-end read data for differences in sequencing depth and gene length, while fragments per kb of transcript per million mapped reads (FPKM) does the same for paired-end read data. In contrast to RPKM and FPKM, transcripts per million (TPM) normalizes for gene length first and for library size later [68]. Correcting for gene length is unnecessary when the same gene is compared across samples. However, it is required to rank GE levels precisely within a sample, because longer genes accumulate relatively more reads at the same expression level.
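The difference in the order of operations between RPKM and TPM can be shown with a short sketch. The gene names, counts, and lengths below are made up, and the functions are illustrative rather than taken from any quantification package; FPKM follows the RPKM formula with fragments counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of exon per million mapped reads.

    Depth and length are corrected in one step:
    count * 1e9 / (total_reads * gene_length_in_bp).
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million.

    Length is corrected first (reads per kilobase), then the per-kilobase
    rates are rescaled so that they sum to one million.
    """
    rate = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}

# Hypothetical genes whose raw counts differ only because of gene length.
toy_counts = {"geneA": 100, "geneB": 200, "geneC": 700}
toy_lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}

print(rpkm(toy_counts, toy_lengths))
print(tpm(toy_counts, toy_lengths))
```

In this toy case the three genes receive identical normalized values even though their raw counts differ sevenfold, because every gene has the same number of reads per kilobase. A further property of TPM is that the values always sum to one million within a sample.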

#### **2.5 Finding differentially expressed genes**

Normalized read count data are used as input for differential expression (DE) analysis, and statistical analysis is performed to identify quantitative changes in GE levels between experimental groups. For instance, statistical testing determines whether the observed differences in read counts are significant compared to natural random variation. The selection of the analysis tool often depends on the experimental design and availability. We can use DE analysis tools for pair-wise or multiple comparisons between or among the samples. When a gene is measured in several replicate samples, the average value is taken as its GE level for that group. Testing for DE across thousands of genes requires correction for multiple comparisons. The two common statistical corrections are the Bonferroni correction and the false discovery rate (FDR). FDR is the most widely adopted approach in RNA-Seq as it operates on the whole set of tested genes and aims to keep the expected proportion of false positives among the genes called significant below an acceptable threshold (<5%). The upregulated or downregulated DEGs are typically represented using volcano plots or MA plots. Top-ranked significant DEGs (p-value < 0.05; absolute Log2 Fold Change, |Log2FC| > 1.00; and FDR < 0.05) are usually shown as heatmaps. The average linkage method is commonly used to compute the hierarchical clustering, whereas the Euclidean distance measures the closeness between rows and columns. Dimensionality reduction is applied to expression data to detect outliers and batch effects. Principal component analysis (PCA) is most commonly used as it reduces the complexity of expression data by showing relationships among samples or replicates as clusters in two-dimensional space.
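The two multiple-testing corrections can be compared on a small set of p-values. The sketch below implements the Bonferroni adjustment and the Benjamini-Hochberg step-up adjustment, one common way of controlling the FDR; the p-values are invented, and the function names are hypothetical rather than part of any DE package.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (often reported as q-values).

    Each p-value is scaled by m / rank, and monotonicity is enforced by
    taking a running minimum from the largest p-value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for five genes.
toy_pvalues = [0.001, 0.008, 0.020, 0.030, 0.20]

bonf = bonferroni(toy_pvalues)
bh = benjamini_hochberg(toy_pvalues)
print("Bonferroni significant:", sum(p < 0.05 for p in bonf))
print("BH significant:", sum(q < 0.05 for q in bh))
```

On this toy input the Bonferroni correction retains fewer genes than the BH adjustment at the same 0.05 threshold, which illustrates why FDR control is usually preferred when thousands of genes are tested.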

Even though different tools such as Glimma [69], Ballgown [70], EBSeq [71], limma [72], voom [73], DESeq2 [67], edgeR [74], and baySeq [75] are available for DE analysis, DESeq2 and edgeR are the most widely used. DESeq2 normalizes the gene read counts by library size and composition to avoid sampling bias and batch effects. In addition, it models gene read counts with the negative binomial distribution and uses hierarchical modeling to stabilize the gene variance. Further, it uses the Benjamini-Hochberg (BH) procedure to control the false discovery rate. Although DESeq2 and edgeR both rely on the negative binomial distribution assumption, they differ in the test statistic. For example, DESeq2 depends on the Wald test, and edgeR relies on the quasi-likelihood F-test. The Bayesian approaches, baySeq and EBSeq, also assume the negative binomial model. In contrast, the limma, voom, and limma+voom tools use a normal linear model, and their test statistic is the empirical Bayes moderated t-statistic. In comparison, Ballgown uses a nested linear model and a parametric F-test. In addition, quality control on expression data is performed using tools such as iSeqQC [46], DEGreport [76], NOISeq [77], and EDASeq [78] for detecting sample heterogeneity, outliers, and cross-sample contamination. These tools mostly rely on statistical approaches such as correlation analysis and dimensionality reduction.
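Normalization by library size and composition can be illustrated with the median-of-ratios idea described in the DESeq2 documentation. The sketch below is a plain-Python approximation written for illustration, not DESeq2's own code; the function name, gene names, and counts are hypothetical.

```python
import math
from statistics import median

def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix maps gene -> list of raw counts, one per sample.
    For each gene, counts are compared to the gene's geometric mean
    across samples; a sample's size factor is the median of these ratios.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(count_matrix.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for counts in count_matrix.values():
        if any(c == 0 for c in counts):
            continue
        log_geo_mean = sum(math.log(c) for c in counts) / n_samples
        for j, c in enumerate(counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [0, 5],  # ignored because of the zero count
}
factors = size_factors(toy_counts)
normalized = {g: [c / f for c, f in zip(cs, factors)]
              for g, cs in toy_counts.items()}
print(factors)
print(normalized["geneB"])
```

Using the median rather than the total count makes the factor robust to a few very highly expressed genes, which is the sense in which the normalization accounts for library composition as well as library size.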

#### **2.6 Functional annotation**

After finding DEGs or gene clusters, the next step is commonly to assign a function to them. These genes or gene sets are screened to see whether they are enriched in a particular pathway, localized to a specific cell location, or have a specific function. Based on these features, the DEGs are classified into Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) [79]. In gene ontology (GO) analysis, an individual DEG or a set of DEGs is first assigned functions or GO terms, and then the enrichment analysis is performed on the gene sets. Lowly expressed genes are usually filtered out to reduce the number of hypotheses to be tested. For instance, given a set of DEGs that are upregulated among samples under a particular condition, an enrichment analysis will find GO terms that are overrepresented or underrepresented using functional annotations for that gene set. For example, we can use the goana and camera functions in the limma Bioconductor package to find the most enriched GO terms and to perform enrichment analysis on gene sets. A few routinely used functional annotation and enrichment tools include PANTHER [80], FoldGO [81], DAVID [82], REVIGO [83], and AmiGO [84].
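A common way to test whether a GO term is overrepresented in a DEG list is a one-sided hypergeometric test. The sketch below is an assumption about how such a test can be set up, not the internal method of any tool named above; the function name and the gene numbers are hypothetical.

```python
from math import comb

def overrepresentation_pvalue(n_background, n_term_background, n_degs, n_term_degs):
    """One-sided hypergeometric p-value for over-representation.

    n_background       genes in the background (e.g., all expressed genes)
    n_term_background  background genes annotated with the GO term
    n_degs             genes in the DEG list
    n_term_degs        DEGs annotated with the GO term
    Returns P(X >= n_term_degs) when n_degs genes are drawn at random.
    """
    upper = min(n_term_background, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_term_background, x)
        * comb(n_background - n_term_background, n_degs - x)
        for x in range(n_term_degs, upper + 1)
    )
    return tail / total

# Made-up numbers: 2,000 background genes, 40 carry the term,
# 100 DEGs, of which 8 carry the term (2 would be expected by chance).
p = overrepresentation_pvalue(2000, 40, 100, 8)
print(p)
```

Because one such test is run per GO term, the resulting p-values still need a multiple-testing correction before terms are reported as enriched.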

#### **2.7 Enrichment and pathway analysis**

Three approaches are used to assess gene sets or pathways: enrichment-based, pathway topology-based, and combined. Among the enrichment-based approaches, overrepresentation analysis (ORA) and functional class scoring (FCS) methods such as gene set enrichment analysis (GSEA) are widely used. Pathway topology tools, in contrast, help us understand GE as a set in a coordinated network [85], while the combined approach utilizes the features of both enrichment-based and pathway topology approaches.

A typical ORA pipeline includes DE analysis to find the number of DEGs and their reference genes associated with each pathway. ORA is simple and robust in identifying a few significant genes or gene sets, i.e., it relies on only a portion of the data. Because ORA assumes that genes are independent, genes or gene sets in a pathway are treated as separate entities and the interactions among them are ignored, which may result in many false positives. GSEA is more accurate than ORA because the entire list of genes is considered. A typical GSEA first scores genes and gene sets based on their P-values, rank order, and weighting, and then identifies independent pathways. However, GSEA also ignores the interaction between the gene sets or pathways. A few widely used pathway tools are Cytoscape [86], BioCyc [87], and EcoCyc [88].

Pathway topology analysis was developed to reflect the biological reality that genes work in a coordinated or regulated environment in the form of networks or pathways. The idea is to perturb a pathway and leverage its topology to study the effect on a single gene or gene set. Pathway topology analysis takes into account gene function, gene position, fold change, and interactions among genes. It relies on more data and is computationally intensive, and it is currently limited to signaling pathways alone. A few integrated tools such as iDEP [89], Ingenuity Pathway Analysis [90], and iPathwayGuide [91] have recently gained attention with the availability of cloud-based and data science tools in RNA-Seq analysis. The up-regulated and down-regulated genes can be visualized on a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway using the Pathview package [92] and KEGG Mapper [93]. Protein-protein interactions (PPIs) and their enrichment among up-regulated or down-regulated genes can be retrieved using the STRING database [94].
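Unlike ORA, GSEA uses the entire ranked gene list. The running-sum enrichment score at its core can be sketched as follows. The gene names and scores are hypothetical, the function name `enrichment_score` is ours, and the sketch omits the permutation step that real GSEA implementations use to assess significance; it also assumes that the gene set overlaps the ranked list only partially.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked gene list.

    Genes in the set move the sum up in proportion to |score|**weight;
    genes outside the set move it down by a constant step.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        # Keep the largest deviation from zero, with its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical genes ranked from most up-regulated to most down-regulated.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
stat = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, stat, {"g1", "g2"})     # set at the top
es_bottom = enrichment_score(genes, stat, {"g5", "g6"})  # set at the bottom
print(es_top, es_bottom)
```

A positive score indicates that the gene set is concentrated among up-regulated genes, and a negative score indicates concentration among down-regulated genes.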

Most transcriptome analysis studies include a combination of tools discussed above based on the objective or biological question under investigation. In general, the upstream analysis includes read processing and alignment, while the downstream analysis includes quantification, annotation, enrichment, and pathway assignment. A few recent RNA-Seq studies in identifying DEGs associated with fiber quality and its characteristics are discussed in the next section.

## **3. Transcriptome analysis in cotton for finding fiber related genes and pathways**

#### **3.1 Transcriptome analysis for fiber quality**

Several attributes determine fiber quality, including fiber length (mm), strength (cN/tex), development (initiation and elongation, %), uniformity (%), and micronaire (μg/inch). Fiber strength and length are important in deciding the spinning and yarn quality [95]. In addition, the micronaire value reflects the fiber fineness and maturity, which influence its processing and dyeing [96]. A few recent RNA-Seq studies that identified DEGs associated with fiber quality attributes are reviewed below (**Table 1**).

*Transcriptome Analysis Using RNA Sequencing for Finding Genes Related to Fiber in Cotton… DOI: http://dx.doi.org/10.5772/intechopen.104572*

**Table 1.**

*Transcriptome analysis for fiber quality.*

In an integrated study, high-density mapping was used to identify 36 stable and 18 novel quantitative trait loci (QTLs) associated with fiber quality in the CCRI70 RIL hybrid generated from the sGK156 and 901-001 varieties of *G. hirsutum*. Their RNA-Seq analysis included these two parental types and two RILs (MBZ70-053 and MBZ70-236) and identified 24,941 unique and 473 DEGs associated with pectin and phenylpropanoid biosynthesis and plant hormone signaling. Their bioinformatics analysis included Trimmomatic for trimming reads, HISAT2 for read alignment, StringTie for quantification of genes, DESeq2 for differential gene expression, the BLASTX program and GO tools for gene set enrichment, and KAAS for pathway analysis [99]. In a different transcriptome-based study, a high-yielding cotton cultivar, Jimian 5, and a high fiber quality *G. thurberi* introgression line, DH962, were used to identify 780 DEGs linked with fiber quality at 10 DPA [97]. Their study also integrated DEGs from transcriptome data and QTLs from phenotypic data to identify 31 genes associated with nine QTLs. Further, their study included the Bowtie and TopHat tools for aligning clean reads to the reference genome, Cufflinks for transcript prediction, DESeq for DEGs, OmicShare tools for functional annotation, and the Cotton Functional Genomics Database for finding pathways associated with the significantly enriched DEGs [97].

In a different study, RNA-Seq analysis was performed on data obtained from NCBI for different TM-1 tissues to identify genes controlling fiber quality. Results showed that 91 genes were expressed at different fiber developmental stages (5, 10, 20, and 25 DPA). Functional annotation of these genes using GO analysis revealed that most of them are involved in binding and enzymatic activity [98]. Further, by combining genome-wide association study (GWAS) data with publicly available RNA-Seq datasets, their study revealed 11 candidate genes for fiber quality located on chromosomes A07 and A13 [98].

A transcriptome study was conducted between two recombinant inbred lines, L1 and L2, with varied fiber quality, to highlight the differences in gene expression (GE) during fiber development stages and identify the genes responsible for fiber quality in Upland cotton (*G. hirsutum*) [100]. Their RNA-Seq analysis utilized the Trimmomatic, Bowtie, TopHat2, Cufflinks, and DESeq2 tools to find over 1000 DEGs between L1 and L2 at 15, 20, 25, and 30 DPA. Among these, 363 DEGs colocalized within the fiber strength QTL. Further, DEGs were annotated, and pathways were assigned, using the STEM, BLASTX, KOBAS, WGCNA, and Cytoscape tools. In addition, their co-expression network analysis revealed five modules closely associated with fiber development stages. The significantly enriched genes belonged to the leucine-rich repeat (LRR) receptor-like protein kinase, Rho GTPase-activating protein, bHLH transcription factor (TF), TPX2 protein, and actin-1 protein classes [100].

In a separate study, by integrating fine-mapping data with RNA-Seq, four DEGs linked with a QTL responsible for fiber quality traits were identified in the *G. hirsutum* RIL118 and Yumian 1 lines [30]. Their analysis included Bowtie and TopHat for mapping reads to the reference genome, HTSeq for transcript abundance estimation, DEGSeq for finding DEGs, and KEGG and KOBAS for gene enrichment analysis and finding KEGG pathways [30].
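Several of the studies above report DEGs that colocalize with QTL intervals. The underlying operation is a simple interval overlap on shared chromosomes, which can be sketched as follows. All gene names, QTL names, and coordinates are hypothetical placeholders, and the function name `degs_in_qtl` is ours; the cited studies may have used different criteria, such as flanking windows around marker positions.

```python
def degs_in_qtl(degs, qtls):
    """Map each QTL to the DEGs whose coordinates overlap its interval.

    degs: gene name -> (chromosome, start, end)
    qtls: QTL name  -> (chromosome, start, end)
    """
    result = {name: [] for name in qtls}
    for gene, (g_chr, g_start, g_end) in degs.items():
        for name, (q_chr, q_start, q_end) in qtls.items():
            # Two intervals overlap if each starts before the other ends.
            if g_chr == q_chr and g_start <= q_end and g_end >= q_start:
                result[name].append(gene)
    return result


# Hypothetical DEG coordinates and one hypothetical fiber strength QTL.
degs = {
    "geneA": ("A07", 1_200_000, 1_203_500),
    "geneB": ("A07", 5_000_000, 5_002_000),
    "geneC": ("D03", 1_250_000, 1_251_000),
}
qtls = {"qFS-A07-1": ("A07", 1_000_000, 2_000_000)}

colocalized = degs_in_qtl(degs, qtls)
print(colocalized)
```

Only `geneA` is reported here: `geneB` lies on the same chromosome but outside the interval, and `geneC` has overlapping coordinates but sits on a different chromosome.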

#### **3.2 Transcriptome analysis for fiber length**

Cotton fiber length, together with other attributes such as strength, elongation, and evenness, determines the spinning or yarn quality. Fiber length varies from one variety to another and generally ranges between 0.9 and 1.6 inches. Longer fibers are preferred over shorter ones in the textile industry due to their uniformity, fineness, and strength. Fiber length is determined by the underlying molecular mechanisms, including gene expression and regulation. Recent RNA-Seq studies related to the fiber length attribute are presented below (**Table 2**).

An integrated transcriptome and genotyping study aimed at deciphering the molecular mechanisms associated with cotton fiber length used ovule and fiber samples collected at −3, 0, 5, and 10 DPA from two contrasting RILs (MBZ70-053 and MBZ70-236) of the *G. hirsutum* hybrid CCRI70. It identified 2662 significant DEGs that belonged to energy metabolism during fiber initiation and to the auxin signaling pathway during fiber elongation [104]. Their pipeline included Trimmomatic for trimming adapters from raw reads, HISAT2 for aligning clean reads to the reference genome, StringTie2 for quantifying genes, DESeq2 for finding DEGs, BLASTX and GO tools for functional annotation, and KAAS, WGCNA, and Cytoscape for gene set enrichment and pathway analysis [104]. A study conducted in Upland cotton investigated the role in fiber development of the class II KNOX protein GhKNL1, which primarily acts as a transcription repressor in regulating SCW formation. The study compared the transcriptome profiles of two transgenic cotton varieties (silenced and dominant repression for the GhKNL1 gene) and a genetic standard (TM-1). The GhKNL1-silenced variety showed improved fiber length and thickened SCW, whereas the dominant repression variety showed shortened fiber length and thinner SCW. Furthermore, it has been reported that GhKNL1 could bind to promoters to facilitate cellulose synthesis and SCW development, thus affecting cotton fiber length [101].

**Table 2.**

*Transcriptome analysis for fiber length.*

In a study conducted to evaluate the role of Germin-like proteins (GLPs) in regulating cotton fiber development, RNA-Seq analysis between the wild type and the RNAi line for the GhGLP1 gene (YZ-1) with an overexpression promoter revealed that higher expression levels of GhGLP1 lead to shortened fibers. Their RNA-Seq analysis identified 566 DEGs in the RNAi lines, most of which belonged to genes and TFs involved in SCW biosynthesis. Also, comparative transcriptome screening of thirty long and short fiber varieties of cotton revealed that GhGLP1 promotes fiber elongation by delaying SCW thickening. Moreover, the YZ-1 knockdown line for the GhGLP1 gene resulted in improved fiber length and retarded SCW thickening, thus suggesting its negative role in fiber elongation [105]. A separate study that combined GWAS and linkage mapping identified a sucrose synthesis-like gene linked with a significant QTL on chromosome D03 that affects fiber length in Upland cotton [102]. Their RNA-Seq and qRT-PCR results consistently showed elevated expression levels of eleven candidate genes related to fiber length in *G. hirsutum*, including the sucrose synthesis gene, from −5 to 20 DPA at five-day intervals. In addition, their study included Bowtie2 for mapping clean reads to the reference genome and Cufflinks for obtaining GE levels when analyzing the publicly available data [102].

Comparative transcriptome analysis between the *G. hirsutum* CSB25 line, developed for fiber elongation, and the genetic standard TM-1 was performed to evaluate fiber traits. Their data analysis included TopHat2 for aligning clean reads to the reference genome, HTSeq to estimate GE, edgeR for finding DEGs, GAGE and REVIGO for GO enrichment, and KEGG tools for finding gene sets and pathways. They identified 1872 DEGs in their study, most of which belonged to cytoskeleton and cell wall metabolism. In addition, their investigation revealed that most of the genes were enriched in plant hormone signaling and in phenylpropanoid, amino acid, sucrose, and starch biosynthesis [103].

#### **3.3 Transcriptome analysis for fiber strength**

There is a huge demand for stronger cotton fiber in the global textile industry. Individual fiber strength determines the yarn strength. However, fiber strength is often measured as bundle fiber strength (BFS) and expressed in grams per tex (g/tex), where 'g' is the breaking force and tex (g/km) is the fineness. BFS is usually not an accurate measure of yarn strength because of the variability in fiber properties and the interaction between fibers. The fiber strength of Upland cotton has been improved considerably through molecular breeding approaches. However, efforts to study gene expression and regulation of fiber strength during fiber development are limited, and a few such studies are discussed here (**Table 3**).
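As a small worked example of the units involved, the sketch below computes a bundle tenacity in g/tex and converts it to cN/tex, the unit used for fiber strength earlier in this chapter. The bundle mass, length, and breaking force are hypothetical values chosen for easy arithmetic, and the function names are ours; the only fixed quantity is the standard conversion of one gram-force to 0.980665 cN.

```python
GF_TO_CN = 0.980665  # one gram-force expressed in centinewtons


def linear_density_tex(mass_mg, length_mm):
    """Linear density in tex (grams per kilometre).

    1 mg/mm equals 1000 g/km, hence the factor of 1000.
    """
    return mass_mg / length_mm * 1000.0


def tenacity_g_per_tex(breaking_force_gf, tex):
    """Breaking force (gram-force) normalized by linear density (tex)."""
    return breaking_force_gf / tex


# Hypothetical bundle: 0.5 mg over a 10 mm length, breaking at 1500 gf.
tex = linear_density_tex(mass_mg=0.5, length_mm=10.0)
bfs = tenacity_g_per_tex(breaking_force_gf=1500.0, tex=tex)
bfs_cn = bfs * GF_TO_CN
print(f"{tex:.1f} tex, {bfs:.1f} g/tex, {bfs_cn:.2f} cN/tex")
```

With these assumed inputs, the bundle has a linear density of 50 tex and a tenacity of 30 g/tex, or about 29.4 cN/tex.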

Using comparative fiber transcriptome analysis between a *G. mustelinum* introgression line (IL9) and its recurrent parent (PD94042), over 250 significantly enriched DEGs associated with the fiber strength QTL were identified at 17 and 21 DPA. Among these, 52 DEGs were identified as candidate genes, and two DEGs were associated with fiber strength QTL regions. Their GO enrichment and KEGG analysis showed that most of these DEGs belonged to the biosynthesis of secondary metabolites and metabolic pathways [106]. An RNA-Seq study aimed at understanding the molecular mechanism underlying fiber development and quality included a CSL line (SL7) and a *G. hirsutum* line (L22) and identified 70 significantly enriched DEGs associated with plant hormone transduction pathways. Their findings indicated that the introgressed chromosomal segment of SL7 plays a crucial role in expressing a transcription factor that contributes to fiber strength [109]. A study that screened publicly available transcriptomic data for bHLH transcription factors found that GhbHLH18 is coexpressed with most lignin biosynthesis genes [107]. Furthermore, they suggested that GhbHLH18 is preferentially expressed during early fiber elongation and negatively regulates fiber strength and length by binding to the E-box of its promoter and enhancing peroxidase-mediated (GhPER8) lignin biosynthesis [107].


**Table 3.**

*Transcriptome analysis for fiber strength genes.*

Transcriptome analysis in chromosome segment substitution lines (CSSLs) revealed 71 significant DEGs associated with fiber strength among four lines, CCRI45, MBI7561, MBI7747, and MBI7285, collected at 15, 20, 25, and 28 DPA [110]. The authors suggested possible roles for these genes in cell wall biogenesis, SCW deposition, and cotton fiber strength. Their analysis further identified 16 DEGs consistently found in the introgressed segments from the *G. barbadense* chromosomes across all possible comparisons. Their data analysis included NGS QC Toolkit for trimming and filtering raw reads, TopHat for aligning clean reads to the reference genome, HTSeq for quantifying expression levels, GFOLD for DEGs, and BLAST2GO for functional annotation and protein class assignment [110]. Comparative transcriptome analysis of two near-isogenic lines (NILs) of *G. hirsutum* with contrasting fiber strength, MD90ne and MD52ne, at 15 and 20 DPA revealed over 1000 significant DEGs [108]. In addition, the fiber elongation and cell wall integrity genes were enriched in ethylene and receptor-like kinase (RLK) signaling pathways. In data processing, they utilized the Sickle, GSNAP, Bedtools, edgeR, and AgriGO tools for trimming, mapping, annotation, differential gene expression, and enrichment analysis, respectively [108]. Further, they compared the RNA-Seq data with previously published microarray data [111].

#### **3.4 Transcriptome analysis for fiber development**

Cotton fibers are natural, unicellular outgrowths that emerge from the epidermis of the ovules. The differentiation and developmental phases (initiation, elongation, SCW synthesis, and maturation) of the cotton fiber determine the other attributes such as fiber length and strength. The variation in GE at different developmental stages of cotton fiber can be assessed using RNA-Seq. A few RNA-Seq studies related to fiber development are discussed below (**Table 4**).

**Table 4.**

*Transcriptome analysis for fiber development.*

Comparative transcriptome analyses of three fiber developmental stages with non-fiber tissues (leaf, root, stigma, and anther) identified 1205, 1135, and 937 significantly upregulated and 124, 179, and 213 downregulated DEGs at 7, 14, and 26 DPA, respectively, in *G. hirsutum* during fiber development. Moreover, the identified DEGs were enriched in functional and metabolic pathways, including signal transduction, catalytic activity, and carbohydrate metabolism [26]. Their pipeline included cutadapt for collecting quality reads, TopHat2 for aligning clean reads to the reference genome, StringTie for assembling mapped reads into transcripts, HTSeq for quantification of GE, DESeq for finding DEGs, and GOseq R and KOBAS for finding protein classes and pathways [26]. A study aimed at understanding the genes and complex networks associated with cotton fiber development and its domestication utilized *G. hirsutum* and screened transcriptomes collected at 5, 10, 15, and 20 DPA to reveal convergence and divergence in duplicated and homoeologous coexpression networks. Their analysis included GSNAP for mapping, PolyCat for homoeolog-specific expression, HTSeq for read count data, DESeq2 for finding DEGs, and weighted gene coexpression network analysis (WGCNA) for coexpression data to corroborate the idea of widespread gene usage in cotton fibers, subgenome-specific expression bias, and similarities and differences in coexpression modules within the subgenomes of a polyploid [114].
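Weighted coexpression analysis of the kind referred to above starts by turning pairwise gene correlations into a weighted adjacency. The sketch below illustrates that soft-thresholding step for an unsigned network, where adjacency is the absolute Pearson correlation raised to a power β. The expression values, gene names, and the choice of β = 6 are illustrative assumptions, and the later steps of a full WGCNA workflow (topological overlap, clustering into modules) are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency: |correlation| ** beta per gene pair."""
    genes = list(expr)
    adjacency = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            adjacency[(g1, g2)] = abs(pearson(expr[g1], expr[g2])) ** beta
    return adjacency


# Hypothetical expression of three genes across four time points.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.0, 3.0, 2.0, 4.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
adj = soft_threshold_adjacency(expr, beta=6)
for pair, weight in adj.items():
    print(pair, round(weight, 4))
```

Raising correlations to a power suppresses weak correlations (0.8 becomes roughly 0.26 here) while keeping strong ones near 1. In an unsigned network, the perfectly anticorrelated pair `geneA` and `geneC` receives the same weight as a perfectly correlated pair.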

In a study aimed at screening publicly available datasets for myeloblastosis (MYB)-like TFs and finding their role in fiber development, a research group identified 36 R2R3-MYBs highly expressed at 20 DPA in Upland cotton and suggested them as potential SCW synthesis regulators [115]. Comparative transcriptome analysis of *G. arboreum*, *G. hirsutum*, and a transgenic line revealed that the GhTCP4 TF plays an essential role in activating SCW genes by interacting with cis-elements in the promoter region. In contrast, GhHOX3 regulates TCP gene expression, thus promoting fiber cell elongation [116]. A comprehensive genome-wide analysis of *G. arboreum*, *G. raimondii*, and *G. hirsutum* from publicly available data revealed 196, 195, and 386 C2H2-like zinc finger genes, respectively. Also, the phylogenetic analysis of C2H2-like zinc finger proteins identified seven subgroups with similar exon-intron and protein motif compositions. Further, the differential expression (DE) pattern of 16 C2H2-like zinc finger genes identified in RNA-Seq data was validated with RT-qPCR analysis in the Ligon-lintless-1 (Li1) mutant and TM-1 at 0, 5, 8, and 10 DPA, suggesting a role for these transcription factors in biochemical and physiological functions during cotton fiber development [112]. In another study, comparative transcriptome analysis of a phytochrome A1 gene (PHYA1) RNAi line and its parent Coker 312, using RNA-Seq to study the GE profiles of 10 DPA fibers, identified 142 DEGs that play an essential role in fiber development. Their pipeline included Trimmomatic for trimming low-quality reads, ArrayStar for aligning clean reads to the reference genome, DESeq2 for finding DEGs, Blast2GO and InterProScan for functional annotation, and the KEGG database for functional protein classes and pathways [113].

#### **3.5 Transcriptome analysis for fiber initiation and elongation**

Cotton fibers are single, elongated cells that arise as external outgrowths from the seed epidermis. Therefore, cotton fibers are ideal for studying cell developmental stages such as differentiation and elongation. Further, the initiation step is critical in the fiber development process because it is the stage at which the cell is committed to developing into a fiber. Fiber initiation and elongation are thus ideal stages for undertaking RNA-Seq analysis to understand early fiber development, and a few such studies are discussed here (**Table 5**).

A recent study combined laser-capture microdissection (LCM) technology with RNA-Seq to understand the cotton cell types during the fiber developmental shifts. LCM can separate the epidermal cells from the fiber cells, while RNA-Seq can identify the subtle differences between these cell types [29]. Their results suggested that fiber cell initiation in cotton can be triggered by phytohormones and MYB-like transcription factors, cell cycle arrest, ribosome biosynthesis, and homoeolog expression bias of cell cycle and ribosome biosynthetic genes [29]. A recent omics-based study conducted on ovules collected immediately after anthesis in Upland cotton showed DE of several MYB-like TFs and early fiber development genes associated with the biosynthesis and signaling of phytohormones: indole-3-acetic acid, cytokinins, gibberellic acid, jasmonic acid, and brassinosteroids [27]. Another RNA-Seq study was conducted to understand the effect of temperature on fuzz fiber initiation in a thermo-sensitive variety of *G. barbadense*, L7009, subjected to temperature stress at 4 DPA, and identified 43,826 DEGs. Of these, 189, 9667, 240, and 901 DEGs belonged to plant hormone signal transduction, fiber development, fuzz fiber initiation, and transcription factors, respectively. Also, they reported that high temperatures could induce fiber development, fiber quality, and fuzz initiation. Further, the significantly enriched DEGs belonged to stress response, asparagine, and cell wall biosynthesis. However, fuzz initiation can be inhibited by low-temperature treatment in L7009. Furthermore, they reported the 4 DPA stage as the stage most susceptible to temperature stress during fuzz initiation [117].

**Table 5.**

*Transcriptome analysis for fiber initiation and elongation.*

Genome-wide transcriptome profiling of fiber-bearing ovules of *G. arboreum* at 0.5-day intervals from −0.5 to 3.0 DPA was performed to understand the molecular basis of fiber initiation. A total of 12,049 DEGs and 1049 DE transcription factors were detected. Most identified DEGs belonged to ribosome and amino acid biosynthesis and carbon metabolism. A few significantly enriched DEGs belonged to fatty acid degradation and flavonoid biosynthesis. Further, during fiber initiation, the significantly induced DE transcription factors belonged to the trihelix family, referred to as GaGTs, which were found on 12 of the 13 chromosomes in *G. arboreum* [119]. A study that screened transcriptome profiles for variations in fiber initiation and elongation among diverse fiber types of *G. hirsutum*, for example, long-staple cotton (LSC), short-staple cotton (SSC), long fiber group (LFG), and short-fiber group (SFG), identified twelve genes in fiber development; among these, glycosyl hydrolase, Pectin lyase-like superfamily protein (PER64), and Pectin lyase (PL) were down-regulated in fiber elongation [118]. Their pipeline included Trimmomatic for filtering low-quality reads, TopHat2 for aligning clean reads to the reference genome, StringTie for assembling mapped reads into transcripts, HTSeq for quantification of GE, DESeq for finding DEGs, GO for functional annotation, and KEGG for finding protein classes and pathways [118].

#### **3.6 Transcriptome analysis for fiber color**

Most cotton (*G. hirsutum*) fibers produced worldwide are white, although the lint and fiber of tetraploid cotton (*G. barbadense*) exhibit various colors, including red, blue, green, and several shades of brown. The fiber color trait in cotton is genetically inherited and results from pigments blended with cellulose. The yields of colored cotton are typically lower, and the fiber is shorter and weaker but softer when compared with white cotton. However, fiber qualities such as length, strength, and color have been improved in hybrids between *G. barbadense* and *G. hirsutum*. More recently, colored cotton fiber has gained importance due to its unique and desirable characteristics and has emerged as an eco-friendly, dye-free textile material. A few recent RNA-Seq studies focused on fiber color-related genes are presented below (**Table 6**).

**Table 6.**

*Transcriptome analysis for fiber color.*

Using multi-omic approaches (metabolome and RNA-Seq analysis), the biochemical and regulatory roles of genes involved in light-induced green color formation in cotton have been reported [120]. Their study compared early initiation (15 DPA) and late accumulated (45 DPA) metabolites under different lighting conditions and identified 236 differential metabolites, 20% of which belonged to the phenylpropanoid pathway. Their RNA-Seq analysis included gene set enrichment and KEGG pathway analysis to identify genes and regulatory networks linked with light-induced fiber color formation. These networks are highly correlated with the corresponding phenylpropanoid metabolites [120].

Another study compared the transcriptomes and metabolomes of a Green Colored Fiber (GCF) accession and its near-isogenic line, White Colored Fiber (WCF), at 12, 18, and 24 DPA and identified 2047 non-redundant metabolites enriched in eighty pathways, including the biosynthesis of phenylpropanoid, wax, cutin, and suberin [121]. Their metabolome analysis identified higher levels of metabolites (sinapaldehyde) linked with the phenylpropanoid pathway in the GCF line compared with the WCF phenotype. Moreover, the metabolites identified in their study overlapped with the transcriptome analysis, which showed significant up-regulation of the genes responsible for the biosynthesis of select metabolites. WGCNA on the DEGs identified between GCF and WCF revealed 16 gene modules co-expressed with fiber color at selected time points. At 24 DPA, a stage at which the fiber color of GCF and WCF is visually different, the blue module was of prime importance due to the upregulation of 56 hub genes and two homoeologous Gh4CL4 genes that have a potential role in green pigment biosynthesis [121]. A study aimed at understanding the gene expression and regulation of pigment biosynthesis generated RNAi lines for the chalcone flavanone isomerase gene in the brown-colored fiber (BCF) line [122]. In addition, they compared the transcriptome profiles of BCF with those of its transgenic fiber phenotypes, white and green, to identify 13 significantly enriched DEGs in the flavonoid and phenylpropanoid pathways [122].

## **4. Conclusions**

In conclusion, comprehensive phenotyping, genotyping, and transcriptome approaches coupled with integrated bioinformatics pipelines have considerably improved our understanding of genes associated with fiber quality and yield traits. However, besides the genes related to the fiber quality characteristics discussed in this chapter, fiber uniformity, fineness, and micronaire attributes must also be considered in cotton germplasm improvement programs. Further, functional annotation, enrichment, and gene network analysis tools will continue to evolve with better features to visualize subtle changes in gene expression associated with biological pathways. Furthermore, to better understand complex traits (e.g., fiber quality and yield) and polyploid plant genomes (e.g., Upland cotton), more advanced computational pipelines need to be developed to integrate multi-omic and multi-dimensional phenotypic data. Moreover, the fiber cell is a single elongated structure that serves as an ideal model for single-cell genomics. Dissecting the complexity associated with the initial input mRNA quantity in single-cell RNA-Seq will therefore aid in screening thousands of samples per sequencing run. The bulk RNA-Seq data generated by such voluminous efforts demand lightweight data science tools with a smaller memory footprint.

## **Acknowledgements**

The authors acknowledge Ms. Padma S. Ragam for reviewing this book chapter. Also, the authors would like to thank anonymous reviewers and editors for their efforts in improving this book chapter. Finally, the authors acknowledge the funding support by the Capacity Building grant #2020-38821-31103 from the USDA National Institute of Food and Agriculture (NIFA).

## **Conflict of interest**

The authors declare no conflict of interest.

## **Author details**

Shalini P. Etukuri¹, Varsha C. Anche¹, Mirzakamol S. Ayubov², Lloyd T. Walker¹ and Venkateswara R. Sripathi¹\*

¹ Center for Molecular Biology, Alabama A&M University, Normal, AL, USA

² Center of Genomics and Bioinformatics, Academy of Sciences of Uzbekistan, Tashkent, Uzbekistan

\*Address all correspondence to: v.sripathi@aamu.edu

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Wendel JF, Grover CE. Taxonomy and evolution of the cotton genus, *Gossypium*. Cotton. 2015;**57**:25-44

[2] Hendrix B, Stewart JM. Estimation of the nuclear DNA content of *Gossypium* species. Annals of Botany. 2005;**95**(5):789-797

[3] Sripathi VR, Buyyarapu R, Kumpatla SP, Williams AJ, Nyaku ST, Tilahun Y, et al. Bioinformatics tools and genomic resources available in understanding the structure and function of *Gossypium*. Bioinformatics-Updated Features and Applications. 2016. p. 231

[4] FAS-USDA. 2022. Cotton: World Markets and Trade: February 2022 report. Available from: https://apps.fas.usda.gov/psdonline/circulars/cotton.pdf

[5] United Nations. World Population Prospects: The 2017 Revision, Key Findings and Advance Tables. Department of Economics and Social Affairs PD, editor. New York: United Nations; 2017. p. 46

[6] Liu Y. Chemical composition and characterization of cotton fibers. In: Cotton Fiber: Physics, Chemistry and Biology. Cham: Springer; 2018. pp. 75-94

[7] McGrath JE, Hickner MA, Höfer R. Polymers for a sustainable environment and green energy. Polymer Science. 2013;**10**:849

[8] Dochia M, Sirghie C, Kozłowski RM, Roskwitalski Z. Cotton fibres. In: Handbook of Natural Fibres. Cambridge, England: Woodhead Publishing; 2012. pp. 11-23

[9] Hu H, Wang M, Ding Y, Zhu S, Zhao G, Tu L, et al. Transcriptomic repertoires depict the initiation of lint and fuzz fibres in cotton (*Gossypium hirsutum* L.). Plant Biotechnology Journal. 2018;**16**(5):1002-1012

[10] Salih H, Leng X, He SP, Jia YH, Gong WF, Du XM. Characterization of the early fiber development gene, Ligon-lintless 1 (Li1), using microarray. Plant Gene. 2016;**6**:59-66

[11] Sripathi VR, Anche VC, Gossett ZB, Walker LT. Recent applications of RNA-sequencing in food and agriculture. Applications of RNA-Seq in Biology and Medicine. 2021. p. 97

[12] Lowe R, Shirley N, Bleackley M, Dolan S, Shafee T. Transcriptomics technologies. PLoS Computational Biology. 2017;**13**(5):e1005457

[13] Morozova O, Hirst M, Marra MA. Applications of new sequencing technologies for transcriptome analysis. Annual Review of Genomics and Human Genetics. 2009;**10**:135-151

[14] Wolf JB. Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. Molecular Ecology Resources. 2013;**13**(4): 559-572

[15] Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics. 2009;**10**(1):57-63

[16] Yoo MJ, Wendel JF. Comparative evolutionary and developmental dynamics of the cotton (*Gossypium hirsutum*) fiber transcriptome. PLoS Genetics. 2014;**10**(1):e1004073

[17] Chen ZJ, Sreedasyam A, Ando A, Song Q, De Santiago LM, Hulse-Kemp AM, et al. Genomic diversifications of five *Gossypium* allopolyploid species and their impact on cotton improvement. Nature Genetics. 2020;**52**(5):525-533

[18] Parekh MJ, Kumar S, Fougat RS, Zala HN, Pandit RJ. Transcriptomic profiling of developing fiber in levant cotton (*Gossypium herbaceum* L.). Functional and Integrative Genomics. 2018;**18**(2):211-223

[19] Lin M, Pang C, Fan S, Song M, Wei H, Yu S. Global analysis of the *Gossypium hirsutum* L. transcriptome during leaf senescence by RNA-Seq. BMC Plant Biology. 2015;**15**(1):1-8

[20] Ma Q, Wu M, Pei W, Wang X, Zhai H, Wang W, et al. RNA-Seq-mediated transcriptome analysis of a fiberless mutant cotton and its possible origin based on SNP markers. PLoS One. 2016;**11**(3):e0151994

[21] Peng Z, He S, Gong W, Sun J, Pan Z, Xu F, et al. Comprehensive analysis of differentially expressed genes and transcriptional regulation induced by salt stress in two contrasting cotton genotypes. BMC Genomics. 2014;**15**(1):1-28

[22] Naqvi RZ, Zaidi SS, Akhtar KP, Strickler S, Woldemariam M, Mishra B, et al. Transcriptomics reveals multiple resistance mechanisms against cotton leaf curl disease in a naturally immune cotton species, *Gossypium arboreum*. Scientific Reports. 2017;**7**(1):1-5

[23] Xu L, Zhu L, Tu L, Liu L, Yuan D, Jin L, et al. Lignin metabolism has a central role in the resistance of cotton to the wilt fungus *Verticillium dahliae* as revealed by RNA-Seq-dependent transcriptional analysis and histochemistry. Journal of Experimental Botany. 2011;**62**(15):5607-5621

[24] Zhang F, Zhu G, Du L, Shang X, Cheng C, Yang B, et al. Genetic regulation of salt stress tolerance revealed by RNA-Seq in cotton diploid wild species, *Gossypium davidsonii*. Scientific Reports. 2016;**6**(1):1-5

[25] Bowman MJ, Park W, Bauer PJ, Udall JA, Page JT, Raney J, et al. RNA-Seq transcriptome profiling of upland cotton (*Gossypium hirsutum* L.) root tissue under water-deficit stress. PLoS One. 2013;**8**(12):e82634

[26] Yang J, Gao L, Liu X, Zhang X, Wang X, Wang Z. Comparative transcriptome analysis of fiber and nonfiber tissues to identify the genes preferentially expressed in fiber development in *Gossypium hirsutum*. Scientific Reports. 2021;**11**(1):1-8

[27] Wang L, Wang G, Long L, Altunok S, Feng Z, Wang D, et al. Understanding the role of phytohormones in cotton fiber development through omic approaches; recent advances and future directions. International Journal of Biological Macromolecules. 2020;**15**(163):1301-1313

[28] Wilkins TA, Arpat AB. The cotton fiber transcriptome. Physiologia Plantarum. 2005;**124**(3):295-300

[29] Ando A, Kirkbride RC, Jones DC, Grimwood J, Chen ZJ. LCM and RNA-Seq analyses revealed roles of cell cycle and translational regulation and homoeolog expression bias in cotton fiber cell initiation. BMC Genomics. 2021;**22**(1):1-6

[30] Liu D, Zhang J, Liu X, Wang W, Liu D, Teng Z, et al. Fine mapping and RNA-Seq unravels candidate genes for a major QTL controlling multiple fiber quality traits at the T1 region in upland cotton. BMC Genomics. 2016;**17**(1):1-3

[31] Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-Seq data analysis. Genome Biology. 2016;**17**(1):1-9

[32] Naphade S, Bhatnagar R, Hanson-Smith V, Choi I, Zhang A. Systematic comparative analysis of strand-specific RNA-Seq library preparation methods for low input samples. Scientific Reports. 2022;**12**(1):1-10

[33] Schurch NJ, Schofield P, Gierliński M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNA-Seq experiment and which differential expression tool should you use? RNA. 2016;**22**(6):839-851

[34] Hrdlickova R, Toloue M, Tian B. RNA-Seq methods for transcriptome analysis. Wiley Interdisciplinary Reviews: RNA. 2017;**8**(1):e1364

[35] Jazayeri SM, Melgarejo Munoz LM, Romero HM. RNA-Seq: a glance at technologies and methodologies. Acta Biológica Colombiana. 2015;**20**(2):23-35

[36] Nagalakshmi U, Waern K, Snyder M. RNA-Seq: a method for comprehensive transcriptome analysis. Current Protocols in Molecular Biology. 2010;**89**(1):4-11

[37] Schmieder R, Edwards R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS One. 2011;**6**(3):e17288

[38] Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;**27**(6):863-864

[39] Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

[40] Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;**17**(1):10-12

[41] Krueger F. Trim galore. A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files. Babraham Bioinformatics. 2015. Available from: https://www.bioinformatics.babraham.ac.uk/projects/trim\_galore/

[42] Buffalo V. Scythe—A Bayesian adapter trimmer. UC Davis Bioinformatics Core. 2011. Available from: https://github.com/ucdavis-bioinformatics/scythe

[43] Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;**30**(15):2114-2120

[44] David S. HTStream: A toolkit for high throughput sequencing analysis. In: Theses and Dissertations Collection, Digital Initiatives. University of Idaho Library; 2017. Available from: https://www.lib.uidaho.edu/digital/etd/items/streett\_idaho\_0089n\_11229.html

[45] Bushnell B. BBDuk. Joint Genome Institute. 2020. Available from: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/ [Accessed: June 25, 2020]

[46] Kumar G, Ertel A, Feldman G, Kupper J, Fortina P. iSeqQC: A tool for expression-based quality control in RNA-sequencing. BMC Bioinformatics. 2020;**21**(1):1-10

[47] Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;**32**(19):3047-3048

[48] Patel RK, Jain M. NGS QC Toolkit: A toolkit for quality control of next generation sequencing data. PLoS One. 2012;**7**(2):e30619

[49] Gordon A, Hannon GJ. Fastx-toolkit. FASTQ/A short-reads preprocessing tools (unpublished). 2010. Available from: http://hannonlab.cshl.edu/fastx\_toolkit.

[50] Almodaresi F, Zakeri M, Patro R. PuffAligner: A fast, efficient and accurate aligner based on the pufferfish index. Bioinformatics. 2021;**37**(22):4048-4055

[51] Dobin A, Gingeras TR. Mapping RNA-seq reads with STAR. Current Protocols in Bioinformatics. 2015;**51**(1):11-14

[52] Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods. 2015;**12**(4):357-360

[53] Anders S, Pyl PT, Huber W. HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;**31**(2):166-169

[54] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013;**14**(4):1-3

[55] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012;**9**(4):357-359

[56] Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 2017;**14**(4):417-419

[57] Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-Seq quantification. Nature Biotechnology. 2016;**34**(5):525-527

[58] Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-Seq reads using lightweight algorithms. Nature Biotechnology. 2014;**32**(5):462-464

[59] Guo Y, Zhao S, Sheng Q, Ye F, Li J, Lehmann B, et al. Multi-perspective quality control of Illumina exome sequencing data using QC3. Genomics. 2014;**103**(5-6):323-328

[60] Hartley SW, Mullikin JC. QoRTs: A comprehensive toolset for quality control and data processing of RNA-Seq experiments. BMC Bioinformatics. 2015;**16**(1):1-7

[61] García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012;**28**(20):2678-2679

[62] Wang L, Wang S, Li W. RSeQC: quality control of RNA-Seq experiments. Bioinformatics. 2012;**28**(16):2184-2185

[63] DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, et al. RNA-SeQC: RNA-Seq metrics for quality control and process optimization. Bioinformatics. 2012;**28**(11):1530-1532

[64] Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010;**28**(5):511-515

[65] Shumate A, Wong B, Pertea G, Pertea M. Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie. bioRxiv. 2021

[66] Soneson C, Love MI, Robinson MD. Differential analyses for RNA-Seq: transcript-level estimates improve gene-level inferences. F1000Research. 2015;**4**:1521

[67] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology. 2014;**15**(12):1-21

[68] Monga I, Kaur K, Dhanda SK. Revisiting hematopoiesis: applications of the bulk and single-cell transcriptomics dissecting transcriptional heterogeneity in hematopoietic stem cells. Briefings in Functional Genomics. 2022:elac002

[69] Su S, Law CW, Ah-Cann C, Asselin-Labat ML, Blewitt ME, Ritchie ME. Glimma: Interactive graphics for gene expression analysis. Bioinformatics. 2017;**33**(13):2050-2052

[70] Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-Seq experiments with HISAT, StringTie and Ballgown. Nature Protocols. 2016;**11**(9):1650-1667

[71] Leng N, Dawson J, Kendziorski C. EBSeq: An R package for differential expression analysis using RNA-Seq data. R Package Version. 2015;**1**(10):1-39

[72] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-Sequencing and microarray studies. Nucleic Acids Research. 2015;**43**(7):e47

[73] Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology. 2014;**15**(2):1-7

[74] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;**26**(1):139-140

[75] Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;**11**(1):1-4

[76] Pantano L, Hutchinson J, Barrera V, Piper M, Khetani R, Daily K. DEGreport: Report of DEG analysis. 2017. R package version 1.16.0. Available from: https://bioconductor.org/packages/release/bioc/html/DEGreport.html.

[77] Tarazona S, Furió-Tarí P, Turrà D, Pietro AD, Nueda MJ, Ferrer A, et al. Data quality aware analysis of differential expression in RNA-Seq with NOISeq R/Bioc package. Nucleic Acids Research. 2015;**43**(21):e140

[78] Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011;**12**(1):1-7

[79] Gene Ontology Consortium. Gene ontology consortium: Going forward. Nucleic Acids Research. 2015;**43**(D1):D1049-D1056

[80] Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, et al. PANTHER version 16: A revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Research. 2021;**49**(D1):D394-D403

[81] Wiebe DS, Mukhin AM, Omelyanchuk NA, Mironova VV. FoldGO for functional annotation of transcriptome data to identify fold-change-specific GO categories. In: Mathematical Modeling and High-Performance Computing in Bioinformatics, Biomedicine and Biotechnology (MM-HPC-BBB-2018). 2018. p. 73

[82] Jiao X, Sherman BT, Huang DW, Stephens R, Baseler MW, Lane HC, et al. DAVID-WS: A stateful web service to facilitate gene/protein list analysis. Bioinformatics. 2012;**28**(13):1805-1806

[83] Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;**6**(7):e21800

[84] Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S. AmiGO Hub, Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics. 2009;**25**(2):288-289

[85] Fernandes M, Husi H. ORA, FCS, and PT strategies in functional enrichment analysis. In: Proteomics Data Analysis. New York, NY: Humana; 2021. pp. 163-178

[86] Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: New features for data integration and network visualization. Bioinformatics. 2011;**27**(3):431-432

[87] Green ML, Karp PD. The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Research. 2006;**34**(13):3687-3697

[88] Karp PD. Pathway databases: A case study in computational symbolic theories. Science. 2001;**293**(5537):2040-2044

[89] Ge SX, Son EW, Yao R. iDEP: An integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinformatics. 2018;**19**(1):1-24

[90] Krämer A, Green J, Pollard J, Tugendreich S. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics. 2014;**30**(4):523-530

[91] Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, et al. A systems biology approach for pathway level analysis. Genome Research. 2007;**17**(10):1537-1545

[92] Luo W, Pant G, Bhavnasi YK, Blanchard SG, Brouwer C. Pathview web: User friendly pathway visualization and data integration. Nucleic Acids Research. 2017;**45**(W1):W501-W508

[93] Kanehisa M, Sato Y. KEGG mapper for inferring cellular functions from protein sequences. Protein Science. 2020;**29**(1):28-35

[94] Szklarczyk D, Gable AL, Nastou KC, Lyon D, Kirsch R, Pyysalo S, et al. The STRING database in 2021: Customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Research. 2021;**49**(D1):D605-D612

[95] Yang X, Wang Y, Zhang G, Wang X, Wu L, Ke H, et al. Detection and validation of one stable fiber strength QTL on c9 in tetraploid cotton. Molecular Genetics and Genomics. 2016;**291**(4):1625-1638

[96] Rodgers J, Zumba J, Fortier C. Measurement comparison of cotton fiber micronaire and its components by portable near infrared spectroscopy instruments. Textile Research Journal. 2017;**87**(1):57-69

[97] Wang H, Zhang R, Shen C, Li X, Zhu D, Lin Z. Transcriptome and QTL analyses reveal candidate genes for fiber quality in Upland cotton. The Crop Journal. 2020;**8**(1):98-106

[98] Dong C, Wang J, Yu Y, Ju L, Zhou X, Ma X, et al. Identifying functional genes influencing *Gossypium hirsutum* fiber quality. Frontiers in Plant Science. 2019;**9**:1968

[99] Jiang X, Gong J, Zhang J, Zhang Z, Shi Y, Li J, et al. Quantitative trait loci and transcriptome analysis reveal genetic basis of fiber quality traits in CCRI70 RIL population of *Gossypium hirsutum*. Frontiers in Plant Science. 2021;**12**:753755

[100] Zou X, Liu A, Zhang Z, Ge Q, Fan S, Gong W, et al. Co-expression network analysis and hub gene selection for high-quality fiber in upland cotton (*Gossypium hirsutum*) using RNA-Sequencing analysis. Genes. 2019;**10**(2):119

[101] Wang Y, Li Y, Gong SY, Qin LX, Nie XY, Liu D, et al. GhKNL1 controls fiber elongation and secondary cell wall synthesis by repressing its downstream genes in cotton (*Gossypium hirsutum*). Journal of Integrative Plant Biology. 2022;**64**(1):39-55

[102] Zhang C, Li L, Liu Q, Gu L, Huang J, Wei H, et al. Identification of loci and candidate genes responsible for fiber length in upland cotton (*Gossypium hirsutum* L.) via association mapping and linkage analyses. Frontiers in Plant Science. 2019;**10**:53

[103] Hsu CY, Arick MA, Miao Q, Saha S, Jenkins JN, Ayubov MS, et al. Transcriptome analysis of ten days post anthesis elongating fiber in the upland cotton (*Gossypium hirsutum*) chromosome substitution line CS-B25. American Journal of Plant Sciences. 2018;**9**(06):1334

[104] Jiang X, Fan L, Li P, Zou X, Zhang Z, Fan S, et al. Co-expression network and comparative transcriptome analysis for fiber initiation and elongation reveal genetic differences in two lines from upland cotton CCRI70 RIL population. PeerJ. 2021;**9**:e11812

[105] Sun M, Ye Z, Tan J, Chen S, Zhang X, Tu L. A cotton germin-like protein GbGLP2 controls fiber length via regulating genes involved in secondary cell wall synthesis. Molecular Breeding. 2020;**40**(10):1-4

[106] Chen Q, Wang W, Khanal S, Han J, Zhang M, Chen Y, et al. Transcriptome analysis reveals genes potentially related to high fiber strength in a *Gossypium hirsutum* line IL9 with *Gossypium mustelinum* introgression. Genome. 2021;**64**(11):985-995

[107] Gao Z, Sun W, Wang J, Zhao C, Zuo K. GhbHLH18 negatively regulates fiber strength and length by enhancing lignin biosynthesis in cotton fibers. Plant Science. 2019;**286**:7-16

[108] Islam MS, Fang DD, Thyssen GN, Delhom CD, Liu Y, Kim HJ. Comparative fiber property and transcriptome analyses reveal key genes potentially related to high fiber strength in cotton (*Gossypium hirsutum* L.) line MD52ne. BMC Plant Biology. 2016;**16**(1):1-9

[109] Song Z, Chen Y, Zhang C, Zhang J, Huo X, Gao Y, et al. RNA-Seq reveals hormone-regulated synthesis of noncellulose polysaccharides associated with fiber strength in a single-chromosomal-fragment-substituted upland cotton line. The Crop Journal. 2020;**8**(2):273-286

[110] Lu Q, Shi Y, Xiao X, Li P, Gong J, Gong W, et al. Transcriptome analysis suggests that chromosome introgression fragments from sea island cotton (*Gossypium barbadense*) increase fiber strength in upland cotton (*Gossypium hirsutum*). G3: Genes, Genomes, Genetics. 2017;**7**(10):3469-3479

[111] Hinchliffe DJ, Meredith WR, Yeater KM, Kim HJ, Woodward AW, Chen ZJ, et al. Near-isogenic cotton germplasm lines that differ in fiber-bundle strength have temporal differences in fiber gene expression patterns as revealed by comparative high-throughput profiling. Theoretical and Applied Genetics. 2010;**120**(7):1347-1366

[112] Salih H, Odongo MR, Gong W, He S, Du X. Genome-wide analysis of cotton C2H2-zinc finger transcription factor family and their expression analysis during fiber development. BMC Plant Biology. 2019;**19**(1):1-7

[113] Miao Q, Deng P, Saha S, Jenkins JN, Hsu CY, Abdurakhmonov IY, et al. Transcriptome analysis of ten-DPA fiber in an upland cotton (*Gossypium hirsutum*) line with improved fiber traits from phytochrome A1 RNAi plants. American Journal of Plant Sciences. 2017;**8**(10):2530-2553

[114] Gallagher JP, Grover CE, Hu G, Jareczek JJ, Wendel JF. Conservation and divergence in duplicated fiber coexpression networks accompanying domestication of the polyploid *Gossypium hirsutum* L. G3: Genes, Genomes, Genetics. 2020;**10**(8):2879-2892

[115] Huang J, Guo Y, Sun Q, Zeng W, Li J, Li X, et al. Genome-wide identification of R2R3-MYB transcription factors regulating secondary cell wall thickening in cotton fiber development. Plant and Cell Physiology. 2019;**60**(3):687-701

[116] Cao JF, Zhao B, Huang CC, Chen ZW, Zhao T, Liu HR, et al. The miR319-targeted GhTCP4 promotes the transition from cell elongation to wall thickening in cotton fiber. Molecular Plant. 2020;**13**(7):1063-1077

[117] Cheng G, Zhang L, Wei H, Wang H, Lu J, Yu S. Transcriptome analysis reveals a gene expression pattern associated with fuzz fiber initiation induced by high temperature in *Gossypium barbadense*. Genes. 2020;**11**(9):1066

[118] Qin Y, Sun H, Hao P, Wang H, Wang C, Ma L, et al. Transcriptome analysis reveals differences in the mechanisms of fiber initiation and elongation between long- and short-fiber cotton (*Gossypium hirsutum* L.) lines. BMC Genomics. 2019;**20**(1):1-6

[119] Mo H, Wang L, Ma S, Yu D, Lu L, Yang Z, et al. Transcriptome profiling of *Gossypium arboreum* during fiber initiation and the genome-wide identification of trihelix transcription factors. Gene. 2019;**709**:36-47

[120] Tang Z, Fan Y, Zhang L, Zheng C, Chen A, Sun Y, et al. Quantitative metabolome and transcriptome analysis reveals complex regulatory pathway underlying photoinduced fiber color formation in cotton. Gene. 2021;**767**:145180

[121] Sun S, Xiong XP, Zhu Q, Li YJ, Sun J. Transcriptome sequencing and metabolome analysis reveal genes involved in pigmentation of green-colored cotton fibers. International Journal of Molecular Sciences. 2019;**20**(19):4838

[122] Liu HF, Luo C, Song W, Shen H, Li G, He ZG, et al. Flavonoid biosynthesis controls fiber color in naturally colored cotton. PeerJ. 2018;**6**:e4537

## **Chapter 5**
