**1. Introduction**

The transcriptome is defined as the sum of all the messenger RNA molecules expressed from the genes of an organism, tissue or cell. Transcriptome analysis is a powerful method for plant biology research, since studying expressed genes facilitates investigation into plant development, responses to environmental stresses, plant‐microbe interactions and so on. Transcriptomic analysis of model organisms with an available full‐genome sequence, such as the classical object of plant genetics, *Arabidopsis thaliana* (L.) Heynh., enables researchers to measure gene expression levels more precisely, including studies of alternative splicing and epigenetic modifications, in order to reveal the molecular mechanisms involved in specific biological processes [1]. However, many aspects of plant biology, for example, economically important traits such as specific immunity, pathogen resistance and symbiotic efficiency that contribute to high crop productivity, cannot be studied in model plants alone, which makes the investigation of non‐model plants a necessity.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Transcriptomic Studies in Non-Model Plants: Case of *Pisum sativum* L. and *Medicago lupulina* L.

http://dx.doi.org/10.5772/intechopen.69057


The rapid decrease in per‐base sequencing cost, coupled with the unprecedented pace of development in computational biology, has opened the field of transcriptomics to in‐depth investigation of non‐model plants [1]. In the last few years, a large number of studies concerning differential gene expression, mapping of genes and quantitative trait loci (QTLs), analysis of genotypic variation and so on have been conducted with next‐generation sequencing (NGS) techniques on several non‐model plants, including legumes (members of the family Fabaceae) [2–4].

The leguminous plants chickpea (*Cicer arietinum* L.), pea (*Pisum sativum* L.) and lentil (*Lens culinaris* Medik.) were among the earliest domesticated plant species [5] and remain to this day an integral part of agricultural systems [6]. These and other members of the Fabaceae family are economically essential as a source of food, fodder and oil [3]. A significant feature of most legume species is their ability to form mutualistic symbioses with soil microorganisms. Root‐nodule symbiosis, the association of legumes with nodule bacteria collectively called rhizobia, provides the plant with fixed atmospheric nitrogen [7]. This makes the legume‐rhizobial inter‐organismal system an essential component of natural and agricultural ecosystems [8]. Arbuscular‐mycorrhizal (AM) symbiosis (association with arbuscular mycorrhizal fungi), inherent to over 80% of land plants including most legumes [9], facilitates the plant's uptake of water and minerals (especially phosphorus) and consequently improves the nutritional value of the crop. Legumes are also capable of forming symbioses with endophytic plant growth‐promoting bacteria, which also contribute to plant productivity [10, 11].

In the early 1990s, two legume species, *Medicago truncatula* Gaertn. and *Lotus japonicus* (Regel.) K. Larsen, were introduced as model objects for studying the plant genetics of symbiotic nitrogen fixation and AM development [12–14]. Both species have small diploid genomes (approx. 500 Mb) [15] and are self‐pollinators with a short generation time, able to produce hundreds to thousands of seeds per plant. Intensive genetic studies resulted in high‐quality annotated genomes for both *L. japonicus* and *M. truncatula*, the accumulation of gene expression microarray datasets and the development of several tools and repositories combining the diverse genetic, genomic and transcriptomic data on these model species (the *Medicago* Gene Expression Atlas [16, 17], the *Medicago* genome database [18], the *Lotus* Base information portal [19], etc.).

During the last decade, the rapid development of sequencing and bioinformatics technologies has significantly improved the state of genomics in non‐model legumes. In the past few years, the genomes of important legumes, such as *Glycine max* (L.) Merr. [20], *Phaseolus vulgaris* L. and *Trifolium pratense* L. [21], have been sequenced and are currently available at the Phytozome website (https://phytozome.jgi.doe.gov/pz/portal.html) and in the integrative bioinformatic platform Legume IP, which provides information about gene and protein sequences, gene models and annotations, syntenic regions, protein families and phylogenetic trees [22].


228 Applications of RNA-Seq and Omics Strategies - From Microorganisms to Human Health


Despite all the recent research progress, most agriculturally important legumes were long considered 'orphan' crops, left out of intensive genomic studies because of their large genomes and because their agricultural significance lies mainly in developing countries that lack funds for large‐scale 'omics' studies [3]. Most genome and transcriptome analysis tools were developed for particular model objects [23] and can generally be used for studying 'orphan' species [24, 25], although careful fine‐tuning may be necessary for successful deployment of these tools in non‐model organisms (see **Figure 1**). With the cost of genome assemblies remaining prohibitively high, researchers are forced to work with transcriptome data only, which makes the analysis strategy all the more important.
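One of the strategies summarised in Figure 1 combines a genome‐guided and a de novo assembly and then collapses redundant transcripts. The following Python sketch illustrates only the idea of that merging step. It is a toy stand‐in, not the algorithm used by CD‐HIT‐EST or CAP3: it drops a contig only when it is contained exactly (on either strand) in a longer contig, whereas real tools cluster by a sequence identity threshold. The function name `merge_assemblies` and all contig names and sequences are hypothetical.

```python
from typing import Dict

# Complement table for unambiguous DNA bases plus N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_assemblies(*assemblies: Dict[str, str]) -> Dict[str, str]:
    """Pool contigs from several assemblies and remove contained duplicates.

    Contigs are processed from longest to shortest. A contig is discarded
    if it (or its reverse complement) is an exact substring of a contig
    that has already been kept. This is a simplification of identity-based
    clustering and is meant for illustration only.
    """
    pooled = []
    for assembly in assemblies:
        for name, seq in assembly.items():
            pooled.append((name, seq.upper()))
    # Longest first; ties are broken by name so the result is deterministic.
    pooled.sort(key=lambda item: (-len(item[1]), item[0]))

    kept: Dict[str, str] = {}
    for name, seq in pooled:
        rc = reverse_complement(seq)
        if any(seq in rep or rc in rep for rep in kept.values()):
            continue  # redundant with a longer representative
        kept[name] = seq
    return kept


# Made-up contigs standing in for a genome-guided and a de novo assembly.
genome_guided = {"gg_1": "ATGGCGTACGTTAGCCTGA", "gg_2": "TTGACCGGTA"}
de_novo = {"dn_1": "GCGTACGTTAG", "dn_2": "TACCGGTCAA", "dn_3": "CCCCGGGGAAAATTTT"}

merged = merge_assemblies(genome_guided, de_novo)
print(sorted(merged))
```

In this example `dn_1` is dropped because it is contained in `gg_1`, and `gg_2` is dropped because it is the reverse complement of `dn_2`. Processing the longest contigs first mirrors the usual choice of the longest sequence as the cluster representative.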

**Figure 1.** Pipelines of transcriptome assembly in non‐model plants (based on the information from Refs. [23, 24]). Three strategies for RNA‐seq analysis. (A) Using a draft genome: novel transcript discovery, quantification and functional annotation. (B) De novo transcriptome assembly with no reference. For quantification, reads are mapped back to the novel reference transcriptome, followed by functional annotation of the novel transcripts as in (A). (C) Combination of the two methods. Transcriptomes are first assembled using methods (A) and (B), then merged using CD‐HIT‐EST and CAP3. Transcripts are then annotated as in (B).

It is worth noting that one of the most challenging steps of transcriptome analysis pipelines is correct transcript annotation. The simplest approach that gives a sufficiently accurate result is a BLAST search against annotated sequences of other species. The development of transcriptome annotation pipelines, for example, Trinotate [26], has largely taken the burden of transcriptome annotation off the researcher. Trinotate combines the output of a number of annotation tools into an integrated database, simplifying subsequent deeper analysis of the acquired data.
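The BLAST‐based annotation mentioned above can be illustrated with a minimal Python sketch that assigns each transcript the annotation of its best‐scoring hit. It assumes the hits were saved in the default 12‐column tabular format of BLAST+ (`-outfmt 6`). The E‐value cut‐off of 1e‐5, the function name `best_hits` and all contig and protein identifiers are illustrative assumptions, not values or names taken from the studies cited here.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {transcript_id: (subject_id, bitscore)} for the top-scoring hit.

    max_evalue is an illustrative cut-off, not a value from the chapter.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if evalue > max_evalue:
            continue  # hit too weak to transfer an annotation
        query = row["qseqid"]
        # Keep the hit with the highest bit score for each transcript.
        if query not in best or bitscore > best[query][1]:
            best[query] = (row["sseqid"], bitscore)
    return best


# Hypothetical hits; every identifier and number below is made up.
rows = [
    ("contig_1", "Mtr_protA", "92.5", "200", "15", "0", "1", "600", "1", "200", "1e-80", "310"),
    ("contig_1", "Gma_protB", "85.0", "198", "29", "1", "4", "597", "2", "199", "1e-60", "250"),
    ("contig_2", "Lja_protC", "40.0", "50", "30", "0", "10", "159", "5", "54", "0.5", "30.1"),
    ("contig_3", "Mtr_protD", "70.2", "120", "35", "1", "1", "360", "1", "120", "2e-20", "95.5"),
]
example = "\n".join("\t".join(fields) for fields in rows)

annotations = best_hits(io.StringIO(example))
print(annotations)
```

Here `contig_2` receives no annotation because its only hit exceeds the E‐value cut‐off. Integrated pipelines such as Trinotate collect this kind of evidence from several tools, so a single best‐hit table should be read as the simplest possible baseline.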

One example of an 'orphan' legume is garden pea (*Pisum sativum* L.), a valuable pulse crop capable of forming both nitrogen‐fixing symbiosis and arbuscular mycorrhiza. Global production of green pea in 2014 was 17.4 million tons, harvested from 2.3 million hectares, with an additional 11.2 million tons of dried pea from 6.9 million hectares [6]. The genome of the species is estimated at about 4300 Mb, with a high percentage of repetitive sequences [27]. Adapting RNA‐seq data analysis approaches standardised for model plants to *P. sativum* should facilitate both the study of pea molecular genetics and the breeding of new cultivars possessing agriculturally important traits.

Black medick (*Medicago lupulina* L.), a close relative of the model legume barrel medick (*M. truncatula* Gaertn.), is another example of an important non‐model legume that has barely been studied in terms of genetics. It is valuable as a pasture legume component in complex grass mixtures and can also be used as an intermediate culture in crop rotation and as green manure. Black medick is characterised by high protein, vitamin and mineral content, a long growing season and the ability to improve soil fertility through nitrogen fixation, which also makes it a perfect lawn plant [28]. Black medick is a very promising object for studying AM functioning and development, since a unique genetic line of *M. lupulina* that is obligately dependent on the formation of arbuscular mycorrhiza symbiosis has been selected from the spring landrace population VIK‐32 of *M. lupulina* var. *vulgaris* Koch originating from Kazakhstan [28, 29]. Plants of the line MlS‐1 (for *Medicago lupulina* Spring) [28] demonstrate dwarfism when grown in soil with a low level of Pi (inorganic phosphorus) in the absence of AM fungal inoculation but grow normally when inoculated with an AM fungus. Therefore, the MlS‐1 line is considered highly effective in AM symbiosis formation (as inoculation with fungi dramatically increases plant biomass). Apparently, the MlS‐1 line is only capable of the symbiotrophic route of phosphorus uptake from the soil, presumably owing to as yet unidentified mutation(s), and can consequently serve as a model object for the investigation of arbuscular‐mycorrhizal symbiosis. For instance, this line is suitable for mutagenesis aimed at selecting mutants with defects in arbuscular mycorrhiza development, since plants carrying mutations in genes related to AM formation can be easily identified by visual examination: they demonstrate dwarfism even when inoculated with AM fungi [29].

The high level of genome synteny and the similarity of gene sequences and developmental processes provide the opportunity to use the vast amounts of data accumulated on *M. truncatula* in the genetics, genomics and transcriptomics of the non‐model legumes *M. lupulina* and *P. sativum*. In this chapter, we give a brief description of the current achievements in the field of transcriptomics of the non‐model legumes black medick (*M. lupulina*) and garden pea (*P. sativum*).

[The pea transcriptome,] unlike the genome, is closer in size to the transcriptomes of other legumes, including the model plant *M. truncatula*, making it more amenable to analysis. Because gene expression is tissue‐specific, different plant tissues possess unique sets of transcripts, which makes the choice of tissue samples important for further research. Furthermore, transcriptome assemblies from distinct plant organs should be used as references for the analysis of tissue‐specific processes. A high‐quality transcriptome assembly with full tissue representation is therefore crucial for studies of gene interactions (differential gene expression, see section 3), gene polymorphism studies and proteome analysis.

In the last 5 years, several pea transcriptome assemblies of distinct organs and tissues have been presented by different workgroups. The first pea transcriptome sequencing and assembly was published by Franssen et al. [30]. A total of 20 libraries from flowers, leaves, cotyledons, epicotyls and hypocotyls, and from etiolated and light‐treated etiolated seedlings, were sequenced using the Roche 454 sequencing platform. Several iterations of de novo assembly and merging yielded 81,449 unigenes. Sudheesh et al. [31] sequenced transcriptomes from different parts (leaf, stipule, stem, tendril tissues from multiple nodes, root‐tip tissues, flowers, stamens, pistils, immature pods, immature seeds and nodules) of two pea cultivars (Parafield and Kaspa) differing in both seed and plant morphological characteristics. Read assembly for the separate cultivars yielded 126,335 and 145,730 contigs, respectively, with 87% showing significant expression levels in both cultivars. Later, Liu et al. sequenced samples from pea seeds harvested 10 and 25 days after pollination and assembled 77,273 unigenes [32]. Several transcriptome assembly sets were generated for single nucleotide polymorphism (SNP) marker development and genetic mapping in pea (see section 4). Duarte et al. [33] sequenced libraries from eight pea cultivars (six spring‐sown, one winter‐sown field pea and one fodder pea cultivar) with Roche 454 technology. A total of 3,826,797 reads were assembled into 68,850 contigs by the MIRA transcriptome assembler [34]. Sindhu et al. sequenced 3′‐anchored libraries of eight diverse pea accessions (six *P. sativum* cultivars (CDC Bronco, Alfetta, Cooper, CDC Striker, Nitouche and Orb) and two wild accessions, P651 (*P. fulvum*) and PI 358610 (*P. sativum* ssp. *abyssinicum*)) with Roche 454 technology, generating 4,008,648 reads in total. De novo assembly was performed for 520,797 reads from CDC Bronco by MIRA, resulting in a set of 29,725 reference contigs representing a significant proportion of the 3′ ends of genes in pea [35].

Since analysis of the inter‐organismal genetic network between pea and rhizobia is a poorly developed field, the assembly of a high‐quality transcriptome has provided researchers with much‐needed data on nodule‐specific transcripts. Transcriptomes of pea nodules and root tips were obtained by Zhukov et al. [36]. Transcriptome sequencing on the Illumina Genome Analyzer IIx platform (Illumina Inc.) generated 52,021,865 reads from the 'Nodules' library and 17,684,604 reads from the 'Root Tips' library, which were assembled de novo by Trinity [37] into 58,397 and 37,287 contigs, respectively. A total of 13,000 nodule‐specific contigs were annotated by alignment to known plant protein‐coding sequences and by Gene Ontology search. Of these, 581 sequences were found to possess a full coding DNA sequence (CDS) and could thus be considered novel nodule‐specific transcripts of pea. Further investigation of these transcripts can potentially lead to the discovery of key regulators of nodule symbiosis, as in the identification of the pea gene homologous to the *Nodulation signaling pathway 1* (*NSP1*) gene of *M. truncatula* [38]. In this study, pea gene *Sym34* was shown to be homologous to the *M. truncatula NSP1* gene,
