### **3. Data analysis workflow**

#### **3.1 Data mining of potential genes and transcription factors related to flavonoid biosynthesis**

PubMed (https://pubmed.ncbi.nlm.nih.gov/), Web of Science (https://www.webofknowledge.com/) and Scopus (https://www.scopus.com/) were searched systematically for publications related to flavonoid biosynthetic genes in *Oryza sativa*, using the keywords "flavonoid", "rice", "SNP", "anthocyanin", "proanthocyanidin" and "pigmented" and their combinations. The keywords "flavonoid", "C4H", "4CL", "CHS", "F3H", "F3'H", "UGT", "DFR", "CHI", "PAL", "LAR", "ANS", "ANR", "LDOX", "anthocyanin" and "proanthocyanidin" were used to query the genome database (i.e. RAPDB) and the pathway databases (i.e. KEGG, RiceCyc, PlantReactome) for gene identifiers, gene sequences and gene descriptions.

All genes related to the flavonoid genes were identified by similarity search in OmicsBox (https://www.biobam.com/omicsbox/). BLASTX [22] analysis was performed with an e-value cut-off of less than 1e-10 and a sequence identity of more than 75%. eggNOG [23] analysis was carried out to determine whether the sequences are paralogous or orthologous.
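The two BLASTX thresholds can be illustrated with a minimal sketch. It assumes the hits were exported in the standard 12-column BLAST tabular layout (percent identity in column 3, e-value in column 11); the export format used in OmicsBox is not stated in the text, and the gene and protein names below are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    evalue: float


def parse_blast_tabular(lines: Iterable[str]) -> List[BlastHit]:
    """Parse 12-column BLAST tabular output (assumed layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append(BlastHit(cols[0], cols[1], float(cols[2]), float(cols[10])))
    return hits


def filter_hits(hits: Iterable[BlastHit],
                max_evalue: float = 1e-10,
                min_identity: float = 75.0) -> List[BlastHit]:
    """Keep hits with e-value < 1e-10 and identity > 75%, as in the text."""
    return [h for h in hits
            if h.evalue < max_evalue and h.percent_identity > min_identity]


# Hypothetical example rows, for illustration only.
example_rows = [
    "geneA\tprot1\t92.5\t300\t22\t0\t1\t900\t1\t300\t1e-150\t500",
    "geneB\tprot2\t60.0\t250\t100\t0\t1\t750\t1\t250\t1e-40\t200",
    "geneC\tprot3\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-5\t60",
]
kept = filter_hits(parse_blast_tabular(example_rows))
```

Only hits that satisfy both conditions are retained, so a highly significant but low-identity hit (`geneB`) and a high-identity but weakly significant hit (`geneC`) are both discarded.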

Bibliomic, database and similarity searches are the first steps in identifying candidate genes related to the biosynthesis pathways. These genes can be used in downstream analyses such as evolutionary studies [24] and the identification of SNPs in the biosynthetic genes [25]. From these steps, a total of 95 flavonoid related genes (FRGs) and two transcription factors were identified. Structural genes (i.e. F3H, CHI, CHS, DFR, LDOX) were found to be more conserved than transcription factors. These flavonoid related genes were then screened for the SNPs that reside in their sequences. **Figure 1** shows the three-step analysis workflow used to identify the potential flavonoid related genes.

#### **3.2 Mining of SNPs from genome and transcriptome data**

Single nucleotide polymorphisms (SNPs) have been widely used as genetic markers in crop improvement. Previous studies used SNPs to investigate evolutionary relationships [26], to facilitate cultivar identification [27] and to associate SNPs with agronomic traits [28]. SNPs are located in both intergenic and genic regions [29]. Genic SNPs can be classified by location into coding region, untranslated region (UTR) and intron SNPs. SNPs in the coding region are divided into synonymous and non-synonymous SNPs. Non-synonymous SNPs can have either a deleterious or a tolerated effect and might represent causal genetic variation that leads to phenotypic consequences [30].
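The distinction between synonymous and non-synonymous coding SNPs can be made concrete with a short sketch. It assumes the standard genetic code and a SNP given as a reference codon, a 0-based position within that codon and an alternative base; the function name is illustrative and is not taken from any of the annotation tools cited in this chapter.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, position: int, alt_base: str) -> str:
    """Return 'synonymous' or 'non-synonymous' for a single-base change."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "non-synonymous"


# GAA (Glu) -> GAG (Glu): the amino acid is unchanged.
silent = classify_coding_snp("GAA", 2, "G")
# GAA (Glu) -> GTA (Val): the amino acid is substituted.
missense = classify_coding_snp("GAA", 1, "T")
```

Whether a non-synonymous change is deleterious or tolerated cannot be decided from the codon table alone and requires a dedicated predictor, which is outside the scope of this sketch.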

**Figure 1.**

*Bioinformatic analysis workflow on the genome and transcriptomic datasets of six Malaysian rice varieties.*

*Computational Analysis of Rice Transcriptomic and Genomic Datasets in Search for SNPs…*

*DOI: http://dx.doi.org/10.5772/intechopen.94876*

SNPs in the coding regions are usually functional and can influence the phenotypic expression of agronomic traits such as resistance to disease, resistance to drought and high nutritional content [31]. Non-synonymous SNPs might affect protein function through amino acid substitution [32], whilst synonymous SNPs can affect gene function by regulating mRNA splicing [33], stability [34] and protein translation [35]. However, synonymous SNPs are not the preferred polymorphisms for further validation. Therefore, non-synonymous SNPs are prioritized to assess their involvement in the biosynthetic pathway and to determine their effect on phenotypic expression.

SNPs in the intron region can shift gene splicing or regulate the transcript level by changing the binding sites of miRNA [36]. The lower number of SNPs in the untranslated region (UTR) indicates that mutations in this region can change local mRNA structure and affect the translation process [36]. SNPs in the UTR are conserved and occur in binding sites for proteins or antisense RNAs that are able to modulate transport, RNA stability, cellular localization, expression level and translation [37].

The functional effects of SNPs in the flavonoid related genes can affect the expression and regulation of structural genes and transcription factors during flavonoid biosynthesis. Previous studies have demonstrated that genetic variation caused by SNPs can influence the variation and accumulation of metabolites in the biosynthetic pathway [19, 20]. SNPs selected for a specific functional effect, such as non-synonymous and deleterious SNPs, can be used effectively as molecular markers in the development of new and improved rice varieties with agronomic traits of interest.

However, it is challenging to rely on SNPs alone to reveal the interactions and relationships that occur across multiple levels of biological mechanisms [38], especially for qualitative and complex traits (e.g. abiotic stress, quality, yield).

*Recent Advances in Rice Research*


inadequate to be applied in rice breeding selection. Furthermore, this constraint limits detailed insight into the underlying mechanism and regulation of antioxidant content at the system level. The association of SNPs in the causative biosynthetic genes might be useful to uncover key alleles that influence the accumulation of flavonoids. Linking the SNPs with co-expressed genes that are involved in the flavonoid biosynthesis process could be a promising approach to prioritize the functional SNPs and causal genes for experimental validation, towards the molecular and genetic improvement of rice varieties enriched in flavonoid content.


Therefore, numerous studies have been performed to identify and evaluate the functional effects of SNPs through multi-omics data integration. For instance, the integration of whole genome and transcriptome data offers opportunities to highlight the expressed SNPs [39] and to better understand the biological processes and mechanisms underlying the pigmented rice varieties. Previous studies have integrated SNPs with gene co-expression networks to identify the causal genes and causal SNPs that might be responsible for blast disease resistance [40], salt tolerance [41] and amylose content [42]. Throughout these integration processes, several bioinformatics tools have been applied to identify the putative SNPs and to annotate them by functional effect.

The identification of SNPs in the flavonoid related genes started with mining of SNPs from the genome and transcriptome datasets of *O. sativa indica* cv. Bali, PH9, MRM16, MRQ100, MR297 and MRQ76 [43, 44]. Bali, PH9, MRM16 and MRQ100 are pigmented rice varieties with high antioxidant content. MR297 and MRQ76 are white rice varieties with medium tolerance to diseases. **Figure 1** shows the bioinformatic workflow used to identify SNPs from the genome and transcriptome datasets. The first step in SNP identification is to map the sequence reads onto a selected reference genome, in this case Nipponbare, which was chosen because it is well assembled and annotated. Genome and transcriptome reads were aligned separately with different mapping tools: BWA [45] for genome reads and TopHat2 [46] for transcriptome reads.

Then PICARD version 0.7.12 [47] was used to add or replace read groups, mark duplicate reads and fix mate information on the mapped reads in order to obtain high-quality SNPs during SNP calling. This post-processing is a standard step in identifying potential SNPs from genome and transcriptome datasets. Once it was completed, GATK version 3.6 [47] was used for SNP calling, a crucial step in SNP discovery as it helps to obtain high-quality SNPs and to reduce false-positive SNPs. In our case, the parameters used were as follows:

i. mapping quality (MQ) of at least 30 (MQ ≥ 30);

ii. variant quality (VQ) of at least 50 (VQ ≥ 50);

iii. number of supporting reads for every base (DP) of at least 10 (DP ≥ 10).
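The three thresholds (MQ ≥ 30, VQ ≥ 50, DP ≥ 10) can be applied to called variants as in the following sketch. It is not the GATK filtering command used by the authors. It assumes the calls are stored as VCF records, that "variant quality" corresponds to the VCF `QUAL` column, and that `MQ` and `DP` are present as keys in the `INFO` column; the example records are hypothetical.

```python
from typing import Dict


def parse_info(info_field: str) -> Dict[str, str]:
    """Turn a VCF INFO string such as 'DP=25;MQ=60.0' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
    return info


def passes_filters(vcf_line: str,
                   min_mq: float = 30.0,
                   min_qual: float = 50.0,
                   min_dp: int = 10) -> bool:
    """Check one VCF data line against the MQ, QUAL and DP thresholds."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual = cols[5]                 # QUAL column (assumed to be 'variant quality')
    info = parse_info(cols[7])     # INFO column
    if qual == "." or "MQ" not in info or "DP" not in info:
        return False               # missing values cannot satisfy the filter
    return (float(info["MQ"]) >= min_mq
            and float(qual) >= min_qual
            and int(info["DP"]) >= min_dp)


# Hypothetical records, for illustration only.
records = [
    "chr01\t1203\t.\tA\tG\t812.5\t.\tDP=25;MQ=60.0",
    "chr01\t5310\t.\tC\tT\t35.2\t.\tDP=40;MQ=58.1",
    "chr02\t771\t.\tG\tA\t120.0\t.\tDP=6;MQ=60.0",
]
high_quality = [r for r in records if passes_filters(r)]
```

A record is kept only when all three conditions hold, so the second record fails on quality and the third on read depth.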


All SNPs that passed the filtering process were annotated with several annotation tools, namely SnpEff version 4.1 [48], Variant Effect Predictor (VEP) [49], CooVar [50] and Annovar [51], to classify them as intergenic, genic, coding region, UTR, intron, synonymous or non-synonymous. Most of these tools can be installed locally, which eases the annotation of large numbers of SNPs. SNPs annotated as genic were then filtered using R packages (i.e. dplyr, tidyr, sqldf) to extract the SNP position, chromosome, allele, gene identifier and SNP effect. This information was then used to match the SNPs between the genome and transcriptome datasets and to screen all SNPs that reside in the flavonoid related genes.
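The matching and screening step can be sketched as follows. The authors performed it with R packages; this Python version is only an illustration under the assumption that each annotated SNP is a record with chromosome, position, alternative allele and gene identifier, and that a SNP "matches" when chromosome, position and allele agree in both datasets. All field names and identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

SnpRecord = Dict[str, object]


def snp_key(record: SnpRecord) -> Tuple[object, object, object]:
    """Identity of a SNP: chromosome, position and alternative allele."""
    return (record["chrom"], record["pos"], record["alt"])


def expressed_snps_in_genes(genome_snps: Iterable[SnpRecord],
                            transcriptome_snps: Iterable[SnpRecord],
                            gene_ids: Set[str]) -> List[SnpRecord]:
    """Genome SNPs that also occur in the transcriptome and lie in target genes."""
    transcriptome_keys = {snp_key(r) for r in transcriptome_snps}
    return [r for r in genome_snps
            if snp_key(r) in transcriptome_keys and r["gene_id"] in gene_ids]


# Hypothetical records, for illustration only.
genome = [
    {"chrom": "chr01", "pos": 1203, "alt": "G", "gene_id": "geneA", "effect": "non-synonymous"},
    {"chrom": "chr01", "pos": 5310, "alt": "T", "gene_id": "geneB", "effect": "synonymous"},
    {"chrom": "chr02", "pos": 771, "alt": "A", "gene_id": "geneC", "effect": "intron"},
]
transcriptome = [
    {"chrom": "chr01", "pos": 1203, "alt": "G", "gene_id": "geneA"},
    {"chrom": "chr02", "pos": 771, "alt": "A", "gene_id": "geneC"},
]
flavonoid_genes = {"geneA", "geneB"}
candidates = expressed_snps_in_genes(genome, transcriptome, flavonoid_genes)
```

Here only the SNP in `geneA` survives: the `geneB` SNP is absent from the transcriptome and `geneC` is not in the gene list.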

A comparative SNP analysis of the six rice varieties was performed to investigate the uniqueness, differences and similarities of the potential SNPs. Several rice SNP databases currently provide information on SNPs mined from various rice varieties and sub-species, such as Rice SNP-Seek [52], Rice VarMap [53], IC4R [54] and the Ensembl Plants Variation database [55]. Comparative analysis against these SNP databases can provide information on the SNPs and their usability in various rice varieties and sub-species.
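A comparison of this kind reduces to set operations once each SNP is given a stable identity. The sketch below assumes a SNP is identified by chromosome, position and alternative allele, and uses toy data for three of the varieties named above; it is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Snp = Tuple[str, int, str]  # (chromosome, position, alternative allele)


def compare_varieties(snps_by_variety: Dict[str, Set[Snp]]):
    """Return SNPs shared by all varieties and SNPs unique to each variety."""
    counts = Counter(snp for snps in snps_by_variety.values() for snp in snps)
    n_varieties = len(snps_by_variety)
    shared = {snp for snp, c in counts.items() if c == n_varieties}
    unique = {
        variety: {snp for snp in snps if counts[snp] == 1}
        for variety, snps in snps_by_variety.items()
    }
    return shared, unique


# Toy SNP sets, for illustration only.
toy = {
    "Bali": {("chr01", 100, "A"), ("chr01", 200, "T")},
    "PH9": {("chr01", 100, "A"), ("chr02", 50, "G")},
    "MRQ76": {("chr01", 100, "A")},
}
shared_snps, unique_snps = compare_varieties(toy)
```

SNPs that are neither shared by all varieties nor unique to one (present in some but not all) can be obtained by subtracting both results from the union of all sets.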

SNPs that occur in transcription factors often lead to altered or lost function of key pathway enzymes that are required to regulate the production of anthocyanin [56]. Hence, they could affect the expression levels of flavonoids. Previous research found that mutations accumulated during the domestication process, which suggests the presence of agronomically valuable genes in landraces as well as in wild relatives [15]. Hence, SNPs in transcription factors, such as *Kala4* and *Rc*, can be prioritized for further integration with the gene co-expression network.

#### **3.3 Gene co-expression network analysis from transcriptome data**

A gene co-expression network analysis was performed to correlate gene and phenotypic expression, to infer the function of unknown genes and to identify the key regulatory networks in biosynthetic processes [57, 58]. The principle behind a gene co-expression network is that genes that cluster in the same network are involved in similar biological processes [58]. To date, the increasing number of gene co-expression network databases, such as ATTED, STRING and RED, facilitates the exploitation of co-expressed genes from different conditions in various crops [59]. In these databases, the gene identifier or gene name can be used as a query to search for the co-expressed genes of interest. Parameters such as maximum number of interactors = 0.1 and confidence score cut-off = 0.40 are used to select significant co-expressed genes. Most co-expression databases obtain their information from microarray gene expression and transcriptome datasets in public databases such as NCBI Gene Expression Omnibus (NCBI GEO) [60] and Sequence Read Archive (SRA) [61].

Gene co-expression network analysis was performed using gene expression values in fragments per kilobase of transcript per million mapped reads (FPKM) as input data. Genes with an FPKM value of at least 0.1 (FPKM ≥ 0.1) were included in the analysis. The Pearson correlation coefficient (PCC), computed with the corrr R package [62], was used to measure the co-expression correlation between paired genes in the network. The PCC score represents the confidence level that two genes are functionally associated. A cut-off of PCC ≥ 0.7 was used to select co-expressed genes. The flavonoid related genes were used as guide genes to identify their interactors. **Figure 1** shows the analysis workflow for the gene co-expression network analysis.
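For a pair of genes with expression vectors $x$ and $y$ across samples, the PCC is $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. The guide-gene selection can be sketched in Python as below. The authors used the corrr R package, so this is an illustration only, and it makes one assumption the text leaves open: a gene is retained when its FPKM is at least 0.1 in every sample. Gene names and values are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpression_edges(fpkm: Dict[str, List[float]],
                       guide_genes: Set[str],
                       min_fpkm: float = 0.1,
                       min_pcc: float = 0.7) -> List[Tuple[str, str, float]]:
    """Edges between guide genes and their interactors with PCC >= min_pcc."""
    # Assumption: keep genes expressed at >= min_fpkm in all samples.
    expressed = {g: v for g, v in fpkm.items() if all(x >= min_fpkm for x in v)}
    edges = []
    for g1, g2 in combinations(sorted(expressed), 2):
        if g1 not in guide_genes and g2 not in guide_genes:
            continue  # at least one end of an edge must be a guide gene
        r = pearson(expressed[g1], expressed[g2])
        if r >= min_pcc:
            edges.append((g1, g2, r))
    return edges


# Hypothetical FPKM profiles over four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [0.0, 0.05, 5.0, 9.0],
}
edges = coexpression_edges(profiles, guide_genes={"geneA"})
```

With these profiles only the `geneA`/`geneB` pair is kept: `geneC` is negatively correlated with the guide gene and `geneD` fails the expression filter.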

Cytoscape v3.7 was used to construct and visualize the gene co-expression network. The Network Analyzer plugin was used to calculate the degree connectivity of the nodes and edges. ShinyGO version 0.6.1 [63] was used for gene ontology and pathway enrichment analysis to elucidate the biological processes and molecular functions in clusters of the network.
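Node degree, the simplest connectivity measure reported by network analysis tools, is the number of distinct neighbours of a node. The sketch below computes it from an undirected edge list; it is a plain illustration of the measure, not the Network Analyzer implementation, and the gene names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Degree of every node in an undirected network given as an edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}


# Hypothetical co-expression edges.
network = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC"), ("geneA", "geneD")]
degrees = node_degrees(network)
hub = max(degrees, key=degrees.get)
```

Highly connected nodes (hubs) are natural starting points when inspecting clusters of co-expressed genes.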

### **4. Data integration**

#### **4.1 Integration of SNPs with co-expressed genes**

Several publications describe bioinformatics approaches that assess the functional effect of SNPs by evaluating their effect on the functional sites of protein structures [64, 65] and by integrating them with biological pathways [66]. In this study, the impact of SNPs, especially those with a non-synonymous effect and deleterious impact, is studied through pathway-network analysis coupled with a different omics data type.

SNPs can be integrated with other omics data types to rank high-potential SNPs and to highlight causal genes for the development of genetic markers and for functional genomics studies. The integration of SNPs and co-expressed genes through the construction of a biological network offers a comprehensive interpretation of genetic variation at the biological system level. This integrative approach links biologically meaningful sets of genes to reveal the molecular basis of trait variation.

In this study, SNPs and co-expressed genes were integrated to prioritize the functional SNPs that can be suggested as candidates for the development of molecular markers and to highlight the causal genes that might contribute to the flavonoid biosynthesis process. The co-expressed genes can be suggested as targets for functional validation in future studies to unravel their potential in rice breeding improvement.

Integration of SNPs and co-expressed genes was performed using Cytoscape v3.7 [67]. Flavonoid biosynthetic pathway templates in BioPAX, KGXML and SIF formats were retrieved from the KEGG, Plant Reactome and RiceCyc databases. To merge the SNPs and co-expressed genes, the gene identifier (ID) was used as the matching key. Every gene in the network is described by its gene ID, chromosome, and start and stop positions in physical genome coordinates (bp). Similarly, each SNP is described by its position in the genome (bp) and its gene ID.
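The merge on gene ID can be sketched outside Cytoscape as a simple join. The example below assumes gene and SNP records carry the fields listed above (field names are hypothetical) and adds one sanity check that is not described in the text: a SNP is attached only if its position lies within the start and stop coordinates of the gene.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Record = Dict[str, object]


def attach_snps_to_genes(genes: Iterable[Record], snps: Iterable[Record]) -> List[Record]:
    """Join SNPs to network genes using the gene ID as the matching key."""
    snps_by_gene = defaultdict(list)
    for snp in snps:
        snps_by_gene[snp["gene_id"]].append(snp)

    merged = []
    for gene in genes:
        inside = [
            snp for snp in snps_by_gene.get(gene["gene_id"], [])
            # Sanity check (assumption): the SNP must fall inside the gene.
            if gene["start"] <= snp["pos"] <= gene["stop"]
        ]
        merged.append({**gene, "snps": inside})
    return merged


# Hypothetical records, for illustration only.
network_genes = [
    {"gene_id": "geneA", "chrom": "chr01", "start": 100, "stop": 500},
    {"gene_id": "geneB", "chrom": "chr01", "start": 1000, "stop": 2000},
]
snp_table = [
    {"gene_id": "geneA", "pos": 150},
    {"gene_id": "geneA", "pos": 900},
    {"gene_id": "geneC", "pos": 10},
]
integrated = attach_snps_to_genes(network_genes, snp_table)
```

Genes that end up with one or more attached SNPs can then be ranked or highlighted in the network as candidates.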

Network integration analysis between SNPs and co-expressed genes highlights that a co-expressed gene can carry multiple SNPs and reveals which SNPs appear to play an essential role in the flavonoid biosynthesis process. The co-expressed genes connected to SNPs can be prioritized as candidate genes. The high false-positive rate of SNPs can also be reduced by incorporating information on putative functional co-expressed genes [68]. Several biological questions can be asked of the integration of SNPs and co-expressed genes into a pathway-network, such as: i) how many functional SNPs and co-expressed genes are important to the expression of black and red pigmentation; ii) do any regulatory genes regulate the biosynthetic pathway; and iii) which biological processes underlie this trait.

#### **4.2 Pathway-network analysis**

Different bioinformatics approaches can be applied to pathway-network analysis. In pathway-network analysis, the descriptions of connected genes are interpreted into biologically meaningful information that provides insights into biological processes, molecular functions and cellular components. This analysis is known as gene ontology (GO) enrichment and pathway enrichment analysis.

To date, several bioinformatic tools are available to perform GO and pathway enrichment analysis on the network, for example ClueGO [69] and BiNGO [70]. Both are available as Cytoscape plugins and are user-friendly. Statistical methods such as the hypergeometric test and the Bonferroni correction are used to calculate the p-value. Parameters such as a p-value of at most 0.05 (p-value ≤ 0.05) and a minimum number of mapping entries ≥ 2 can be used to select the significant or enriched genes in the pathway-network.
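The statistics behind such an enrichment test can be written out in a few lines. With $N$ background genes, $K$ of them annotated to a term, and $k$ annotated genes among the $n$ genes in the network, the hypergeometric p-value is $P(X \ge k) = \sum_{i=k}^{\min(K,n)} \binom{K}{i}\binom{N-K}{n-i} / \binom{N}{n}$. The sketch below is not the ClueGO or BiNGO implementation. It assumes the Bonferroni correction multiplies each p-value by the number of terms supplied, and the term names and counts are hypothetical.

```python
from math import comb
from typing import Dict, Tuple


def hypergeom_pvalue(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) for k annotated genes in a sample of n from N (K annotated)."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def enriched_terms(term_counts: Dict[str, Tuple[int, int]],
                   N: int, n: int,
                   alpha: float = 0.05,
                   min_genes: int = 2) -> Dict[str, float]:
    """Terms with Bonferroni-adjusted p <= alpha and at least min_genes hits.

    term_counts maps a term to (K, k): annotated genes in the background
    and annotated genes in the network.
    """
    n_tests = len(term_counts)  # assumption: correct for all supplied terms
    significant = {}
    for term, (K, k) in term_counts.items():
        if k < min_genes:
            continue
        p_adjusted = min(1.0, hypergeom_pvalue(N, K, n, k) * n_tests)
        if p_adjusted <= alpha:
            significant[term] = p_adjusted
    return significant


# Hypothetical counts: 100 background genes, 10 genes in the network.
terms = {
    "flavonoid biosynthesis": (5, 4),
    "translation": (50, 5),
    "singleton term": (1, 1),
}
result = enriched_terms(terms, N=100, n=10)
```

In this toy case only the first term is reported: the second is not over-represented and the third is excluded by the minimum of two mapped entries.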

**Tables 1** and **2** list the bioinformatics tools and databases used for data mining and data integration in the search for potential SNPs involved in flavonoid biosynthesis.

| Bioinformatic tools | Description | References |
| --- | --- | --- |
| BLASTX | Sequence similarity search analysis | [22] |
| eggNOG | Annotation of orthologous and paralogous sequences | [23] |
| BWA | Genome mapping | [45] |
| TopHat2 v2.3 | Transcriptome mapping | [46] |
| Picard v0.7.12 | Post mapping processes | [47] |
| GATK v3.6 (HaplotypeCaller) | SNPs discovery | [47] |
| SnpEff v4.1 | SNPs annotation | [48] |
| R packages (dplyr, tidyr, sqldf) | Data cleaning and filtering | [71] |
| R package (corrr) | Gene co-expression analysis | [62] |
| ShinyGO | Gene ontology enrichment analysis | [63] |
| Cytoscape v3.7 | Data integration | [66] |
| Cytoscape v3.7 plugin ClueGO | Gene ontology enrichment analysis | [68] |

#### **Table 1.**

*Summary of the bioinformatic tools used in search for potential SNPs involved in the flavonoid biosynthetic pathways.*

| Bioinformatic databases | Descriptions | URL | References |
| --- | --- | --- | --- |
| Rice Annotation Project Database (RAP-DB) | A rice genome database developed by the International Rice Genome Sequencing. Information provided includes genome sequences, chromosomes, gene annotations and descriptions. | https://rapdb.dna.affrc.go.jp/ | [72, 73] |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) | A pathway database that provides biological information related to genes, proteins, enzymes and pathways involved in biological systems. | https://www.kegg.jp/ | [74, 75] |
| PlantReactome | A plant pathway database that provides users with the genes, proteins, enzymes and reactions involved in specific biological systems. | https://plantreactome.gramene.org/ | [76] |
| RiceCyc | A rice metabolic pathway database developed to provide predicted biochemical pathways in rice. Biological information such as genes, proteins, enzymes and reactions is displayed in diagrams, and all the data can be downloaded by the user. | http://pathway.gramene.org/gramene/ricecyc.shtml | [77] |

#### **Table 2.**

*Summary of the biological databases used in search for potential SNPs involved in the flavonoid biosynthetic pathways.*

### **5. Repository of omics data**

The continuously increasing volume of biological data calls for databases to store, organize and manage it systematically. To date, several rice databases are available for application in rice breeding programmes, such as rice genome
