
## **Meet the editors**

Dr Germana Meroni is a graduate of the University of Milan, Italy. She was a post-graduate fellow at the Department of Biotechnology of the San Raffaele Hospital in Milan, Italy, and then post-doctoral fellow at the Department of Human and Molecular Genetics of Baylor College of Medicine, Houston, TX (USA). She established her research group at the Telethon Institute of Genetics and Medicine (TIGEM) in Naples and then moved as leader of the Functional Genomics Laboratory at the Cluster in Biomedicine within AREA Science Park in Trieste, Italy.

Francesca Petrera is a graduate of the University of Trieste and she was a PhD student at the ICGEB - International Centre for Genetic Engineering and Biotechnology in Trieste, Italy. She works in Germana Meroni's group in the Functional Genomics Laboratory at the Cluster in Biomedicine within AREA Science Park in Trieste.

## Contents

**Preface**

Chapter 1 **Beyond the Gene List: Exploring Transcriptomics Data in Search for Gene Function, Trait Mechanisms and Genetic Architecture**  Bregje Wertheim

Chapter 2 **The REACT Suite: A Software Toolkit for Microbial** *RE***gulon** *A***nnotation and** *C***omparative** *T***ranscriptomics**  Peter Ricke and Thorsten Mascher

Chapter 3 **Analysis of Gene Expression Data Using Biclustering Algorithms**  Fadhl M. Al-Akwaa

Chapter 4 **RNAi Towards Functional Genomics Studies**  Gabriela N. Tenea and Liliana Burlibasa

Chapter 5 **Genome-Wide RNAi Screen for the Discovery of Gene Function, Novel Therapeutical Targets and Agricultural Applications**  Hua Bai

Chapter 6 **How RNA Interference Combat Viruses in Plants**  Bushra Tabassum, Idrees Ahmad Nasir, Usman Aslam and Tayyab Husnain

Chapter 7 *Medicago truncatula* **Functional Genomics – An Invaluable Resource for Studies on Agriculture Sustainability**  Francesco Panara, Ornella Calderini and Andrea Porceddu

Chapter 8 **Repetitive DNA: A Tool to Explore Animal Genomes/Transcriptomes**  Deepali Pathak and Sher Ali

Chapter 9 **Dynamic Proteomics: Methodologies and Analysis**  Sara ten Have, Kelly Hodge and Angus I. Lamond



## Preface

The completion of the Human Genome Project was not only the achievement of a goal, but the starting point for new projects towards complete understanding of the function of genes and of the regulatory regions in the genome.

Today, sequencing data obtained from large international consortia, deriving from an incredible number of species, not only humans, are revealing unexpected functions for non-coding regions of the genome and also the effect of the variability among individuals. Genome-wide variation, gene expression, both at mRNA and at protein level, protein-protein interaction and network organization are only some examples of the data produced by these kinds of analyses.

High-throughput data for genome-wide analysis derive from new technologies on the market, like next-generation sequencing, which produce extraordinary amounts of information that require computer-based analysis downstream. In this scenario, new bioinformatics tools are needed for the analysis of these complex data and for the integration of existing information with new information deriving from different sources.

This book titled **"Functional Genomics"** contains a selection of chapters focused on crucial topics in functional genomics, from the analysis of the genetic code, to the understanding of the role of the different genes and to the proteomic implications. The book provides an overview of basic issues and some of the recent developments in medicinal science and technology. Covering all the aspects involved in such a broad theme as functional genomics and in all its applications would be impossible within the same book. The different chapters represent a brief introduction to the topic, connecting the most promising developments in functional genomics technologies, focusing on specific applications in biomedicine, agro-food technologies and zootechniques.


The primary target audience for the book includes students, researchers, biologists, bioinformaticians, biotechnologists and professionals who are interested in the application of functional genomics in the different life science areas.

> **Dr. Germana Meroni**  Functional Genomics Lab, CBM - Cluster in Biomedicine, Trieste, Italy

> **Dr. Francesca Petrera** Functional Genomics Lab, CBM - Cluster in Biomedicine, Trieste, Italy

## **Beyond the Gene List: Exploring Transcriptomics Data in Search for Gene Function, Trait Mechanisms and Genetic Architecture**

Bregje Wertheim

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48239

#### **1. Introduction**

Since the start of genomics research, genome-wide expression studies have been used prolifically as a tool to improve our understanding of the involvement of genes in various biological processes. Measuring gene expression patterns simultaneously across all the genes in the genome, i.e. transcriptomics, is a uniquely powerful technology to explore potential novel candidate genes for a particular process. This genome-wide approach has the huge advantage that we do not have to specify in advance which genes we believe to be involved, and as such, we are not limited by our current knowledge. Transcriptomics is an important first step to study traits that are under the control of several to many genes (i.e., polygenic traits) and responsive to external conditions and internal states (i.e., multifactorial traits).

The identification of potential novel candidate genes, however, is only a limited part of the power of transcriptomics. With this technology, the expression of thousands of genes is measured simultaneously. It provides a snapshot of all genes that are actively transcribed during a particular process. When we compare these measurements between conditions or treatments, those genes that are expressed at a higher or lower level under a particular condition can be identified. As such, transcriptomics maximizes the awareness of effects anywhere in the genome, including those associated with costs, trade-offs and epistatic interactions. This could be viewed as a complication of transcriptomics data, because a change in expression does not necessarily reflect a causal relationship to the process of interest. In fact, however, it is also one of the major strengths of this technology. By combining various bioinformatic tools and resources, it is possible to obtain an insight into intricate gene-interaction networks, the regulatory control of traits, and the implications of a trait or process on the full phenotype.

© 2012 Wertheim, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In functional genomics, transcriptomics studies are typically a comparison between biological samples (e.g., a cell type, organ, individual, or group of individuals) that were collected under different conditions, to analyse which genes were up-regulated or down-regulated (i.e., were expressed at higher or lower levels, respectively) in response or relation to the condition. These conditions can be experimentally induced (e.g., treatment *versus* control, different dosages of a chemical, different food conditions or temperatures, etc.), or they represent different natural stages (e.g., diseased *versus* healthy, male *versus* female, different developmental stages or aged individuals, different genotypes, different tissues, different epigenetic profiles, etc.). Including a proper control treatment or reference is crucially important for the interpretation of gene expression differences that result from such a comparison. There will always be a large number of genes expressed in any biological sample, and without control or reference, it is impossible to attribute expression of particular genes to the condition of interest. The purpose of transcriptomics is to reveal how the expression patterns *change* under different conditions.


Transcriptomics technology is used to characterize the composition of the messenger RNA (mRNA) pools from each biological sample. The mRNAs are the transcripts of a gene that carry the information encoded in the gene to the site of protein synthesis. When a particular mRNA is present in a biological sample, it implies that the corresponding gene was expressed, and a template is available for the synthesis of the protein product of that gene. The abundance of each mRNA in the pool represents the level of expression of the corresponding gene. By comparing the relative proportional representation of each mRNA in the total mRNA pool among the samples, we can identify which genes differed in expression in response or relation to the compared conditions. The most widely used technological platforms for whole-genome expression studies are microarrays, although the sequencing of the transcriptome is rapidly increasing in popularity (Figure 1).

Microarrays are solid-based platforms (e.g., glass slides), containing millions of copies for thousands of 'reporter probes' that comprise part of the sequences of the genes in the genome. By binding (or 'hybridizing') fluorescent-labelled copies of the original mRNAs to the probes, measuring the label intensities for each position on the array, and associating these positions to their specific reporter probes, one can infer the presence and abundance of each transcript in the labelled RNA pool (Figure 1). It is assumed this representation is proportional to their abundances in the original mRNA samples. Microarrays are relatively cheap, and the tools to analyse the data have been developed, matured and tested. This makes microarrays an affordable and accessible platform for many applications [1]. After the initial introduction of expression arrays that reported only on known or predicted genes, tiling arrays were developed that contained reporter probes across the full genome, including the non-coding, non-translated and non-transcribed chromosomal regions. This enabled the identification of novel transcripts, including non-coding RNA genes, as well as a better characterization of splice variants and exons [2].

The latest developments in next-generation sequencing technologies are making transcriptome sequencing more affordable, and they provide a number of advantages over microarrays [3]. For this approach, the mRNA pool is converted into cDNA (either wholly or after a partial digestion), which is then used as template for high-throughput sequencing. The generated sequence information is mapped to, or assembled into, a reference transcriptome, and the number of sequence copies generated for each gene is used to infer the number of mRNA copies in the original sample (Figure 1). Sequencing approaches provide more comprehensive information on the transcript characteristics (e.g. splice variants, mRNA sequence variations, gene fusions, etc.), they are not limited to the known or predicted genes of an organism or the genes represented on a microarray, and they avoid some problems inherent to slide-based technologies [4]. A downside of transcriptome sequencing is that the quality control, pre-processing and analysis procedures for these data have not yet fully matured, and the assembly of, or mapping against, the reference transcriptome requires substantial computing power, making this technology still less accessible.
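The step from per-gene sequence counts to a relative representation in the mRNA pool can be illustrated with a minimal sketch. The gene names and read counts below are hypothetical, and the function name `relative_abundance` is introduced here only for illustration; real analyses rely on dedicated pre-processing and normalization procedures rather than on a plain proportion.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-gene read counts into each gene's share of the sample total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical read counts for three made-up genes in two samples.
# The treated sample was sequenced twice as deeply as the control.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 400, "geneB": 600, "geneC": 1000}

control_share = relative_abundance(control)
treated_share = relative_abundance(treated)

for gene in control:
    print(gene, round(control_share[gene], 3), round(treated_share[gene], 3))
```

In this toy data set the raw count of `geneB` doubles between samples, yet its share of the pool is unchanged, whereas the share of `geneA` doubles. This is why the comparison is made on proportional representation rather than on raw counts.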


In essence, both technological platforms yield data of very similar nature, although the information of sequencing approaches may be more specific and detailed than array-based approaches. After the specific pre-processing that each platform requires, the data can be analysed with similar methods, leading to a list or ranking of genes that show changes in expression patterns or transcript characteristics (e.g. splice variants) among the compared conditions. As such, the gene list provides a first step to identify the genes that potentially matter or are affected by a particular condition. A change in expression, however, is insufficient evidence for establishing a clear link between a gene and the trait of interest. At best, the genes on the list may be associated with the trait or condition of interest, while causality or direct involvement in the trait still needs to be established through additional empirical approaches.

Before discussing how gene lists can be generated or used for further analysis, it is important to emphasize that certain limitations are inherent to transcriptomics data. These limitations can be specific to the platform used; for instance, microarrays can only report on the activity of genes that are known or represented on the array. Most limitations, however, are irrespective of the technology. As mentioned, genes that are differentially expressed are not necessarily causal to a particular trait or response. Moreover, not all the genes that are involved in a response or trait are detected by a change in expression. Any post-transcriptional modifications or non-transcriptional processes (such as the re-directing of a transcription factor from its regular processes towards another function) are typically not detectable by a change in gene expression. A further precautionary note is warranted for the design, set-up and execution of any transcriptomics study. An essential requirement for associating changes in gene expression among different samples to a particular condition or treatment of interest is to ensure that the only difference is the condition or treatment of interest. For example, the collection of control and treatment samples should be done simultaneously (e.g., not before infection and 12 hours after infection) by the same person, so that circadian rhythms or handling effects do not differ between the samples. If such precautions are not taken, genes responsive to the treatment are confounded with genes responsive to these extraneous factors. It is impossible to resolve such confounding effects after the measuring of gene expression. The only way to avoid such issues is to take due care during experimental design, sample collection and sample preparation. Despite these limitations that are inherent to any transcriptomics technology, the resulting data do provide an array of possibilities for further meaningful analysis.


**Figure 1.** Schematic overview of transcriptomics approaches, using microarrays or transcriptome sequencing. Although the technologies differ, both approaches compare all the mRNAs in biological samples under different conditions, and provide quantifications of the abundance of all gene transcripts for each sample. *Images of GeneChips® courtesy of Affymetrix.*


In this chapter, I will illustrate various ways in which transcriptomics data can be analysed, to identify novel candidate genes for the process of interest, and additionally, how to move beyond this list of candidate genes towards the molecular mechanisms and gene interaction networks of a trait. For these illustrations, I will mostly use transcriptomics data on the innate immunity in *Drosophila* larvae after parasitism. Our analysis of the transcriptomics during the acute immune response to infection by parasitic wasps [5], as well as between strains that differ genetically in resistance to these parasites [6], revealed a complex gene interaction network associated with defense mechanisms. The immune response to parasites is triggered by recognizing the invasion of the parasite, and comprises the proliferation and differentiation of specialized blood cells that surround the parasite in a multi-layered capsule, and the sealing of the capsule with a layer of melanin. This melanotic encapsulation sequesters and kills the parasite [7]. By integrating the data from our studies with various resources and bioinformatics approaches, we gained a more comprehensive insight into the interactive and regulatory network of genes that are associated with the immune response to parasitism. We identified shared regulatory elements among genes that showed similar expression patterns, physiological costs associated with evoking the immune response, chromosomal positions that were associated with resistance traits and indications for epistatic gene-interactions. Combined, this information provided us with new insights into the mechanisms and complex genetic architecture of the innate immune response.

#### **2. Constructing a list of genes with differential expression**

The fundamental purpose of a transcriptomics experiment is to identify the genes with changed expression under a particular condition, which is done by comparing the abundance measurements for each gene transcript among the biological samples. Depending on the platform used, these abundance measurements are derived from fluorescence intensity measurements captured in digital images of the microarrays, or the counts of the number of transcripts for sequencing approaches (Figure 1). These measurements, however, are not only reflecting the biologically interesting variation in gene expression under the different conditions, but also a substantial level of technical variation that is introduced during the preparation and measuring of the samples. This includes, for example, residues of reagents that create a background signal on microarrays, short fragments of RNA that bind non-specifically to microarray probes or cannot be uniquely mapped to a reference genome, slight differences in RNA doses for the different samples, or slight differences among samples/batches in the efficiency of the molecular techniques. Some of these aspects affect whole samples, while others are specific to particular genes. To perform the meaningful comparisons on the variation in gene expression measurements, it is typically essential to first eliminate the bias introduced by technical variation as much as possible.

The raw intensity measurements first need to be pre-processed to deal with the technical variation, normalized to scale all samples to the same range, and combined into a single expression value per gene per sample for comparisons. Many different approaches have been developed for the pre-processing and normalization of microarrays, and subsequent studies have tried to determine the optimal strategy to remove the noise without introducing bias. Some approaches outperform others, and consensus has mostly been reached for the commonly used microarray platforms, although full consensus for all microarray platforms is still lacking [8]. For transcriptome sequencing, normalization is also important, to address deviations due to slight differences in the amounts of RNA, gene length and GC content. The best pre-processing and normalization approaches for transcriptome sequencing are still being established (e.g., [4,9]).
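As a minimal sketch of the scaling step for sequencing data, the snippet below converts raw read counts to counts per million (CPM), so that samples with different sequencing depth become comparable. The gene names and counts are invented for illustration, and CPM is only one simple option that is not taken from this chapter; it corrects for library size, but not for gene length or GC content.

```python
def counts_per_million(counts):
    """Scale the raw read counts of one sample to counts per million (CPM)."""
    library_size = sum(counts.values())
    return {gene: 1e6 * n / library_size for gene, n in counts.items()}

# Hypothetical raw counts for the same three genes in two samples;
# sample_b was sequenced twice as deeply as sample_a.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 200, "geneB": 600, "geneC": 1200}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
```

After scaling, each gene has the same value in both samples, because the only difference between them was sequencing depth.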



To statistically test for changes in gene expression, biological replication is essential. Having multiple biological units for each condition enables the estimation of variation within and between conditions, which allows all variation to be partitioned into noise (i.e., technical and random variation) and the biologically interesting variation that reflects the changes in gene expression patterns. Technical replication is sometimes also incorporated in the platform or analysis, for example by repeating the same probes on a microarray, by applying a dye-swap on samples, or by testing the same samples twice. Although this can increase the accuracy and sensitivity of the estimate of technical variation, it is generally less important than biological replication for increasing the sensitivity and power of the analysis. The minimum number of replicates required for a transcriptomics experiment depends, among other things, on the objective of the experiment, the required sensitivity, the type of microarray or sequencing method used, the experimental design, and the number of treatment groups [10]. Measuring gene expression across a time course may also be a powerful way to increase the power of the analysis, as well as providing a means to determine the sequence of action for genes.

For the statistical analysis of transcriptomics data, many different alternatives are available. Most tests developed for microarray data or transcriptome sequencing are essentially modifications of more standard statistical tests [8]. To identify the genes showing differential expression (i.e., differences in expression level) among treatments or conditions, many of the statistical procedures consist of some form of variance analysis, and test whether the variance in expression patterns among treatments or conditions exceeds the variance between biological replicates within a treatment. The most commonly used tests include (modifications of) t-tests, ANOVAs, regression analysis, mixed models and generalized linear models. The modifications of these tests serve primarily to increase power for the often small sample sizes, and to avoid violating the assumptions of the parametric tests, in particular the assumptions of a Normal distribution and independence among measurements. Modifications include methods to shrink variance estimates (using combined information on variance from the large number of measurements on a single sample), permutation approaches and empirical Bayesian methods. As with the choice for the number of biological replicates, the best statistical approach depends on the objective of the experiment, the transcriptomics platform used, the experimental design, the number of treatment groups and the number of replicates per treatment.
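To make the comparison of between-condition and within-condition variation concrete, the sketch below computes a plain Welch *t*-statistic per gene. It is an unmodified textbook test, not one of the moderated or empirical Bayesian variants mentioned above, and the gene names and expression values are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of replicate measurements."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Standard error of the difference, without assuming equal variances.
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_b - mean_a) / standard_error

# Hypothetical log2 expression values, three biological replicates each.
expression = {
    "geneA": {"control": [8.0, 8.2, 7.9], "treated": [10.1, 9.8, 10.3]},
    "geneB": {"control": [5.0, 6.1, 4.2], "treated": [5.3, 4.6, 5.9]},
}

t_values = {
    gene: welch_t(groups["control"], groups["treated"])
    for gene, groups in expression.items()
}
```

In this invented data set, geneA shows a large difference between conditions relative to the spread among its replicates and therefore gets a large *t*-value, whereas the small shift in geneB is swamped by replicate-to-replicate variation.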

Not only the statistical significance, but also the magnitude of a change in expression (the 'fold change') between conditions is often provided, sometimes as a proxy for biological significance. Fold changes are typically provided on a log2 scale, so that they are centred around zero, and a doubling or halving of the expression level in the treatment compared to the control results in an equal deviation from zero. These fold change data can be plotted to visualize the differentially expressed genes, either in relation to the average expression level of that gene (MA-plot, Figure 2a), or in relation to the statistical significance (volcano plot, Figure 2b). It should be realized, however, that fold changes are fickle indicators of biological significance. Firstly, depending on the position and role of a particular gene in a regulatory network (e.g., a central transcription factor, or a direct regulator of transcriptional activity), a small fold change may have large biological implications. Large fold changes can primarily be expected at the margins of these networks, which may involve the final effectors of the response but reveal little about its key regulators. Secondly, microarrays typically only detect large fold changes in the intermediate range of expression values. Low levels of expression may be below the detection limit of the array, and background noise or corrections may obscure any changes in the expression of such genes. High levels of expression may result in saturation of the probes, vastly underestimating the actual fold changes. Transcriptome sequencing approaches are not biased towards these intermediate expression levels, but can instead suffer from exaggerated fold-change estimates for genes that are not expressed, or expressed at a very low level, in one or both samples (when the denominator approaches zero).
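The symmetry of the log2 scale, and the inflation of fold changes when the denominator approaches zero, can be illustrated with a few lines of arithmetic. In this sketch the expression values are invented, and the pseudocount of 1 is an assumed, commonly used damping device rather than something prescribed in this chapter.

```python
import math

def log2_fold_change(treated, control, pseudocount=0.0):
    """log2 ratio of treated over control expression."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

doubling = log2_fold_change(200.0, 100.0)  # expression doubled
halving = log2_fold_change(50.0, 100.0)    # expression halved

# A gene that is barely detected in the control: the tiny denominator
# inflates the ratio, and an assumed pseudocount of 1 damps the estimate.
raw = log2_fold_change(20.0, 0.1)
damped = log2_fold_change(20.0, 0.1, pseudocount=1.0)
```

Here `doubling` and `halving` are equally far from zero (1 and -1), while `raw` is considerably larger than `damped` even though both describe the same gene.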


**Figure 2.** Plots that summarize the fold-change differences in gene expression between two conditions. a) MA plots portray for each gene the average gene expression across the two conditions on the x-axis (A), and the log2 fold change difference in expression between the two conditions on the y-axis (M). b) Volcano plots portray for each gene the log2 fold change in difference of expression between the two conditions on the x-axis (Fold Change) and the statistical significance for the *t*-test on expression measurements between the two conditions on the y-axis (–log10 *P*-value). The presented data is on *Drosophila* larvae 12 hours after being parasitized and control larvae (that had not been parasitized) [5]. The 'outliers' in both plots represent genes that differed in expression between the two conditions. In red are the genes that both scored a *P*-value < 0.001 and had at least a 2-fold change in expression between the two conditions. Applying these combined criteria for assigning significance would exclude several 'outlier' genes with high average expression levels (a) and/or with low p-values (b).

Finally, to determine the genes with significant differences in expression among conditions or treatments, a statistical correction needs to be applied for the large number of statistical tests in each experiment (i.e., multiplicity or multiple testing). In a transcriptomics experiment, several thousands of genes are tested, and each gene is analysed for differences in expression among conditions. In statistics, we normally use a type I error rate of α = 0.05, which means that we accept that in 5% of the cases where the null hypothesis (*H0*: no differences among conditions) is actually true, we nevertheless reject it and call something 'significantly' different, although the observed difference arose purely by chance. If we do not correct the type I error rate while performing thousands of statistical tests (i.e., one for each gene), hundreds of genes would be called significantly differentially expressed while these differences arose merely by chance. Genes that are deemed differentially expressed while they are not, are *false positives*. Genes that are deemed not differentially expressed while they are, are *false negatives*. Correcting for false positives in large-scale experiments is needed to avoid including many erroneous calls, but it needs to be carefully balanced against controlling for false negatives to ensure optimal sensitivity and accuracy of the analysis.
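The scale of the problem can be checked with a small simulation. The sketch below assumes 10,000 tested genes, none of which truly differs between conditions, so that every *P*-value is drawn from a uniform distribution; both the number of genes and the random seed are arbitrary choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n_genes = 10_000  # assumed number of genes tested
alpha = 0.05

# Under the null hypothesis, P-values are uniformly distributed on [0, 1).
null_p_values = [random.random() for _ in range(n_genes)]
false_positives = sum(p < alpha for p in null_p_values)

expected = alpha * n_genes  # about 500 genes expected by chance alone
```

Without any correction, roughly 5% of these null genes fall below the threshold, which corresponds to the "hundreds of genes" called significant by chance.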


| Gene name (symbol) | Gene Ontology annotation | InterPro annotation |
|---|---|---|
| αPS4 | Cellular Component: Integrin complex; Biological Process: Cell adhesion; Biological Process: Cell-matrix adhesion; Biological Process: Heterophilic cell-cell adhesion; Molecular Function: Cell adhesion molecule binding; Molecular Function: Receptor activity | Integrin alpha chain; Integrin alpha-2; FG-GAP; Integrin alpha beta-propellor; Integrin alpha chain, C-terminal cytoplasmic region, conserved site |
| lectin-24A | Cellular Component: -; Biological Process: Galactose binding; Molecular Function: - | C-type lectin; C-type lectin fold |
| Thiolester containing protein II (TepII) | Cellular Component: Extracellular space; Biological Process: Antibacterial humoral response; Biological Process: Defense response to gram-negative bacterium; Biological Process: Phagocytosis, engulfment; Molecular Function: Endopeptidase inhibitor activity; Molecular Function: Peptidase inhibitor activity | Terpenoid cyclases/protein prenyltransferase alpha-alpha toroid; Alpha-macroglobulin, receptor-binding; Alpha-2-macroglobulin, N-terminal; Alpha-2-macroglobulin, N-terminal 2; A-macroglobulin complement component; Alpha-2-macroglobulin, conserved site; Alpha-2-macroglobulin, thiol-ester bond-forming |

The typical statistical correction for false positives in non-genomic experiments with multiple testing is a Bonferroni correction, which divides α by the number of statistical tests applied to the data. This approach, however, is often too conservative for the thousands of tests in transcriptomics analyses (i.e., it too often fails to reject the null hypothesis *H0* when it is false), and would result in a large number of false negatives. The most widely used correction for multiple testing in transcriptomics analysis is a False Discovery Rate (FDR) correction, which attempts to provide a more even balance between false positives and false negatives. Several FDR approaches exist, but they generally adjust or replace the *P*-value for significance to reflect the likely proportion of false positives among the genes that are called significant. For example, if we identify 100 genes with an FDR-adjusted *P*-value (*Padj* or *q*-value) of <0.05, we would on average expect fewer than 5 of these genes to be false positives [11]. The acceptance level for significance used with FDR often ranges from *Padj* <0.001 to <0.10, depending on the desired sensitivity and accuracy, the sample size (i.e., power) and the estimated number of genes with differential expression.
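The difference between the two corrections can be sketched in a few lines. The FDR procedure shown here is the Benjamini-Hochberg step-up method, which is one of the several FDR approaches and is chosen here as an assumption; the ten raw *P*-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, so that adjusted values
    # never increase as the raw P-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P-values for ten genes.
p_values = [0.0001, 0.004, 0.008, 0.012, 0.02, 0.04, 0.2, 0.4, 0.6, 0.9]

bonf_hits = bonferroni(p_values)
p_adj = benjamini_hochberg(p_values)
fdr_hits = [i for i, p in enumerate(p_adj) if p < 0.05]
```

With these values, the Bonferroni threshold of 0.05/10 retains only the two smallest *P*-values, whereas five genes have a Benjamini-Hochberg adjusted *P*-value below 0.05, which illustrates the more conservative nature of the Bonferroni correction.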

The end result of all pre-processing steps, normalisation, statistical analyses and corrections for false positives is a list or ranking of genes that significantly changed expression in response or relation to the different conditions that were compared. This list contains potential candidate genes that may be actively involved in the process of interest. However, the list also includes many genes that are only indirectly associated with the response or process of interest. Moreover, the gene list does not contain *all* the (candidate) genes that are involved in the process, but only those that could be detected by transcriptomics under the particular experimental conditions (e.g., time points during the response, sample sizes, technological platform) and analysis choices (e.g., normalization approach, acceptance thresholds for significance). Finding gene expression changes in a transcriptomics experiment is neither necessary nor sufficient evidence for the function of a gene or its involvement in a biological process. It is, however, a valuable starting point for further analysis.

#### **3. Standard explorations of the gene list**


The first inspection of a gene list is typically to link the gene names to what is known, predicted and published about these genes, in terms of the function of the gene (product), the protein family or protein domains that the gene codes for, and the signal transduction pathways in which it participates. For model species and other species for which the full genomic sequence is available, repositories exist that combine several sources of information on individual genes (for example, see "www.nature.com/scitable/content/Genomics-Databases-744357" for a list of species-specific repositories [12]). The annotation of genes mostly follows a controlled vocabulary or restricted terminology. For functional annotations, Gene Ontology (GO) is a widely used vocabulary. Gene Ontology describes the genes and their products (e.g., the proteins for which a gene codes) within three main ontology domains: Molecular Function, Biological Process and Cellular Component. Genes can be described at various hierarchical levels using this GO terminology, ranging from broad over-arching themes to very specific descriptions. Descriptions of protein domains are often inferred from sequence similarity to other organisms, for example using the InterPro terminology. Since many proteins are involved in several biological processes or contain more than one functional domain, genes (or gene products) often have different GO annotations across the three GO domains and different IP annotations (Table 1).


**Table 1.** Examples of gene annotations, using the vocabulary of the Gene Ontology (GO) and InterPro (IP). Annotations are provided for three genes that were differentially expressed during the immune response of *Drosophila* after infection by parasites [5]. The GO annotations describe the function and process that have been reported for the protein, and the IP annotations describe the protein domains. Genes that are involved in different processes, or that code for proteins with multiple functional domains, may contain a variety of annotations. Many genes, however, are not fully annotated.


**Figure 3.** Venn diagram of differentially expressed genes in *Drosophila* larvae after infection by a parasitic wasp, and genes that have been previously implicated in defense responses and anti-microbial immune responses. Infection by a parasitic wasp ('macro-parasite') triggers a cellular immune response that is substantially different from general defense responses and the mostly humoral immune response against bacterial and fungal infections ('micro-parasites'). This is reflected both in the relatively large number of known immunity genes that did not change expression after infection with macro-parasites, and in the large number of differentially expressed genes after macro-parasite infection that had not previously been associated with immunity and defense. *Redrawn with permission after* [5]*, first published by BioMed Central.*

When several conditions or time points are included in the experimental design, clustering the genes according to their expression pattern across these conditions or time points allows for identifying groups of genes that responded similarly, and analysing these separately from genes with different behaviour. An enrichment analysis on such groups of genes may identify a common theme for groups with a particular expression profile across the conditions or time points. For example, in our transcriptomics study after infection with macro-parasites, we identified groups of genes with a peak in up-regulated expression at 1-6 hours after infection, at 6-24 hours after infection, and at 24-72 hours after infection, and groups with down-regulated expression either throughout the time course, or at 72 hours after infection (Figure 4). The first group of genes was enriched for immunity genes (clusters 1 and 2), the second group for proteolysis and serine-type endopeptidases (cluster 12), and the last group for puparial adhesion (cluster 9). These patterns can be used to obtain a more detailed profile of the various processes that occur during the response. Additionally, they may serve as a starting point for inferring the functions of unannotated genes. For example, the *Drosophila* genome codes for 201 genes with serine-type endopeptidase activity, which function in development, immunity and various other biological processes. Only 22 of these genes had been functionally annotated with a role in immunity, but unannotated serine-type endopeptidase genes that responded to infection similarly to genes with a functional annotation in immunity or defense could be putatively assigned the same functions [16].
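Grouping genes by the shape of their expression profile can be sketched as follows. This is not the clustering method used in the study; as a deliberately simple stand-in, each gene is assigned to whichever idealized template profile it correlates with best. The templates, sampling points, gene names and fold changes are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Idealized template profiles over five hypothetical sampling points.
templates = {
    "early_peak": [0, 1, 0, 0, 0],
    "mid_peak": [0, 0, 1, 0, 0],
    "late_down": [0, 0, 0, 0, -1],
}

# Hypothetical log2 fold changes (infected versus control) per time point.
profiles = {
    "geneA": [0.1, 2.3, 0.4, 0.2, 0.0],
    "geneB": [0.0, 0.3, 1.8, 0.5, 0.1],
    "geneC": [0.1, 0.0, -0.1, -0.2, -1.5],
}

# Assign each gene to the template with the highest correlation.
groups = {
    gene: max(templates, key=lambda name: pearson(profile, templates[name]))
    for gene, profile in profiles.items()
}
```

Genes that end up in the same group can then be passed together to an enrichment analysis, or used to suggest putative functions for unannotated members of the group.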

The abundance and reliability of annotation information is highly variable among genes and species: some genes are well studied and their annotations are solidly supported by empirical evidence, while other genes are not annotated, are only partially annotated, or have annotations that are based only on unconfirmed computer predictions or non-traceable author statements. Furthermore, for model organisms the functional annotations have accumulated through the studies of many researchers over long periods, while for non-model organisms or new model organisms, there is often only limited detailed knowledge available. Yet, even for these non-model organisms, various resources exist that enable high-level analysis of transcriptomics data based on homologies, such as the Blast2GO suite [13].

Gene lists from transcriptomics experiments are particularly amenable to enrichment analyses of functional annotations. An enrichment of a particular functional annotation implies that it is represented more often among the gene list members than would be expected by chance alone, based on the proportion of the genes in the genome with that annotation. Multiple interfaces and online tools have been developed for this purpose (e.g., DAVID for large gene lists [14], and Catmap for gene lists that are ranked for significance, but without actually applying a significance threshold cutoff [15]). When the conditions or treatments of interest result in a coordinated response in the gene interaction network, the likelihood of finding genes with changed expression that share the same annotation increases. Such enrichments may be informative for identifying different biological processes or protein families that are associated with, or affected by, a response to the condition or treatment of interest. They may also help to identify possible costs or trade-offs that are associated with the response. For example, within the gene list for the response to parasite infection [5], we identified a set of genes involved in puparial adhesion. These genes were expressed at lower levels in the infected larvae at 72 hours after infection, and reflect the delay in development that these larvae incurred by investing energy and resources in the immune response.
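The comparison against chance can be written as a one-sided hypergeometric test, sketched below. This is one common way to test for enrichment and is an assumption here; it is not claimed to be the exact method implemented in DAVID or Catmap, and the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, annotated_in_genome,
                       list_size, annotated_in_list):
    """P(X >= annotated_in_list) under the hypergeometric distribution."""
    total = comb(genome_size, list_size)
    upper = min(annotated_in_genome, list_size)
    # Sum the probabilities of drawing k annotated genes, for every k
    # at least as large as the number observed in the gene list.
    tail = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, list_size - k)
        for k in range(annotated_in_list, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 200 of 14,000 genes carry the annotation,
# and 12 of the 150 genes in the gene list carry it as well.
p_enriched = enrichment_p_value(14_000, 200, 150, 12)
```

With these invented numbers only about two annotated genes would be expected in the list by chance, so observing twelve gives a very small *P*-value. In practice one such test is run per annotation term, so the multiple testing corrections discussed earlier apply here too.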

The list of differentially expressed genes can be compared to other gene lists, which could be derived from other transcriptomics studies, known candidate genes for the process of interest, or any other approach that identified a set of genes associated with a particular condition. Venn diagrams can summarize these gene list comparisons (Figure 3). Reporting how many of the genes were shared with other gene list(s), and how many are unique for each gene list, provides a quick overview of the numbers of genes that may be of particular interest. Sometimes it is the genes that are also present in the other gene list(s) that are of particular interest, for example when multiple sources of evidence are combined or to identify cross-talk between gene interaction networks. Alternatively, one could focus on the unique genes to identify novel candidate genes that had not previously been associated with the process of interest.
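The counts shown in a Venn diagram follow directly from set operations on the gene lists. In this minimal sketch the gene identifiers are placeholders; real lists would come from the differential expression analysis and from an annotation resource.

```python
# Placeholder gene identifiers for two gene lists.
differentially_expressed = {"gene1", "gene2", "gene3", "gene4", "gene5"}
known_immunity_genes = {"gene1", "gene2", "gene6", "gene7"}

# Genes supported by both sources of evidence.
shared = differentially_expressed & known_immunity_genes
# Novel candidates: changed expression, not previously linked to immunity.
novel_candidates = differentially_expressed - known_immunity_genes
# Known immunity genes that did not change expression.
known_but_unchanged = known_immunity_genes - differentially_expressed

venn_counts = {
    "shared": len(shared),
    "only_differentially_expressed": len(novel_candidates),
    "only_known_immunity": len(known_but_unchanged),
}
```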

Blast2GO suite [13].

the immune response.

the process of interest.

process that have been reported for the protein, and the IP annotations describe the protein domains. Genes that are involved in different processes, or coding for proteins with multiple functional domains,

The abundance and reliability of annotation information is highly variable among genes and species: some genes are well studied and annotations are solidly supported by empirical evidence, while other genes are not annotated, only partially annotated or annotations are based only on unconfirmed computer predictions or non-traceable author statements. Furthermore, for model organisms the functional annotations have accumulated by the studies of many researchers over long periods, while for non-model organisms or new model organisms, there is often only limited detailed knowledge available. Yet, even for these non-model organisms, various resources exist that enable high level analysis of transcriptomics data based on homologies, such as, for example, the

Gene lists from transcriptomics experiments are particularly amenable for enrichment analyses of functional annotations. An enrichment of a particular functional annotation implies that it is represented more often among the gene list members than would be expected by chance alone, based on the proportion of the genes in the genome with that annotation. Multiple interfaces and online tools have been developed for this purpose (e.g., DAVID for large gene lists [14] and Catmap for gene lists that are ranked for significance, but without actually applying a significance threshold cutoff [15]). When the conditions or treatments of interest resulted in a coordinated response in the gene interaction network, the likelihood increases of finding genes with changed expression sharing the same annotation. Such enrichments may be informative for identifying different biological processes or protein families that are associated with, or affected by, a response to the condition or treatment of interest. This may also be informative to identify possible costs or trade-offs that are associated with the response. For example, within the gene list for the response to parasite infection [5], we identified a set of genes involved in puparial adhesion. These genes were expressed at lower levels in the infected larvae at 72 hours after infection, and reflect the delay in development these larvae incurred by investing energy and resources in

The list of differentially expressed genes can be compared to other gene lists, which could be derived from other transcriptomics studies, known candidate genes for the process of interest, or any other approach that identified a set of genes associated with a particular condition. Venn diagrams can summarize these gene list comparisons (Figure 3). Reporting how many of the genes were shared with other gene list(s), and how many are unique for each gene list, provides a quick overview of the numbers of genes that may be of particular interest. Sometimes it is the genes that are also present in the other gene list(s) that are of particular interest, for example when multiple sources of evidence are combined or to identify cross-talk between gene interaction networks. Alternatively, one could focus on the unique genes to identify novel candidate genes that had not previously been associated with

may contain a variety of annotations. Many genes, however, are not fully annotated.
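The gene list comparison described above amounts to a few set operations. The following is a minimal sketch in Python; the function name `compare_gene_lists` and the gene identifiers are placeholders invented for illustration and are not taken from [5].

```python
def compare_gene_lists(list_a, list_b):
    """Return the three regions of a two-set Venn diagram as sets."""
    a, b = set(list_a), set(list_b)
    return {
        "shared": a & b,   # present in both lists
        "only_a": a - b,   # unique to the first list
        "only_b": b - a,   # unique to the second list
    }


# Placeholder identifiers, invented for illustration only.
differentially_expressed = ["gene1", "gene2", "gene3", "gene4"]
known_candidates = ["gene3", "gene4", "gene5"]

regions = compare_gene_lists(differentially_expressed, known_candidates)
for name, genes in regions.items():
    print(name, len(genes), sorted(genes))
```

The sizes of the three sets are the numbers that would be written into the regions of a two-circle Venn diagram; the sets themselves are the candidate genes to follow up.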

**Figure 3.** Venn diagram of differentially expressed genes in *Drosophila* larvae after infection by a parasitic wasp, and genes that have been previously implicated in defense responses and anti-microbial immune responses. Infection by a parasitic wasp ('macro-parasite') triggers a cellular immune response that is substantially different from general defense responses and the mostly humoral immune response against bacterial and fungal infections ('micro-parasites'). This is reflected both in the relatively large number of known immunity genes that did not change expression after infection with macro-parasites, and in the large number of differentially expressed genes after macro-parasite infection that had not previously been associated with immunity and defense. *Redrawn with permission after* [5]*, first published by BioMed Central.*

When several conditions or time points are included in the experimental design, clustering the genes according to their expression pattern across these conditions or time points allows groups of genes that responded similarly to be identified and analysed separately from genes with different behaviour. An enrichment analysis on such groups of genes may identify a common theme for groups with a particular expression profile across the conditions or time points. For example, in our transcriptomics study after infection with macro-parasites, we identified groups of genes with a peak in up-regulated expression at 1-6 hours, at 6-24 hours and at 24-72 hours after infection, and groups with down-regulated expression either throughout the time course or at 72 hours after infection (Figure 4). The first group of genes was enriched for immunity genes (clusters 1 and 2), the second group for proteolysis and serine-type endopeptidases (cluster 12), and the last group for puparial adhesion (cluster 9). These patterns can be used to obtain a more detailed profile of the various processes that occur during the response. Additionally, they may serve as a starting point for inferring the functions of unannotated genes. For example, the *Drosophila* genome codes for 201 genes with serine-type endopeptidase activity, which function in development, immunity and various other biological processes. Only 22 of these genes had been functionally annotated with a role in immunity, but unannotated serine-type endopeptidase genes that responded to infection in a similar way to genes with a functional annotation in immunity or defense could be putatively assigned the same functions [16].
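An enrichment test on a gene list or cluster can be sketched as an upper-tail hypergeometric probability: given how many genes in the genome carry an annotation, how likely is it to find at least the observed number in a list of this size by chance? The sketch below uses only the Python standard library. The function name and all numbers in the example call are assumptions made for illustration; they are not results from [5], and tools such as DAVID may apply different statistics.

```python
from math import comb


def enrichment_p_value(genome_size, annotated_in_genome, list_size, annotated_in_list):
    """Upper-tail hypergeometric probability P(X >= annotated_in_list).

    X is the number of annotated genes in a random draw of `list_size`
    genes from a genome of `genome_size` genes, of which
    `annotated_in_genome` carry the annotation.
    """
    max_hits = min(annotated_in_genome, list_size)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(annotated_in_genome, k)
        * comb(genome_size - annotated_in_genome, list_size - k)
        for k in range(annotated_in_list, max_hits + 1)
    )
    return tail / total


# Invented numbers: 200 of 10,000 genes carry the annotation, and 8 of
# them appear in a cluster of 100 genes (2 would be expected by chance).
p = enrichment_p_value(10_000, 200, 100, 8)
print(f"p = {p:.3g}")
```

When many annotations are tested for the same gene list, the resulting p-values would normally need a correction for multiple testing before they are interpreted.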


**Figure 4.** Clustering of genes that show similar expression patterns in *Drosophila* larvae across the 72-hour time course after infection by parasitic wasps. The average expression levels (± standard errors of the mean) for the genes within the clusters (log2 transformed and divided by the median expression level for that gene across all time points) are shown. Dashed red lines represent the gene expression in parasitized larvae and solid blue lines represent the gene expression in control (not parasitized) larvae. *Partially redrawn with permission after* [5]*, first published by BioMed Central.*
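The grouping of genes by expression pattern can be illustrated with a small sketch. It assumes one literal reading of the normalisation in the caption of Figure 4 (log2 transformation, followed by division by the gene's median log2 value), and it uses a simple greedy grouping by Pearson correlation, which is a stand-in and not the clustering method used in [5]. All function names, the threshold and the expression values are invented for illustration.

```python
from math import log2, sqrt
from statistics import mean, median


def normalise(profile):
    """log2-transform a profile, then divide by the gene's median log2 value.

    Fails if the median log2 value is zero (median raw expression of 1).
    """
    logged = [log2(value) for value in profile]
    centre = median(logged)
    return [value / centre for value in logged]


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster whose seed it matches."""
    clusters = []  # each cluster is a list of gene names; the first is the seed
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Invented expression values at five time points.
raw = {
    "geneA": [4, 16, 8, 4, 4],   # early peak
    "geneB": [4, 32, 8, 4, 4],   # early peak, stronger induction
    "geneC": [4, 4, 4, 8, 16],   # late increase
}
profiles = {gene: normalise(values) for gene, values in raw.items()}
clusters = group_by_correlation(profiles)
print(clusters)
```

Each resulting cluster can then be passed to an enrichment analysis to look for a common functional theme.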

In addition to these general approaches for any transcriptomics analysis, regardless of the platform that was used, some additional insights could be gained from using tiling arrays or transcriptome sequencing. Not only the expression level could be determined for each gene, but also alternative isoforms of transcripts, including splice variants and sequence variations (either in the coding regions or in the untranslated regions of the transcripts). In humans, transcriptome sequencing revealed that splicing isoforms from various tissues showed systematic differences, including exon skipping, alternative 3' or 5' splice sites, mutually exclusive exons and alternative first or last exons [17]. New methods allow for the quantification of gene expression levels for the individual isoforms, which can improve the accuracy of expression measures and provide details on the role of the untranslated regions in gene expression regulation [18].

#### **4. Beyond the gene list**


The analyses described so far have centred on querying repositories that contain the functional annotations for genes, to explore what is known about the genes in the gene list and what additional light this may shed on sub-processes, unannotated genes and associated responses. Yet many additional resources and genomic databases are available that may be cross-referenced and combined with the gene list, to obtain additional information on these genes and their interactions. Rather than focussing on individual members of the gene list and what is known about them, these approaches search for emergent properties of the gene list. Especially when the organism that is studied is a model organism for which many sources of additional information are publicly available, there is a large array of possibilities for further analyses.

In addition to searching in specific repositories for functional annotations of genes, the extraction of information on genes and proteins from text documents (e.g., scientific papers) can leap across the boundaries of scientific disciplines. Text mining is the automated extraction of information on proteins or genes from a large literature collection (such as PubMed). It searches for associations between proteins and functional descriptors in the text. These descriptors can be of molecular origin to describe the annotations of the protein (as in the repositories), but also of a physiological, phenotypic or pathological origin to describe the inferences for the organism, or of phylogenomic origin related to the evolution of a gene. Through this additional dimension of information, text mining can help, for instance, to identify associations of the protein with rare mutations that are implicated in diseases, or with protein-protein interactions and regulatory pathways [19]. Text mining differs from a typical literature search in that it does not simply list the hits, but parses the retrieved information according to further specifications (Figure 5). Various tools are available online (see for example www.ebi.ac.uk/Rebholz/resources.html for an overview).
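The core idea of searching for associations between gene names and functional descriptors can be sketched as a sentence-level co-occurrence count. This is a deliberately naive illustration and far simpler than tools such as iHOP; the function name, the gene names and the example sentences are all placeholders invented for this sketch.

```python
import re
from collections import Counter


def cooccurrence_counts(documents, gene_names, descriptors):
    """Count sentences in which a gene name and a descriptor appear together."""
    counts = Counter()
    for text in documents:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            for gene in gene_names:
                if gene.lower() not in lowered:
                    continue
                for term in descriptors:
                    if term.lower() in lowered:
                        counts[(gene, term)] += 1
    return counts


# Invented placeholder text; not taken from any publication.
documents = [
    "GeneA is induced during the immune response. GeneB is required for development.",
    "Loss of GeneA impairs the immune response to infection.",
]
counts = cooccurrence_counts(
    documents,
    gene_names=["GeneA", "GeneB"],
    descriptors=["immune response", "development"],
)
for (gene, term), n in counts.most_common():
    print(gene, term, n)
```

Plain substring matching, as used here, would confuse gene names that are contained in other words; real text mining tools resolve synonyms and ambiguous names before counting associations.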

Physiological responses or the focal tissue of a response to the treatment or condition of interest may also be explored through analysis of the gene list. For some model organisms, a tissue atlas is publicly available that specifies the level of expression for each gene in all tissues and/or developmental stages (e.g., FlyAtlas, Human Atlas Suite and eMouse Atlas). A large fraction of genes in the genome are not expressed homogeneously throughout the body, but show high specificity for particular tissues [21]. Using this information provides a means to screen for tissues that may contribute disproportionately to the response. For example, when the gene list is enriched for genes that are primarily expressed in a particular tissue (e.g., testes, brain, liver or salivary glands), this could indicate that these tissues are most severely affected by, or responding most strongly to, the treatment of interest. Additionally, the atlases have raised awareness of the importance of experimental design in transcriptomics studies: when the transcriptomics responses are localized in a particular (minor) tissue, it is difficult to detect accurate expression differences when the tissue is not studied in isolation. The chances of missing or underestimating the change in gene expression in mixed-tissue comparisons, or inappropriate tissues, are substantial.

**Figure 5.** Example of the output from a text mining tool, iHOP [20], for one of the genes that was differentially expressed in *Drosophila* larvae after parasite infection. The functional annotations for the same gene, TepII, are summarized in Table 1. The text mining tool provided additional information on the evolution of the gene through information on related genes (paralogs) and domains of the gene that show signs of positive selection. *Screenshot retrieved from "iHOP - http://www.ihop-net.org/".*

To gain insight into the regulatory control of the response to the treatment or condition, a screen for *cis*-regulatory elements in the upstream regions of genes with differential expression may reveal transcription factors and/or co-factors that are involved. These *cis*-regulatory elements can consist of Transcription Factor Binding Motifs (TFBMs), promoters, enhancers, silencers and other sequence motifs that regulate the genes [22]. To identify (putative) *cis*-regulatory elements, one could search for known sequence motifs (e.g., TFBMs or promoters) within a specified region upstream of the start codon and in the first intron. Several databases exist (for example, TRANSFAC, RegTransBase and JASPAR) that contain the published TFBMs and promoters. As the binding sites are relatively short (often 4-12, but up to 30, bases long), and not all positions in the sequence interact (strongly) with the transcription factor, some sequence variation in the motif is common. Therefore, the TFBMs are usually provided as positional weight matrices, which describe the relative occurrences of each base for each position. This can be converted into a graphical representation, or sequence logo, where the size and order of the stacked letters (A, C, G, T) represent the relative occurrence of the base at that position (Figure 6). These motifs may be investigated for particular genes of interest to obtain a prediction of the Transcription Factor(s) that regulate their expression.

**Figure 6.** The Transcription Factor Binding Motif for the NF-κB transcription factor *Relish* / *dorsal* of *Drosophila melanogaster*, depicted as sequence logo and Positional Frequency Matrix. The variation that is commonly found in the binding motif for a transcription factor is incorporated by specifying for each position in the motif the frequency at which each base is recorded. The size of the stacked letters for each position represents the relative occurrence of the respective bases at that position.

POSITIONAL FREQUENCY MATRIX

| Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|------|---|---|---|---|---|---|---|---|---|----|
| A | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 0 | 0 | 1 |
| C | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 9 | 8 |
| G | 7 | 7 | 9 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
| T | 1 | 2 | 0 | 1 | 4 | 6 | 9 | 8 | 0 | 0 |
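How a frequency matrix is used to search for a motif can be sketched as follows. The counts are transcribed from the Positional Frequency Matrix of Figure 6; everything else (the function names, the pseudocount of 0.5, the uniform background frequency of 0.25 and the test sequence) is an assumption made for illustration and does not describe how TRANSFAC, JASPAR or any particular scanning program works.

```python
from math import log2

# Counts transcribed from the Positional Frequency Matrix in Figure 6.
COUNTS = {
    "A": [0, 0, 0, 0, 4, 2, 0, 0, 0, 1],
    "C": [1, 0, 0, 0, 1, 1, 0, 1, 9, 8],
    "G": [7, 7, 9, 8, 0, 0, 0, 0, 0, 0],
    "T": [1, 2, 0, 1, 4, 6, 9, 8, 0, 0],
}


def build_pwm(counts, pseudocount=0.5, background=0.25):
    """Convert counts into log2 odds scores against a uniform background."""
    length = len(next(iter(counts.values())))
    pwm = []
    for pos in range(length):
        total = sum(counts[base][pos] for base in counts) + pseudocount * len(counts)
        pwm.append({
            base: log2((counts[base][pos] + pseudocount) / total / background)
            for base in counts
        })
    return pwm


def score(pwm, site):
    """Sum the per-position scores of a site as long as the motif."""
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm, sequence):
    """Return (score, start) of the best-scoring window (A/C/G/T only)."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


pwm = build_pwm(COUNTS)
consensus = "".join(max(column, key=column.get) for column in pwm)

# Hypothetical upstream sequence with one strong site embedded at index 6.
sequence = "ACGTAC" + "GGGGATTTCC" + "TTAGC"
hit_score, hit_start = best_hit(pwm, sequence)
print(consensus, hit_start, round(hit_score, 2))
```

The pseudocount avoids taking the logarithm of zero for bases that were never observed at a position. In a real screen both strands would be scanned, and a score threshold would decide which windows count as putative binding sites.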

Apart from investigating the *cis*-regulatory elements for particular genes of interest, transcriptomics data are also well suited to testing for over-represented *cis*-regulatory elements across (clusters of) co-expressed genes. This approach can identify groups of genes that are possibly co-regulated by the same Transcription Factor(s). Programs have been specifically developed to screen whether certain known motifs occur more often than would be expected by chance (for example, Clover [23]). These programs can also be extended with custom-made libraries of motifs, to include sequence motifs that could contain as yet unidentified *cis*-regulatory elements. These novel motifs could be derived from aligning the upstream sequences of orthologs to identify conserved sequences among related taxa, or through the use of *de novo* motif discovery programs. Alternatively, MotifRegressor searches for any motif that is shared among genes that responded similarly in an expression study [24].


Analyzing the *cis*-regulatory elements in co-expressed genes can be used to start unravelling the genetic architecture of a trait. In our study of the response to parasite infection, we identified seven *cis*-regulatory elements that were over-represented among the differentially expressed genes, using a combination of MotifRegressor and Clover. Three of these motifs were TFBMs for transcription factors that were known to be involved in immune responses (the GATA-factor *serpent*, the NF-κB *Relish*/*dorsal* and the JAK/STAT pathway transcription factor *Stat92E*), while three other, novel motifs were identified that have not yet been associated with any regulatory function. The expression level of the transcription factor *serpent* was not changed after parasitization, which may appear counter-intuitive as its TFBM was over-represented in differentially expressed genes. Analysing the expression patterns of the clusters of co-expressed genes with the enriched TFBMs, however, and linking these to functional annotations for these groups of genes, suggested that this transcription factor was drawn away from its regular functions in development and metabolism (co-regulated genes with lower expression levels), towards the activation of the immune response (co-regulated genes with higher expression levels) [5]. Additionally, we could hypothesize that the novel motifs may also be involved in coordinating the immune response to parasite infection. Using the cisRED database [25] as a first exploration of these novel motifs, two of these motifs were retrieved as predicted regulatory elements in the human genome sequences, including a hit in the upstream region of a known trans-activator of the MHC II complex (ZXDA). Although the functional characterization of the novel motifs is still pending, these examples illustrate the complex genetic interactions that may coordinate the regulation of a trait.

Not only is transcription regulated through regulatory sequences associated with genes; translation into proteins is also partially coordinated by regulatory sequences. A rich world of small non-coding RNA molecules has been discovered since the start of the genomic era, which added a completely new dimension to the regulation of gene interaction networks [26]. One large class of these non-coding RNAs, the microRNAs, bind to the 3' untranslated regions (3' UTRs) of mRNAs, inhibiting their translation by ribosomes and targeting the mRNAs for degradation. Several databases exist that link target genes or sequence motifs in the 3' UTR to specific microRNAs. These tools are accessible through the microRNA database miRBase [27]. Associating microRNAs with the genes in a gene list could be achieved in a manner analogous to the association with transcription factors: either by searching for known microRNA binding motifs within the 3' UTRs of differentially expressed genes, or by searching for any over-represented or conserved motifs in the 3' UTRs among the genes in the gene list and trying to associate those with microRNAs.
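Searching a 3' UTR for a known microRNA binding motif can be sketched as a search for the reverse complement of the microRNA seed. The sketch assumes that the relevant pairing involves microRNA positions 2 to 8, a common simplification in target prediction; the function names, the microRNA sequence and the UTR sequence are placeholders for illustration, and dedicated target prediction tools use considerably richer criteria.

```python
# Complement of each RNA base, written in the DNA alphabet of the UTR.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def seed_site(mirna):
    """Return the DNA-alphabet site complementary to miRNA positions 2-8."""
    seed = mirna.upper()[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_sites(mirna, utr):
    """Return the 0-based start positions of seed matches in a 3' UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("U", "T")
    return [
        i for i in range(len(utr) - len(site) + 1)
        if utr.startswith(site, i)
    ]


# Placeholder sequences, written 5' to 3'.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_utr = "AAACTACCTCAAAGGGCTACCTCTT"

print(seed_site(example_mirna))
print(find_seed_sites(example_mirna, example_utr))
```

Counting such sites across all 3' UTRs in a gene list, and comparing the count with that in a background set of genes, would turn this search into the over-representation analysis described for transcription factor binding motifs.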


Transcription is not the only step that is regulated through sequences associated with genes; translation into proteins is also partially coordinated by regulatory sequences. A rich world of small non-coding RNA molecules has been discovered since the start of the genomic era, which has added a completely new dimension to the regulation of gene interaction networks [26]. One large class of these non-coding RNAs, the microRNAs, binds to the 3' untranslated regions (3' UTRs) of mRNAs, inhibiting their translation by ribosomes and targeting the mRNAs for degradation. Several databases exist that link target genes or sequence motifs in the 3' UTR to specific microRNAs. These tools are accessible through the microRNA database miRBase [27]. Associating microRNAs with the genes in a gene list could be achieved in a manner analogous to the association with transcription factors: either by searching

Another approach to analysing the genetic architecture of a trait is to make use of protein-protein interaction (PPI) network databases. These databases contain known and predicted protein-protein associations, based on experimental approaches (e.g., two-hybrid assays, purification of protein complexes, chromatin immunoprecipitation (ChIP), etc.) and/or computational methods for predicting protein interactions. A large collection of these PPI databases is publicly available (see for example the Jena Protein-Protein Interaction website ppi.fli-leibniz.de/jcb\_ppi\_databases.html for an extensive overview). Several web-based tools can be used to analyse and visualize PPI networks (e.g., STRING [28] and VisANT [29]). Gene lists submitted to these tools are assembled into inter-connected networks of proteins, based on the PPI databases. The submitted proteins, as well as the proteins that they are known (or predicted) to interact with, form the 'nodes' in the network. All connections between any of these proteins (directly, or through an intermediary protein) are depicted by lines or 'edges' (Figure 7). The topology of such a network is characterized, among other things, by the frequency distribution of edges per node, which can reveal whether the network resembles a random assembly of proteins or not [30].
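The frequency distribution of edges per node can be computed directly from a list of pairwise interactions. The following is a minimal sketch, assuming the interactions have already been exported from a PPI resource as pairs of identifiers; the protein names `P1` to `P6` and both function names are placeholders, not real data or part of any tool's API.

```python
from collections import Counter, defaultdict


def build_network(edges):
    """Build an undirected adjacency map {protein: set of partners}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degree_distribution(adjacency):
    """Return {number of edges: number of nodes with that many edges}."""
    degrees = Counter(len(partners) for partners in adjacency.values())
    return dict(sorted(degrees.items()))


# Placeholder interactions between hypothetical proteins.
example_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
                 ("P2", "P3"), ("P5", "P6")]
network = build_network(example_edges)
print(degree_distribution(network))
```

Proteins without any recorded interaction do not appear in the adjacency map, so they would have to be added separately if the zero-degree class matters for the comparison with a random network.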

**Figure 7.** A Protein-Protein Interaction (PPI) network for a subset of the genes involved in the regulation of blood cell proliferation and differentiation in *Drosophila*. The proteins (or 'nodes') are depicted by red or blue circles. The red symbols represent genes with changed expression in a *Drosophila* strain with an increased immunological resistance against parasites [6]. The known PPIs among these proteins are depicted by lines (or 'edges') between nodes, mostly based on two-hybrid data. Some of the proteins are highly interconnected with other modules of proteins (e.g., pnt, bsk), and these genes can be considered 'hubs' or key coordinators of the changes in expression.

Constructing a PPI network for genes that changed expression in a transcriptomics study may reveal modules of genes that are associated through functional processes, or identify key regulators/modulators of the response to the treatment or condition of interest. In contrast to clustering genes by the similarity of their expression patterns across conditions or time points, a PPI network will also group together genes that behaved very differently at the transcriptional level, yet may participate in the same signal transduction cascade. We assembled a PPI network for the genes that changed expression between two *Drosophila* lines from the same genetic background, but differing genetically in their resistance to parasites after only five generations of artificial selection. Approximately a third of the nearly 900 genes with changed expression were inter-connected in several modules through an intricate and non-random PPI network [6]. Some genes within the network had a central position with a high level of interconnectedness, and these genes may function as 'hubs', as they have the potential to influence the activity of a large number of genes. These 'hub' genes, or their regulators, could be hypothesized to provide targets for selection for increased resistance to parasites, by regulating and coordinating a multitude of phenotypic responses.
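A rough impression of modules and hubs can be obtained from an edge list with very little code. The sketch below treats connected components as a crude stand-in for modules and ranks nodes by their number of interaction partners; this is an illustrative simplification and not the procedure used in [6]. The gene names `G1` to `G6` and the function names are hypothetical.

```python
from collections import defaultdict, deque


def connected_modules(edges):
    """Group nodes into connected components using breadth-first search.
    Returns (list of sorted modules, adjacency map)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, modules = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue, module = deque([start]), []
        while queue:
            node = queue.popleft()
            module.append(node)
            for partner in sorted(adjacency[node]):
                if partner not in seen:
                    seen.add(partner)
                    queue.append(partner)
        modules.append(sorted(module))
    return modules, adjacency


def rank_hubs(adjacency, top_n=3):
    """Return the `top_n` nodes with the most partners (ties broken by name)."""
    return sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))[:top_n]


# Placeholder interactions among hypothetical differentially expressed genes.
example_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
                 ("G2", "G3"), ("G5", "G6")]
modules, adjacency = connected_modules(example_edges)
print("modules:", modules)
print("candidate hubs:", rank_hubs(adjacency, top_n=2))
```

Node degree is only one possible measure of centrality; whether a highly connected gene actually coordinates a response remains a hypothesis to be tested experimentally.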



Another aspect of the genetic architecture of a trait is its relation to the genome architecture. The genes in a gene list can be mapped to chromosomal positions to search for chromosomal 'hotspots' of differential expression. Transcriptional activity varies among chromosomal domains or regions, and characterizing these patterns may indicate regulatory mechanisms that act on these genes. For example, some chromosomal domains are highly transcribed due to epigenetic mechanisms (e.g., chromatin architectures) that maintain a high activity state, as is seen for heat-shock genes [31]. Such domains under epigenetic control may be recognized by mapping multiple highly expressed genes, or conversely, a complete lack of expression, to the same chromosomal region. Such genomic domains may evolve at a different rate. For instance, the regions around heat-shock genes are more susceptible to insertion by transposable elements (i.e., mobile DNA sequences that can translocate themselves within the genome) due to their chromatin architecture, which may lead to a faster accumulation of mutations [32]. Furthermore, some chromosomal domains are highly transcribed in particular tissues only, and the gene arrangements within these domains are highly conserved across taxa [33]. Moreover, chromosomal regions show different expression patterns in healthy tissues compared to cancers [34]. These examples indicate that the physical arrangement of genes within the genome may be a target of evolution, likely due to epigenetic and other regulatory mechanisms that control the expression of whole sections of the genome.
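Searching for chromosomal 'hotspots' can be sketched as counting differentially expressed genes in fixed-size windows along each chromosome. The example below is a minimal illustration under the assumption that each gene is represented by a single coordinate; the positions, the window size and the threshold are invented, and a real analysis would also have to account for the local gene density in each window.

```python
from collections import Counter


def window_counts(gene_positions, window_size):
    """Count genes per (chromosome, window index) for non-overlapping windows."""
    counts = Counter()
    for chromosome, position in gene_positions:
        counts[(chromosome, position // window_size)] += 1
    return counts


def hotspots(gene_positions, window_size, min_genes):
    """Return (chromosome, window start, window end, gene count) for every
    window that holds at least `min_genes` genes from the list."""
    counts = window_counts(gene_positions, window_size)
    return sorted(
        (chromosome, index * window_size, (index + 1) * window_size, n)
        for (chromosome, index), n in counts.items()
        if n >= min_genes
    )


# Invented positions of differentially expressed genes (chromosome arm, bp).
example_positions = [("2L", 1_200), ("2L", 4_500), ("2L", 9_800),
                     ("2L", 52_000), ("3R", 7_000)]
print(hotspots(example_positions, window_size=10_000, min_genes=3))
```

Fixed windows can split a cluster that straddles a window boundary; overlapping (sliding) windows are a common refinement.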

Additionally, examining the genomic positions of differentially expressed genes may reveal evolutionary processes that have acted on these genes. Strong selection for a particular allele or genomic variant leaves a detectable pattern in the genome, which may show up as a genomic clustering of genes with changed expression levels. When a particular allele provides a selective advantage to the individual, it may sweep through the population. Any allelic variation that is physically linked to this locus (i.e., resides in the nearby chromosomal domain) would be swept through the population as well. One of the best examples of a strong selective sweep is a mutation in the lactase gene in humans that confers lactose tolerance (lactase persistence) and is highly common among Europeans. Not only has this mutation spread through the European population, but a region spanning approximately a million bases was swept along with it [35]. Such a sweep can also be detected in expression assays. In our own studies, we imposed strong artificial selection for immunological resistance against parasites in *Drosophila*, and mapped the genes with changed expression to the chromosomes. This revealed a part of one chromosome bearing a signature of positive selection [6].

Especially when information is available on sequence variation (i.e., different genotypes or alleles) among the biological samples in the experiment, genome-wide association mapping (GWAS) is another option to start unravelling the genetic architecture of a trait. In this approach, individual variation in sequence is related to variation in expression by statistical modelling. Using a multiple regression approach, the allelic states at various loci (e.g., whether a sample has an A, T, C or G at locus x, an insertion or deletion (indel), or an inversion) are related to the expression level of each gene. The approach can be applied both when the sequence variation is acquired independently, for example through separate genotyping assays on the same samples, and when it is extracted from the more detailed information in tiling arrays or transcriptome sequencing data. It requires large sample sizes to obtain sufficient power and resolution for the statistical modelling, and has been used in a medical context to associate rare mutations with diseases. Causally linking sequence variants to diseases, however, has proved to be daunting [36]. Yet this approach has been useful in obtaining more fundamental knowledge on genome functioning, and on the relative importance of various sequence variants (e.g., copy number variants (CNVs), single nucleotide polymorphisms (SNPs), small insertions and deletions (indels)) for gene expression variation [37].
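The core of relating allelic state to expression can be illustrated with an ordinary least-squares fit for a single locus and a single gene. This is a deliberately simplified sketch: the text describes a multiple regression across many loci, whereas the code below handles one locus at a time and assumes a diploid genotype coded as the number of copies of one allele (0, 1 or 2). The function name and the data are hypothetical.

```python
def regress_expression_on_genotype(dosages, expression):
    """Least-squares fit of expression = intercept + slope * allele dosage.
    Returns (slope, intercept, r_squared)."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matching vectors with at least three samples")

    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("no genotype variation at this locus")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    syy = sum((y - mean_y) ** 2 for y in expression)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Invented data: allele dosage per sample and log-scale expression of one gene.
example_dosages = [0, 0, 1, 1, 2, 2]
example_expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, intercept, r_squared = regress_expression_on_genotype(
    example_dosages, example_expression)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.2f}")
```

Repeating such a fit for every locus and every gene produces a very large number of tests, which is one reason why the large sample sizes and stringent significance thresholds mentioned above are needed.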

#### **5. Conclusions**

Transcriptomics analysis has been hugely popular for exploring the unknown players in a wide range of biological processes, diseases, traits and responses to stimuli. The technique is extremely powerful as a first step to implicate novel genes and pathways that may be involved in or associated with a particular condition. It should be emphasized, however, that a difference in expression *per se* is not sufficient evidence to infer a direct involvement of the gene in the particular process or trait. This is a limitation of the technology, and it urgently requires the development of high-throughput empirical approaches to validate and functionally characterize the large numbers of genes that are putatively of interest. Genome-wide libraries of RNAi stocks to knock down any gene of interest [38] and reference panels of genetic variants with fully sequenced genomes [39] are prime examples of the resources that are needed to follow up on transcriptomics studies. At the same time, the list of genes with potential involvement is certainly not the only information that can be derived from a transcriptomics study. It is especially the information on all the differentially expressed genes, including those that are not directly involved, that provides an exceptional source of information on regulation, correlated responses and the genetic architecture of a trait.

A large number of databases and bioinformatic tools are publicly available to explore and annotate the individual genes on the gene list and, more importantly, to analyse the gene list collectively. The latter provides both additional power and a more comprehensive insight into the mechanisms and genetic architecture of a trait. Most traits, diseases and responses to environmental stimuli are highly complex, with environmental factors and genetic networks of interactions that contribute to the trait, disease or response. The factors and genetic network underlying a trait may be elucidated by a combination of bioinformatics approaches, and the emergent properties of such approaches may be more revealing than the search for individual candidates for a trait or process.

Many of the bioinformatic tools that can be applied for these analyses have been made accessible to the research community through the Bioconductor platform (www.bioconductor.org) [40]. This platform is based primarily on the open-source R programming language and runs on all operating systems. A good introduction to this versatile bioinformatic environment has been made available by the Girke lab at the University of California, Riverside through a combination of online manuals (http://manuals.bioinformatics.ucr.edu/). Other freely available, online suites for the analysis of transcriptomics data include Babelomics (http://www.babelomics.org) [41] and Galaxy (http://galaxy.psu.edu/, especially for transcriptome sequencing) [42-44].

The latest developments in high-throughput sequencing are opening up new possibilities for the analysis of transcriptomics data. More detailed characterization of transcripts is achievable, and the power of transcriptomics analysis can now also be fully harnessed for organisms without a sequenced genome. Many of the approaches that have been developed for transcriptomics data from microarrays are equally applicable to data from transcriptome sequencing. In that sense, the knowledge base on transcriptomics analysis that has accumulated in the research community over the past decade will largely remain a valuable resource. The experience and expertise that have been developed in dealing with the limitations and possibilities of analysing microarray data will also be of use while exploring the specific limitations and opportunities that are associated with this new platform. Robust and accurate methods need to be developed rapidly for the pre-processing, normalization and analysis of transcriptome sequencing data. This will ensure that the full potential of this new technology can be made accessible to the wider research community.

#### **Author details**

Bregje Wertheim
*Evolutionary Genetics, Centre for Ecological and Evolutionary Studies, University of Groningen, Groningen, The Netherlands*

#### **Acknowledgement**

I thank Eric Blanc and Eugene Schuster for their advice and our valuable discussions on the various bioinformatics approaches in gene expression studies. BW was supported by funding from the Netherlands Organization for Scientific Research (NWO) (Vidi grant no. 864.08.008).

#### **6. References**

[1] Hey Y, Pepper SD. (2009) Interesting times for microarray expression profiling. Brief Funct.Genomic Proteomic. 8(3):170-173.

[2] Mockler TC, Chan S, Sundaresan A, Chen H, Jacobsen SE, Ecker JR. (2005) Applications of DNA tiling arrays for whole-genome analysis. Genomics. 85(1):1-15.

[3] Wang Z, Gerstein M, Snyder M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat.Rev.Genet. 10(1):57-63.

[4] Ozsolak F, Milos PM. (2011) RNA sequencing: advances, challenges and opportunities. Nat.Rev.Genet. 12(2):87-98.

[5] Wertheim B, Kraaijeveld AR, Schuster E, Blanc E, Hopkins M, Pletcher SD, Strand MR, Partridge L, Godfray HCJ. (2005) Genome-Wide Gene Expression in Response to Parasitoid Attack in *Drosophila*. Genome Biology. 11(6):R94.

[6] Wertheim B, Kraaijeveld AR, Hopkins MG, Walther Boer M, Godfray HC. (2011) Functional genomics of the evolution of increased resistance to parasitism in Drosophila. Mol.Ecol. 20(5):932-949.

[7] Lemaitre B, Hoffmann J. (2007) The host defense of *Drosophila melanogaster*. Annual Review of Immunology. 25:697-743.

[8] Allison DB, Cui X, Page GP, Sabripour M. (2006) Microarray data analysis: from disarray to consolidation and consensus. Nat.Rev.Genet. 7(1):55-65.

[9] Hansen KD, Irizarry RA, Wu Z. (2012) Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 13(2):204-216.

[10] Tsai CA, Wang SJ, Chen DT, Chen JJ. (2005) Sample size for gene expression microarray experiments. Bioinformatics. 21(8):1502-1508.

[11] Storey JD, Tibshirani R. (2003) Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences USA. 100(16):9440-9445.

[12] Lathe W, Williams J, Mangan M, Karolchik D. (2008) Genomic Data Resources: Challenges and Promises. Nature Education. 1(3).

[13] Gotz S, Garcia-Gomez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talon M, Dopazo J, Conesa A. (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res. 36(10):3420-3435.

[14] Huang da W, Sherman BT, Lempicki RA. (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1):1-13.

[15] Breslin T, Eden P, Krogh M. (2004) Comparing functional annotation analyses with Catmap. BMC Bioinformatics. 5:193.

[16] Shah PK, Tripathi LP, Jensen LJ, Gahnim M, Mason C, Furlong EE, Rodrigues V, White KP, Bork P, Sowdhamini R. (2008) Enhanced function annotations for Drosophila serine proteases: A case study for systematic annotation of multi-member gene families. Gene. 407(1-2):199.

[17] Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature. 456(7221):470-476.

[18] Haas BJ, Zody MC. (2010) Advancing RNA-Seq analysis. Nat.Biotechnol. 28(5):421-423.

[19] Krallinger M, Valencia A, Hirschman L. (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 9 Suppl 2:S8.

[20] Hoffmann R, Valencia A. (2004) A gene network for navigating the literature. Nat.Genet. 36(7):664-664.

[21] Chintapalli VR, Wang J, Dow JAT. (2007) Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nat.Genet. 39(6):715.

[22] Maston GA, Evans SK, Green MR. (2006) Transcriptional regulatory elements in the human genome. Annu.Rev.Genomics Hum.Genet. 7:29-59.

[23] Frith MC, Fu Y, Yu L, Chen J-, Hansen U, Weng Z. (2004) Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res. 32(4):1372-1381.

[24] Conlon EM, Liu XS, Lieb JD, Liu JS. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. PNAS. 100(6):3339-3344.

[25] Robertson G, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith OL, Zhang X, Pan Y, Hassel M, Sleumer MC, Pan W, Pleasance ED, Chuang M, Hao H, Li YY, Robertson N, Fjell C, Li B, Montgomery SB, Astakhova T, Zhou J, Sander J, Siddiqui AS, Jones SJ. (2006) cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res. 34(Database issue):D68-73.

[26] Zamore P, Haley B. (2005) Ribo-gnome: The big world of small RNAs. Science. 309(5740):1519-1524.

[27] Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ. (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res. 36(Database issue):D154-8.

[28] Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C. (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39(Database issue):D561-8.

[29] Hu Z, Hung JH, Wang Y, Chang YC, Huang CL, Huyck M, DeLisi C. (2009) VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37(Web Server issue):W115-21.

[30] Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. (2000) The large-scale organization of metabolic networks. Nature. 407(6804):651-654.

[31] Farkas G, Leibovitch BA, Elgin SC. (2000) Chromatin organization and transcriptional control of gene expression in Drosophila. Gene. 253(2):117-136.

[32] Walser JC, Chen B, Feder ME. (2006) Heat-shock promoters: targets for evolution by P transposable elements in Drosophila. PLoS Genet. 2(10):e165.

[33] Yamashita T, Honda M, Takatori H, Nishino R, Hoshino N, Kaneko S. (2004) Genomewide transcriptome mapping analysis identifies organ-specific gene expression patterns along human chromosomes. Genomics. 84(5):867-875.

[34] Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, van Sluis P, Hermus MC, van Asperen R, Boon K, Voute PA, Heisterkamp S, van Kampen A, Versteeg R. (2001) The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science. 291(5507):1289-1292.

[35] Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN. (2004) Genetic signatures of strong recent positive selection at the lactase gene. Am.J.Hum.Genet. 74(6):1111-1120.

[36] Marian AJ. (2012) Molecular genetic studies of complex phenotypes. Translational Research. 159(2):64-79.

[37] Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavaré S, Deloukas P, Hurles ME, Dermitzakis ET. (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 315(5813):848-53.

[38] Dietzl G, Chen D, Schnorrer F, Su KC, Barinova Y, Fellner M, Gasser B, Kinsey K, Oppel S, Scheiblauer S, Couto A, Marra V, Keleman K, Dickson BJ. (2007) A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature. 448(7150):151-156.

[39] Mackay TF, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, Casillas S, Han Y, Magwire MM, Cridland JM, Richardson MF, Anholt RR, Barron M, Bess C, Blankenburg KP, Carbone MA, Castellano D, Chaboub L, Duncan L, Harris Z, Javaid M, Jayaseelan JC, Jhangiani SN, Jordan KW, Lara F, Lawrence F, Lee SL, Librado P, Linheiro RS, Lyman RF, Mackey AJ, Munidasa M, Muzny DM, Nazareth L, Newsham I, Perales L, Pu LL, Qu C, Ramia M, Reid JG, Rollmann SM, Rozas J, Saada N, Turlapati L, Worley KC, Wu YQ, Yamamoto A, Zhu Y, Bergman CM, Thornton KR, Mittelman D, Gibbs RA. (2012) The Drosophila melanogaster Genetic Reference Panel. Nature. 482(7384):173-178.

[40] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 5:R80.

[41] Medina I, Carbonell J, Pulido L, Madeira SC, Goetz S, Conesa A, Tarraga J, Pascual-Montano A, Nogales-Cadenas R, Santoyo J, Garcia F, Marba M, Montaner D, Dopazo J. (2010) Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling. Nucleic Acids Res. 38:W210-W213.

[42] Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. (2010) Galaxy: a web-based genome analysis tool for experimentalists. Current protocols in molecular biology / edited by Frederick M.Ausubel ...[et al.]. Chapter 19:21.

**Chapter 2** 

© 2012 Ricke and Mascher, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **The REACT Suite: A Software Toolkit for Microbial** *RE***gulon** *A***nnotation and** *C***omparative** *T***ranscriptomics**

Peter Ricke and Thorsten Mascher

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48040

#### **1. Introduction**

The 'age of omics' has provided a wealth of genomic and transcriptomic information that is readily available in public databases. In September 2011, GOLD (the Genomes OnLine Database)<sup>1</sup> (Liolios *et al.*, 2010) listed more than 2,600 finished microbial genome sequences and more than twice this number of ongoing and incomplete genome projects, not counting the plethora of metagenome projects, which provide even larger sequence compilations. Comparable numbers of datasets can be retrieved from the two major microarray databases, the Stanford Microarray Database (SMD)<sup>2</sup> (Demeter *et al.*, 2007) and the Gene Expression Omnibus (GEO)<sup>3</sup> (Barrett *et al.*, 2011) hosted by the National Center for Biotechnology Information (NCBI), which in September 2011 together provided over 9,000 bacterial microarray datasets to the public.

This enormous amount of data provides a treasure chest of information ready to explore. In recent years, a number of powerful comparative genomics databases, such as GenoList<sup>4</sup> (Lechat *et al.*, 2008) or MicrobesOnline<sup>5</sup> (Dehal *et al.*, 2010), have provided the community with toolkits to make use of this information.

Combining microarray data with genomic information is a particularly powerful approach for identifying and predicting regulons, which are regulatory units consisting of a number of genes or operons under the control of specific transcription factors. Such studies require the identification of co-expressed genes (indicative of co-regulation) from in-depth comparative transcriptome profiling, combined with genomic information, including operon structure, genomic context conservation and the presence of specific regulator binding sites.

<sup>1</sup> URL for the GOLD database: http://genomesonline.org

<sup>2</sup> URL for the Stanford Microarray database: http://smd.stanford.edu

<sup>3</sup> URL for the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/

<sup>4</sup> URL for GenoList: http://genolist.pasteur.fr/

<sup>5</sup> URL for the MicrobesOnline database: http://www.microbesonline.org

The major problem in combining genomic with transcriptomic data to extract meaningful regulon information is the lack of defined standard formats and software interfaces that allow a direct transfer of datasets derived from transcriptome analyses to comparative genomics databases and vice versa. The REACT suite was developed to facilitate such combinations of the analysis steps outlined above in one intuitive and user-friendly environment. Transcriptome datasets from different sources can be integrated into REACT via a sophisticated import interface and are stored, together with the cognate genomic information, in a MySQL database. This database, together with the central part of the software toolkit and all interlinked third-party tools, runs on a central computer, the "REACT-server", which actually performs the analyses. It is accessed by the user interface ("REACT-client") via the Internet or an intranet. The user works solely with the client program, which can be installed on the personal computers or laptops of various users. While the installation of the REACT-server demands some technical knowledge, the client can be run easily on computers with a Java runtime environment.

Taken together, the REACT suite provides users with a simple-to-use but powerful bioinformatics environment to perform regulon annotation and comparative genomics analyses based on microarray data and genome sequences. Both server and client software of the REACT suite are freely available from the corresponding author.

#### **2. The basic concept of REACT**

REACT was developed to enable users to perform the various steps of expression and regulon analyses in a quick and intuitive manner. Tools are no longer separate entities demanding different and often incompatible data formats, but can rather be regarded as parts of a comprehensive, fully integrated unit. Data from a wide range of sources can be collected and analysed together. When working with REACT, the user has access to the various representations of the data as well as to the analysis tools via so-called "views" that are intuitively interlinked to enable an interactive flow of both data and analyses:

The "*GeneView*" displays gene-centric information including DNA- and amino acid sequences, links to a number of external databases, as well as the genomic context of a gene in the form of a simple genome browser.

The "*RegulonView*" lists all genes controlled by the same regulator as well as binding motif(s), individual binding sites, alternative promoters etc., based both on the information stored in curated public regulon databases and data added by REACT users.

The "*ArrayView*" allows both importing new microarray datasets and performing data analysis on existing datasets. REACT has a sophisticated interface for the import of array data in nearly every tabulated data format, from individual proprietary formats up to GEO/SMD datasets. Data analysis includes one- and two-dimensional scatter plot analyses of signal or ratio values, as well as gene and array clustering with various hierarchical clustering methods, distance methods and correction algorithms (normalization, gene-centering, log-transformation) applied to all or only selected (collected) subsets of the data.

The "*MotifView*" contains the information of all sequence motifs of known or putative regulator binding sites collected in the current REACT database. Moreover, it enables users to perform MEME analyses to discover new regulatory elements in the upstream regions of selected annotated genes or operons and MAST analyses of previously computed or imported motifs against pre-compiled upstream sequence datasets.
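Two of the correction steps named for the *ArrayView*, log-transformation and gene-centering, can be illustrated with a minimal Python sketch. This is not REACT's implementation: the function names, the gene names and the ratio values are hypothetical, and missing measurements are assumed to be represented as `None`.

```python
import math

def log2_transform(ratios):
    """Convert expression ratios to log2 scale; missing or non-positive values become None."""
    return [None if r is None or r <= 0 else math.log2(r) for r in ratios]

def gene_center(values):
    """Subtract the gene's mean across all arrays, ignoring missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mean = sum(present) / len(present)
    return [None if v is None else v - mean for v in values]

# Hypothetical ratio table: gene name -> expression ratios across three arrays
ratios = {
    "geneA": [2.0, 4.0, 8.0],
    "geneB": [1.0, None, 0.25],
}

# Log-transform first, then centre each gene around its own mean
centered = {gene: gene_center(log2_transform(r)) for gene, r in ratios.items()}
```

After centering, each gene's values express the deviation from its own average, which makes expression profiles of genes with very different absolute ratios comparable before clustering.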

The concept of REACT includes an in-depth integration of the different views via links, enabling users to switch easily between different aspects of the data. Most views are flexible and can be extended with additional data fields to accommodate additional external links, allowing more individualized views and analyses of the data.

Moreover, wherever gene or array data are displayed, the user can easily collect them, thereby creating a data subset available as input for all other implemented analyses. During the various analysis steps, these collections can be continuously changed and expanded, again by selecting single genes and arrays or whole groups of them, such as groups of genes clustering together within a scatter plot analysis. All collected or "marked" arrays and genes are displayed throughout the various views of REACT in the form of sortable lists. The items of these lists act as internal links to the corresponding detailed *Array*- or *GeneViews*. Current collections can be saved and opened again for later use, so that the user can easily switch between different datasets at any time. In addition, the sequences of the selected genes can be exported into external FASTA files.
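As an illustration of the FASTA export of a gene collection, the following Python sketch writes identifier and sequence pairs in FASTA layout. The function name, the line width of 60 characters and the example sequence are assumptions made for this sketch, not details of REACT.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Wrap the sequence into lines of at most `width` characters
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"

# Hypothetical collection: the gene number is taken from the text, the sequence is made up
collection = [("BSU29130", "ATGGCA")]
fasta_text = to_fasta(collection, width=4)
```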

The implemented REACT databases are organism-specific. In its current version, REACT contains two databases for the model bacteria *Escherichia coli* and *Bacillus subtilis*, but could also be extended to other microbial species. Each REACT database is based on the detailed genomic data of the model organism, which will be described in the following paragraphs, as well as on an extendable collection of microarray data for this organism. Moreover, basic genomic information on related organisms, the so-called "reference organisms", is also integrated and can be included in some of the analyses. The list of reference genomes can easily be extended to keep a given database up to date with the progress in genome sequencing.

#### **3. Description of the individual views of the REACT suite**

The information stored in REACT databases can be accessed via so-called views that display the data, allow their selection and provide functional links between different types of data for their interactive analysis. In the following sections, we will describe the major views of REACT, to provide an overview of their features.

#### **3.1. The** *GeneView*

The "*GeneView*" bundles all available information about individual genes (Fig. 1). At the top of the page, the gene identifiers are displayed according to the existing genomic nomenclature. The first identifier is the gene name, e.g. "*icd*" in the case of the *B. subtilis* isocitrate dehydrogenase. If more than one gene name exists for a given gene, the nomenclature applied by REACT is derived from the genome annotation stored in the NCBI genome database<sup>6</sup>. Here, as in other views of the REACT suite, active features working as internal links are highlighted by red letters (Fig. 1). In the case of gene names, a double click brings the user to the corresponding *GeneView*, while a single click marks the gene (i.e. adds it to the gene collection in the left panel) for further analyses.


**Figure 1.** The *GeneView*. Exemplary screenshot for the gene *icd*, encoding the isocitrate dehydrogenase. See text for details.

In addition to the name, each gene has a unique gene-ID or gene number, which consists of an abbreviation for the organism and a number of the gene (based on the chromosomal position). For example, the identifier of the *B. subtilis* isocitrate dehydrogenase is "BSU29130" (Fig. 1). Gene names and numbers are the major identifiers that are used throughout the several displays of the REACT suite. The gene numbers cannot be modified by the users to ensure the integrity of the database. The putative or known functions of the encoded proteins are shown below the gene name, including synonyms and alternative descriptions (if present).
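The structure of such gene numbers can be sketched in a few lines of Python. The regular expression below assumes that an identifier is a run of letters (the organism abbreviation) followed by a run of digits, as in "BSU29130"; the other identifiers in the example are purely illustrative.

```python
import re

# Assumed layout: organism abbreviation (letters) followed by the gene number (digits)
_GENE_ID = re.compile(r"^([A-Za-z]+)(\d+)$")

def split_gene_id(gene_id):
    """Split a gene number such as 'BSU29130' into ('BSU', 29130)."""
    match = _GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"unexpected gene identifier: {gene_id!r}")
    prefix, number = match.groups()
    return prefix, int(number)

# Because the number reflects the chromosomal position, sorting by it
# yields the genes in chromosomal order (illustrative identifiers)
ids = ["BSU29130", "BSU00010", "BSU15500"]
ordered = sorted(ids, key=split_gene_id)
```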

The central part of the *GeneView* consists of the external links and descriptive data fields. For each genome database implemented in the REACT suite, links to important public databases are already predefined for each gene. These include links to the COGs (Clusters of Orthologous Groups) database<sup>7</sup> (Tatusov *et al.*, 1997), hosted again by the NCBI, the Enzyme Nomenclature database<sup>8</sup> (Bairoch, 2000), the already mentioned MicrobesOnline comparative genomics database, the NCBI Protein database<sup>9</sup>, the Protein Data Bank PDB<sup>10</sup>, a collection of protein structures and structure-related information (Rose *et al.*, 2011), and the Pfam<sup>11</sup> (Finn *et al.*, 2010), Prosite<sup>12</sup> (Sigrist *et al.*, 2010) and SMART<sup>13</sup> (Letunic *et al.*, 2009) databases, all of which are dedicated to the definition, maintenance and easy identification of protein domains and families. Further predefined links include a link to PubMed and to Google. In addition to these general sites, the *GeneView* page also links to organism-specific databases and genome resources, such as BSORF<sup>14</sup> or SubtiList<sup>15</sup> in the case of *B. subtilis*.

<sup>6</sup> URL for the NCBI genome database: http://www.ncbi.nlm.nih.gov/sites/genome

<sup>7</sup> URL for the COGs database: http://www.ncbi.nlm.nih.gov/COG

For all of the above, the links in the REACT databases are gene-specific and directly connect the user with the cognate gene- or protein-specific page of the external database. Depending on the type of the external database and the information available for the displayed gene, zero to many external hits will be provided as links. If no such specific database identifier exists, as in the case of Google's search engine, a gene-related term (e.g. the gene name) is used as the link parameter. REACT is highly adjustable to the needs of individual users. Hence, the external links are not limited to those pre-implemented in the existing REACT databases for *E. coli* and *B. subtilis* (see section 5.3, "Modifying REACT: the administrator mode", for details).
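The link logic described above (zero to many identifier-based links per database, and a gene-related term where no identifier scheme exists) can be sketched as follows. The database names, the URL templates on `example.org` and the identifiers are all hypothetical placeholders; REACT's actual link definitions are not reproduced here.

```python
from urllib.parse import quote

# Hypothetical link definitions: URL template plus a flag stating whether
# the database is addressed by its own identifiers (True) or by a gene term (False)
LINK_TEMPLATES = {
    "StructureDB": ("https://structures.example.org/entry/{value}", True),
    "WebSearch": ("https://search.example.org/?q={value}", False),
}

def build_links(gene_name, identifiers):
    """Return a mapping of database name -> list of gene-specific URLs."""
    links = {}
    for database, (template, needs_identifier) in LINK_TEMPLATES.items():
        if needs_identifier:
            # Zero to many hits, depending on what is known for this gene
            values = identifiers.get(database, [])
        else:
            # No database-specific identifier: fall back to the gene name
            values = [gene_name]
        links[database] = [template.format(value=quote(v)) for v in values]
    return links

links = build_links("icd", {"StructureDB": ["X1", "X2"]})
```

Keeping the templates in a plain mapping mirrors the idea that additional external links can be added without changing the code that builds them.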

In addition to the links and data fields, the *GeneView* also displays the DNA and amino acid sequences of the current gene, which are linked to the BLASTn and BLASTp tools<sup>16</sup> at NCBI. The user is therefore able to search directly for similar sequences in the public domain. Moreover, a genome browser is implemented at the bottom of the *GeneView* for a quick glance at the genomic environment of the current gene (Fig. 1). The gene icons are coloured according to the COG functional classes assigned to each gene and serve as links to the corresponding *GeneViews*.

Two additional functions are available in the *GeneView*. First, the user can retrieve the upstream genomic region of the gene via a specific dialog box, based on user-provided information, such as upstream region length, inclusion of start codon, or choice between the upstream region of the current gene or the first gene of its operon. The latter function is very useful for collecting upstream regions for motif searches (see section 3.5). Second, expression data of the active gene can directly be retrieved from the REACT microarray database. For this, the user can choose either all or only selected microarray datasets, and limit the set of extracted values by a certain threshold expression ratio level. The *GeneView* is therefore not only the central platform for all gene-centric data, but is also directly linked to all other views described in the following sections.
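The upstream-region retrieval can be illustrated with a minimal Python sketch. It assumes a linear genome string, 0-based half-open gene coordinates on the forward strand and a three-base start codon; the function and parameter names are hypothetical and circular chromosomes are not handled. To use the first gene of an operon instead of the current gene, the caller would simply pass that gene's coordinates.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

def upstream_region(genome, start, end, strand, length, include_start_codon=False):
    """Extract the region upstream of a gene, oriented like the gene.

    genome: forward-strand sequence; start/end: 0-based, half-open gene
    coordinates on the forward strand; strand: '+' or '-'.
    """
    extra = 3 if include_start_codon else 0
    if strand == "+":
        low = max(0, start - length)
        high = min(len(genome), start + extra)
        return genome[low:high]
    if strand == "-":
        # For reverse-strand genes, upstream lies beyond the gene's end coordinate
        low = max(0, end - extra)
        high = min(len(genome), end + length)
        return reverse_complement(genome[low:high])
    raise ValueError("strand must be '+' or '-'")

# Made-up toy genome for illustration
genome = "AAACCCATGGGGTTT"
forward = upstream_region(genome, 6, 12, "+", 3)
forward_with_codon = upstream_region(genome, 6, 12, "+", 3, include_start_codon=True)
reverse = upstream_region(genome, 0, 6, "-", 3)
```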


<sup>8</sup> URL for the ENZYME database: http://enzyme.expasy.org/enzyme\_ref.html

<sup>9</sup> URL for the NCBI Protein database: http://www.ncbi.nlm.nih.gov/protein

<sup>10</sup> URL for PDB: http://www.rcsb.org/

<sup>11</sup> URL for the Pfam database: http://pfam.sanger.ac.uk/

<sup>12</sup> URL for the Prosite database: http://prosite.expasy.org/

<sup>13</sup> URL for the SMART database: http://smart.embl-heidelberg.de/

<sup>14</sup> URL for the BSORF site: http://bacillus.genome.ad.jp/

<sup>15</sup> URL for SubtiList: http://genolist.pasteur.fr/SubtiList/

<sup>16</sup> URL for BLASTn and BLASTp: http://blast.ncbi.nlm.nih.gov/Blast.cgi

#### **3.2. The** *OperonView*

Operons are transcriptional units consisting of two or more neighbouring genes that are co-expressed. If a gene has been assigned to an operon and annotated accordingly in the REACT database, a link from the *GeneView* leads to the *OperonView*. Both views are organized in a similar fashion, and the *OperonView* also contains a genome browser. This browser is identical to that of the *GeneView*, except that the current operon is highlighted.

#### **3.2. The** *OperonView*

Operons are transcriptional units consisting of two or more neighbouring genes that are co-expressed. If a gene has been assigned to an operon and annotated accordingly in the REACT database, a link from the *GeneView* leads to the *OperonView*. Both views are organized in a similar fashion and the *OperonView* also contains a genome browser. It is identical to the *GeneView*'s with the exception that here the current operon is highlighted.

The operon identifier is again immutable, since it is used by REACT as internal reference. The operon name by default consists of the concatenated names of the genes within this operon. When displayed outside of the *OperonView*, it functions as an internal link, enabling the user to jump directly to the corresponding *OperonView*. From within the *OperonView*, it can be used as a link to an external database containing additional information regarding this operon.

In addition to providing a direct link to all corresponding *GeneView* pages, the *OperonView* also provides a list of, and links to, all regulons to which the current operon belongs. They are represented by their REACT-internal regulon identifier, the name of the corresponding transcription factor and a brief description of the regulon (see the following section for details).

#### **3.3. The** *RegulonView*

The next higher level of genetic units is the regulon, which consists of a number of genes or operons under the direct control of a specific transcription factor. Regulons are displayed within the REACT suite in the *RegulonView*, which is subdivided into two panels. The first panel ("All regulons") contains a tabulated list of all regulons currently defined in the REACT database, which includes the most important information such as the regulon-ID, the main description of the regulon and the associated transcription factor. The second panel ("Act. regulon") displays the detailed view of a specific regulon selected from the first list. This view is organized similarly to the *GeneView* or *OperonView*. It contains the regulon identifier, a link to the corresponding transcription factor (if known) and the sequence motif of its cognate DNA-binding site, which is found upstream of the regulated operons or genes comprising this regulon.

The regulon-ID is derived from the gene-ID of the corresponding transcription factor and marked by the extension "\_R". It is implemented as an active link that directly connects to an external regulon database. In the case of *B. subtilis*, this is primarily DBTBS<sup>17</sup>, the database of transcriptional regulation in *B. subtilis* (Sierro *et al.*, 2008), but BSORF or SubtiList have also been used for the initial regulon annotation. For *E. coli*, the regulon information has been extracted from RegulonDB<sup>18</sup> (Gama-Castro *et al.*, 2011). Additional regulon definitions can be added at any time, including putative regulons with only rudimentary information, such as sets of co-expressed genes. If the regulator DNA-binding site is known and defined, it will also be displayed as a so-called SequenceLogo (see section 3.5 for details). Below the general regulon-associated information and, if available, the sequence motif overview, additional data fields and external links are displayed.

<sup>17</sup> URL for the DBTBS database: http://dbtbs.hgc.jp/

<sup>18</sup> URL for RegulonDB: http://regulondb.ccg.unam.mx/

The central part of the *RegulonView* provides individual information on all associated operons and genes, including the name, the first gene in case of operons, the corresponding factor and the position and sequence of the putative binding site in front of the regulated transcriptional units. This information enables the user to get a quick first impression of the regulated genes of this regulon.

#### **3.4. The** *ArrayView*

All three views explained so far are highly similar to one another and strongly integrated, not only regarding the information provided but also in the way the user can navigate from one view to the next. They all provide a gene-centric view on the REACT data and invariably rely on the genomic sequence as a reference. Regulons consist of operons, which are made up of individual genes with a defined position on the chromosome. The same is true for regulator binding sites.

In contrast, the *ArrayView* provides access to the second central data pool stored within the REACT database, the microarray datasets. Array data exist in a great variety of different formats. In particular, datasets from the early years of transcriptome studies are often available only in the form of simple tables or Excel spreadsheets without any defined data format, distributed over numerous journal homepages or webpages of individual research groups, which makes their integration into comparative transcriptome analyses very difficult. As a result, public databases, such as the already mentioned GEO or SMD, have been developed for the storage and description of microarray datasets that comply with the MIAME (Minimal Information About a Microarray Experiment)<sup>19</sup> standard (Brazma *et al.*, 2001). Unfortunately, these databases still contain only a fraction of the published microarray datasets. The biggest challenge for a comparative transcriptome database is therefore to organize and import microarray data from diverse sources into a compatible format.

#### *3.4.1. Organizing microarray data in the REACT suite*

A complete microarray dataset contains at least three types of information: (i) a list of all genes represented by a given DNA microarray, which is linked to the corresponding expression values, expressed either as (ii) raw fluorescence values for the reference and experimental condition, or as (iii) the respective ratio (or fold-change) between the two conditions. Within the REACT suite, such a data collection is called an "Array". Obviously, the Array is only useful if additional descriptive information (meta-information) is available. This can be a short description of the specific experimental set-up or a link to where this information is stored. Often, a group of array datasets are related to each other and described in a single format, e.g. as a result of one experiment. This is reflected by the REACT data format "Array Set".

<sup>19</sup> URL for the description of the MIAME standard: http://www.mged.org/Workgroups/MIAME/miame.html
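The relationship between an Array and an Array Set can be pictured with a small data model. The following Python sketch is purely illustrative: the class and field names (`Measurement`, `Array`, `ArraySet`) and the example values are assumptions made for this example and do not describe the internal REACT database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Measurement:
    """Expression values for one gene on one array (hypothetical layout)."""
    signal: Optional[float] = None   # raw fluorescence, experimental condition
    control: Optional[float] = None  # raw fluorescence, reference condition
    ratio: Optional[float] = None    # fold-change, signal / control

    def fold_change(self) -> Optional[float]:
        # Prefer a ratio supplied with the data; otherwise derive it.
        if self.ratio is not None:
            return self.ratio
        if self.signal is not None and self.control:
            return self.signal / self.control
        return None


@dataclass
class Array:
    array_id: str
    description: str  # meta-information, or a link to where it is stored
    values: Dict[str, Measurement] = field(default_factory=dict)


@dataclass
class ArraySet:
    set_id: str
    description: str
    arrays: List[Array] = field(default_factory=list)


# Hypothetical example data: one array with two genes, grouped into one set.
heat_shock = Array("A1", "hypothetical heat shock, 10 min")
heat_shock.values["geneA"] = Measurement(signal=800.0, control=200.0)
heat_shock.values["geneB"] = Measurement(ratio=0.5)
experiment = ArraySet("S1", "hypothetical heat shock series", [heat_shock])
```

In this sketch a gene may carry raw values, a ratio, or both, which mirrors the two alternative ways of expressing expression values described above.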

The *ArrayView* is split into four sub-views. The first sub-view, "All Arrays", contains a tabulated view of all array sets of the current REACT database. If one array set is selected, all arrays of this set are displayed below the upper window, again in tabulated format. Both tables contain some basic meta-information on the array set or the array, respectively.

Selection of one array or array set leads to the next sub-view, "Act. Array", which provides the detailed information, including the ID, name, a description of the underlying experiment, the source of the data, available literature, and external links. The "Array Set" sub-view lists all individual arrays within the set, which can be marked separately for further analysis. The most important feature of this sub-view is a tabulated, sortable list of all genes for which data are available within this array. It contains information on the gene name, the signal value, the control value, the ratio of signal to control, the number of replicates that were combined, and the arithmetic mean and error of the values. These data are normally derived directly from the original datasets. Two additional columns indicate which genes are currently marked and whether their value can be trusted. The trust value is a simple way to allow users to flag single values as untrusted, thereby automatically excluding them from subsequent analyses. Trust values can easily be set for marked genes within the current sub-view.

The data table is sortable based on any column, e.g. high or low signals or ratio values. Genes of interest can be collected as "marked genes" for inclusion into follow-up analyses. Each gene-specific data row of the table functions as an internal link to the corresponding *GeneView*, thereby providing a direct connection between the array-centric data of the *ArrayView* and the gene-centric data of the *Gene/Operon/RegulonViews*.

An additional feature of the *ArrayView* is the "Similar gene" function. For each array displayed in the dialog, the user can define ratio thresholds similar to the ratio of the current gene. REACT then automatically retrieves a list of all genes fulfilling the user-defined criteria. These genes can then be marked for subsequent analyses. The "Similar gene" function therefore provides a simple but efficient and direct way to find genes with similar expression characteristics in the available microarray database.
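The idea behind such a threshold-based retrieval can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the REACT implementation: the function name `similar_genes`, the nested dictionary layout and all ratio values are hypothetical, and a gene is assumed to qualify only if it lies inside the chosen window on every selected array.

```python
from typing import Dict, List, Tuple

# Hypothetical ratios: array name -> {gene name -> ratio of signal to control}
ratios: Dict[str, Dict[str, float]] = {
    "array1": {"geneA": 4.0, "geneB": 3.5, "geneC": 0.9, "geneD": 4.4},
    "array2": {"geneA": 0.5, "geneB": 0.6, "geneC": 0.5, "geneD": 2.0},
}


def similar_genes(
    ratios: Dict[str, Dict[str, float]],
    thresholds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return genes whose ratio lies within [low, high] on every listed array."""
    all_genes = set().union(*(r.keys() for r in ratios.values()))
    hits = []
    for gene in sorted(all_genes):
        within = all(
            gene in ratios[array] and low <= ratios[array][gene] <= high
            for array, (low, high) in thresholds.items()
        )
        if within:
            hits.append(gene)
    return hits


# Windows chosen around the ratios of the current gene, geneA (4.0 and 0.5).
windows = {"array1": (3.0, 5.0), "array2": (0.4, 0.7)}
print(similar_genes(ratios, windows))
```

With these example values, geneB shares the expression pattern of geneA on both arrays, whereas geneD matches only on the first array and is therefore not returned.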

#### *3.4.2. Importing microarray data into the REACT database*

As already mentioned, one of the major problems in comparative transcriptome analyses is the lack of a mandatory gold standard for array datasets, especially from the early, pre-MIAME era. But even ten years after this standard was introduced, the problem is still far from being solved, and the number of microarray datasets not complying with these standards is still rising (Brazma, 2009).

Even implementing the minimum amount of information needed to integrate an array dataset into the REACT database (a two-column table, with one column containing the gene identifiers and the second containing either the signal values or expression ratios between signal and control) can be daunting. Gene identifiers are often not used consistently (as synonyms exist), and the DNA microarray might not contain all genes or might contain duplicates of some. Likewise, signals can be represented as raw fluorescence values, either as mean or average values, in which case control values need to be provided or defined. Alternatively, a table might provide the ratio of signal to control, which can be expressed either as log-values or as fold-changes. To facilitate the handling and import of such diverse types of data, the REACT suite contains an easy-to-use microarray import interface (Fig. 2).

During import, microarray data in any tabulated format are initially pre-loaded into the REACT import panel. REACT automatically detects the number of columns in the file and generates an adequate number of numbered preview columns for easy identification. After semi-automated discrimination of commentary lines, the appropriate type of information has to be assigned to each column. REACT needs at least one column containing the gene identifiers and one column for the signal or ratio values. Other types of information can also be assigned, such as the signal background, the control value, and the control background. Based on the assignment, REACT 'knows' what to do with the individual data, e.g. if background columns are specified, their values will be subtracted from the corresponding signal or control values. Ratios between signal and control values can either be directly imported or will be calculated, depending on the data provided. It is even possible to import data with only a single column containing the signal values (e.g. during time course experiments). In a later step, one of the imported arrays (e.g. time 0) can then be used as a standard control for all datasets to calculate the ratio values needed for most analyses. Large datasets containing many replicates of one experiment can be imported in a single table. In this case, REACT offers the possibility to average the sets of columns assigned for signal, control or ratio values.
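The column-based logic described above (background subtraction, averaging of replicate columns and calculation of the ratio) can be illustrated with a short Python sketch. It is a simplified model under stated assumptions rather than the actual REACT import code: the role labels (`signal`, `signal_bg`, `control`, `control_bg`), the function `import_row` and the example numbers are hypothetical, and replicates are assumed to be averaged before the ratio is formed.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical role assignment: one entry per column of the imported table.
# Two replicates, each with signal, control and their background columns.
ROLES = ["gene",
         "signal", "signal_bg", "control", "control_bg",
         "signal", "signal_bg", "control", "control_bg"]

ROWS = [
    ["geneA", "900", "100", "250", "50", "700", "100", "250", "50"],
    ["geneB", "300", "100", "500", "100", "300", "100", "700", "100"],
]


def import_row(row: List[str], roles: List[str]) -> Dict[str, object]:
    """Background-correct, average replicates and derive the ratio for one row."""
    columns: Dict[str, List[str]] = {}
    for role, cell in zip(roles, row):
        columns.setdefault(role, []).append(cell)

    def corrected(role: str) -> List[float]:
        values = [float(v) for v in columns.get(role, [])]
        backgrounds = [float(v) for v in columns.get(role + "_bg", [])]
        if backgrounds:  # subtract only if background columns were assigned
            values = [v - b for v, b in zip(values, backgrounds)]
        return values

    signals, controls = corrected("signal"), corrected("control")
    signal = mean(signals)
    control: Optional[float] = mean(controls) if controls else None
    # Without a control (e.g. a time course) the ratio is left open for now.
    ratio = signal / control if control else None
    return {"gene": columns["gene"][0], "signal": signal, "control": control,
            "ratio": ratio, "replicates": len(signals)}


table = {rec["gene"]: rec for rec in (import_row(r, ROLES) for r in ROWS)}
print(table["geneA"]["ratio"], table["geneB"]["ratio"])
```

A row with only a gene identifier and a single signal column is accepted as well; its ratio stays undefined until a reference array is chosen in a later step.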


**Figure 2.** The microarray import interface of the *ArrayView*. See text for details.

If large numbers of different experiments are stored in a single table, they can be parsed at once using the "batch" import. The user defines the different ratio columns, and each column will be treated as a separate array within a common array set. Moreover, it is possible to define whether or not the ratio data are in logarithmic format (in which case they will be converted to internal non-logarithmic values).
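As a rough illustration of the batch import, the following Python sketch turns each ratio column of a table into its own array and converts logarithmic ratios back to plain fold-changes. The function name `batch_import`, the column names and the values are hypothetical, and a base of 2 is assumed for the logarithm because the text does not specify one.

```python
from typing import Dict, List

header = ["gene", "t5", "t10", "t20"]  # hypothetical ratio columns
rows = [
    ["geneA", "1.0", "2.0", "3.0"],
    ["geneB", "-1.0", "0.0", "1.0"],
]


def batch_import(
    header: List[str],
    rows: List[List[str]],
    is_log: bool = True,
    base: float = 2.0,  # assumption: log2 ratios
) -> Dict[str, Dict[str, float]]:
    """Turn every ratio column into its own array: name -> {gene -> ratio}."""
    arrays: Dict[str, Dict[str, float]] = {name: {} for name in header[1:]}
    for row in rows:
        gene = row[0]
        for name, cell in zip(header[1:], row[1:]):
            value = float(cell)
            # Logarithmic ratios are converted back to plain fold-changes.
            arrays[name][gene] = base ** value if is_log else value
    return arrays


array_set = batch_import(header, rows)
print(array_set["t10"])
```

Here the three columns become three arrays of one common array set, and a log ratio of 2.0 is stored as a four-fold change.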

One major challenge when comparing data from different sources, and hence formats, is dealing with variations and differences in the gene identifiers used in different microarray templates. REACT knows a large number of different gene descriptions, as mentioned in section 3.1. During data import, REACT will accept any of these names and synonyms. But if unknown identifiers occur or synonyms have been assigned more than once in a microarray dataset (e.g. in case of different probes representing a single gene), REACT will ask the user for a specific decision. The user can then skip/delete the line, manually assign a gene name, or add the new synonym to the database for future use.
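The identifier check can be thought of as a lookup against a synonym table, with every unresolved or repeated entry set aside for a user decision. The Python sketch below illustrates this idea only; the synonym table, the gene names and the function `resolve` are hypothetical and do not reflect how REACT stores or matches identifiers.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table: every known name maps to one canonical gene ID.
SYNONYMS: Dict[str, str] = {
    "g0001": "G0001", "abca": "G0001",
    "g0002": "G0002", "abcb": "G0002",
}


def resolve(
    identifiers: List[str], synonyms: Dict[str, str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Map identifiers to canonical IDs; collect cases needing a user decision."""
    resolved: Dict[str, str] = {}                # identifier in file -> gene ID
    needs_decision: List[Tuple[str, str]] = []   # (identifier, reason)
    first_seen: Dict[str, str] = {}              # gene ID -> first identifier
    for raw in identifiers:
        gene_id = synonyms.get(raw.strip().lower())
        if gene_id is None:
            needs_decision.append((raw, "unknown identifier"))
        elif gene_id in first_seen:
            needs_decision.append((raw, f"duplicate of {first_seen[gene_id]}"))
        else:
            first_seen[gene_id] = raw
            resolved[raw] = gene_id
    return resolved, needs_decision


resolved, pending = resolve(["abcA", "G0002", "xyz9", "G0001"], SYNONYMS)
print(resolved)
print(pending)
```

Each entry in `pending` corresponds to a case in which the user would be asked to skip the line, assign a gene name manually, or register a new synonym.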

Taken together, REACT should be able to import virtually all formats of array data, as long as they are tabulated. For more complex datasets, such as those obtained from GEO, special parsing options for the corresponding meta-information are available in REACT.

#### **3.5. The** *MotifView*

The *MotifView* is divided into five sub-sections, three of which (the "Upstream"-panel, the "Act. Motif" and the "MotifTable") are used for displaying the data and will be described here. The remaining two – MEME- and MAST-panel – are interfaces for the eponymous external analysis tools and will be discussed in more detail later (section 4.4).

The "Upstream"-panel is used to collect and display DNA regions upstream of coding sequences. Mostly, these will be intergenic regions, which are of particular interest since they contain both (alternative) promoters and putative DNA-binding sequences of transcriptional regulators. The possibility to retrieve and manage such upstream regions is therefore of crucial importance in the context of regulon analyses. Upstream regions can be added to the "Upstream"-table by one of three means: (i) collectively from the active list of marked genes, (ii) individually by gene name, or (iii) directly from within the *GeneView* for the corresponding gene. In all cases, the user can define parameters for the retrieval, such as the sequence length, inclusion of sequence up to the upstream stop codon, or exclusion of sequences of upstream genes in case they overlap with the selected upstream sequence length.

The "Upstream"-table displays all upstream regions collected by the user in the course of an analysis by any of the three methods described above. For each upstream region, the ID and name of the corresponding gene, and the sequence and position of the respective region in the genome are displayed. These regions (or subsets thereof) can easily be removed or added, exported as FASTA-formatted sequence files or selected for further analyses, such as the MEME/MAST analyses (see section 4.4).
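How an upstream region of a given length might be cut from a genome sequence, optionally stopping at the neighbouring gene, is sketched below in Python. This is an illustrative example only: the `Gene` record, the functions `upstream` and `to_fasta`, the 0-based coordinates and the toy genome are assumptions, and genes that overlap the start of the target gene are simply ignored in this simplified version.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    strand: str = "+"


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def upstream(genome: str, genes: List[Gene], target: Gene,
             length: int, exclude_genes: bool = True) -> str:
    """Return up to `length` bases upstream of `target`, read 5' to 3'."""
    if target.strand == "+":
        lo, hi = max(0, target.start - length), target.start
        if exclude_genes:
            for g in genes:
                if g != target and g.end <= hi:
                    lo = max(lo, g.end)   # stop at the end of an upstream gene
        return genome[lo:hi]
    lo, hi = target.end, min(len(genome), target.end + length)
    if exclude_genes:
        for g in genes:
            if g != target and g.start >= lo:
                hi = min(hi, g.start)     # stop at the start of the next gene
    return revcomp(genome[lo:hi])


def to_fasta(gene: Gene, seq: str) -> str:
    return f">{gene.name}_upstream\n{seq}\n"


# Toy genome of 30 bases with two hypothetical genes on the forward strand.
genome = "AAAAACCCCCGGGGGTTTTTACGTACGTAC"
gene_a, gene_b = Gene("geneA", 0, 8), Gene("geneB", 15, 25)
region = upstream(genome, [gene_a, gene_b], gene_b, length=10)
print(to_fasta(gene_b, region), end="")
```

With exclusion switched on, the requested ten bases are shortened to the seven bases that lie between the two genes; with `exclude_genes=False` the full ten bases are returned.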

In the context of regulon analyses performed within the REACT suite, motifs are defined as short stretches of nucleotide sequence that are conserved in a collection of upstream regions, derived e.g. from co-expressed genes. They are expressed as so-called position-specific scoring matrices (PSSMs, also known as Position Weight Matrices, PWM) or regular expressions (RE), which both describe the probability for specific bases to occur at a specific position of the motif. Such matrices are graphically displayed as so-called "SequenceLogos", in which the height of the letters representing the four bases is a measure for the degree of conservation at any given position within the motif.
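To make the link between a set of aligned binding sites, a position-specific matrix and the letter heights of a sequence logo more concrete, the following Python sketch computes base frequencies per position and the corresponding information content in bits. It uses the common definition of logo height (2 bits minus the Shannon entropy of the column, without small-sample correction); the example sites and function names are hypothetical and are not taken from REACT.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical aligned binding sites of equal length.
sites = ["TTGACA", "TTGACT", "TTGATA", "TAGACA"]


def position_frequencies(sites: List[str]) -> List[Dict[str, float]]:
    """Relative frequency of each base at each position of the alignment."""
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: counts.get(b, 0) / len(sites) for b in "ACGT"})
    return matrix


def information_content(column: Dict[str, float]) -> float:
    """Conservation of one position in bits: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def letter_heights(column: Dict[str, float]) -> Dict[str, float]:
    """Height of each letter in the logo: frequency times information content."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}


matrix = position_frequencies(sites)
for position, column in enumerate(matrix, start=1):
    print(position, round(information_content(column), 3))
```

A fully conserved position, such as the first T in this example, reaches the maximum of 2 bits, whereas positions with several alternative bases yield shorter letter stacks.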

In REACT, defined motifs of known regulator-binding sites are stored in the "MotifTable". In this table, each motif is represented by the REACT-internal ID, the name of the motif (normally equivalent to the name of its cognate regulator), the motif length, the associated regular expression or PSSM, as well as the corresponding SequenceLogo. Selection of a motif opens the "Act. Motif"-panel, which provides all available information on one motif, including the name of the regulon it is associated with. This regulon name serves as an internal link to the corresponding page within the *RegulonView*. Moreover, a multiple sequence alignment of all (upstream) sequences underlying this motif is shown (if available), which can be exported in FASTA format.

#### **4. Search options and analysis tools within the REACT suite**

So far, this book chapter has described the major views that represent and display the data stored within the organism-specific REACT databases. In the following sections, we will describe the tools that allow the user to search the database and analyse genes, motifs, and microarray datasets in order to extract and define regulons. These tools include a search engine, an internal BLAST tool, cluster analysis and scatter plot tools for microarray datasets, as well as the MEME/MAST algorithms to identify and search for regulator binding sites in upstream regions of co-expressed genes.

#### **4.1. The** *Search* **tool**

The wealth of information stored in the REACT databases requires search tools to find specific data sets. The REACT Search tool contains four panels, enabling the user to search for genes, regulons, arrays and array-sets, respectively. These panels share the same general structure and differ only in minor features. The common features will be described for the gene search panel (Fig. 3).

Genes of interest can be searched by all gene-specific data fields, e.g. by gene-ID, name, synonyms, function or comments, but also by any other user-defined field. These fields can be searched using a number of search modes, such as <containing>, <being equal to> or <starting with> a certain term. After the search, the hits are displayed in tabulated form in a result window below the search panel, where they can be marked or used as internal links to the respective views. Consecutive searches can be combined by <add>, <remove>, <keep> results or <negate> operations, thus enabling even more sophisticated searches.

The search functions introduced so far are available in all four search panels. For genes and arrays, an additional function allows searching marked genes or arrays, respectively. Moreover, genes can also be successively searched by COG categories and COG terms.
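The combination of consecutive searches can be thought of as set operations on the current result list. The following minimal Python sketch illustrates this idea. The gene records, the function names `search` and `combine`, and the mapping of <add>, <remove>, <keep> and <negate> onto union, difference, intersection and complement are assumptions made for illustration; they do not describe the actual implementation in REACT.

```python
# Hypothetical gene records; IDs, names and functions are illustrative only.
GENES = {
    "g1": {"name": "sigA", "function": "sigma factor"},
    "g2": {"name": "sigB", "function": "general stress sigma factor"},
    "g3": {"name": "liaR", "function": "response regulator"},
    "g4": {"name": "liaS", "function": "sensor kinase"},
}


def search(field, mode, term):
    """Return the IDs of all genes whose field matches the term."""
    tests = {
        "containing": lambda value: term in value,
        "equal": lambda value: value == term,
        "starting_with": lambda value: value.startswith(term),
    }
    return {gid for gid, rec in GENES.items() if tests[mode](rec[field])}


def combine(current, hits, operation):
    """Combine the current result set with new hits (assumed semantics)."""
    if operation == "add":       # union of both result sets
        return current | hits
    if operation == "remove":    # drop the new hits from the current results
        return current - hits
    if operation == "keep":      # keep only genes found in both searches
        return current & hits
    if operation == "negate":    # all genes that are not among the new hits
        return set(GENES) - hits
    raise ValueError(f"unknown operation: {operation}")


results = search("name", "starting_with", "sig")
results = combine(results, search("name", "starting_with", "lia"), "add")
results = combine(results, search("function", "containing", "kinase"), "remove")
print(sorted(results))
```

In this toy example, the two name searches are merged first and the kinase gene is then removed from the combined result.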



**Figure 3.** The *Search* tool of REACT, exemplified by the search panel for genes.

#### **4.2. The internal** *BLAST* **tool**

Within the REACT suite, BLAST analyses (Altschul *et al.*, 1990) can be performed in two different ways. First, it can be performed from within the *GeneView* via a direct external link to the NCBI BLAST server (see section 3.1). Second, REACT also provides an internal BLAST search, which allows comparing a gene of interest with the internal reference genomes of the corresponding REACT database. This internal search, which can be accessed by the corresponding BLAST panel, allows retrieving not only the homologous gene or protein sequences, but also the corresponding upstream regions for further analyses, such as MEME/MAST (see section 4.4).

Both external (pasted into the input window) and internal (derived from the gene/protein displayed in the current *GeneView*) sequences can be used as query, either as DNA or protein sequence. After choosing the appropriate BLAST algorithm and the sequence data to be analysed, the results are displayed in tabulated form in the corresponding panel. For each match, both gene-specific information (ID, name, function, organism) and BLAST-specific values (E-value, per cent identity, match length, number of mismatches/insertions/deletions) are displayed. Moreover, the genomic context is illustrated in a genome browser.

For each match, the DNA or amino acid sequence can be retrieved. Moreover, REACT also provides access to the corresponding upstream region via the "Retrieve upstream" function. The corresponding sequences will then be added to the "Upstream"-panel of the *MotifView* as described above (see section 3.5).

#### **4.3. Microarray analysis tools**

As mentioned before, the REACT suite is based on organism-specific databases that contain two types of data. The gene-centric data is derived from public genome sequence information and accessible through the *Gene-, Operon-, Regulon-,* and *MotifView*, while the array-centric data is displayed in the *ArrayView*. Two different types of tools have been implemented in the REACT suite in order to analyse this second type of data: (i) scatter plot analyses (4.3.1) allow the comparison of up to two experimental conditions, while (ii) cluster analyses (4.3.2) are used to extract expression values from multi-array comparisons.

#### *4.3.1. The scatterplot tool*


A scatter plot is a graphical way to project values for two variables of a data set into a two-dimensional grid, thereby placing similar samples in the same regions of the grid. The data is displayed as a collection of points, each having the value of one variable determining the position on the *x* axis and the value of the other variable determining the position on the *y* axis (Utts, 2005). A scatter plot is a very useful tool to identify similarities and differences in large, comparable datasets that agree in large parts with each other. The more the two data sets agree, the more the scatter tends to concentrate in the vicinity of a so-called identity line, where *y* = *x*.

Within REACT, scatter plot analyses are normally used to display genes according to their expression data of two selected arrays, using the expression value of the first array as *x* coordinate and the values of the other as *y* coordinate. This representation of the data results in an interactive panel where genes with similar expression patterns are grouped together.

In most cases, the vast majority of analysed genes should show the same expression values/ratios under both conditions and will therefore be placed closely together on the *x* = *y* line. In contrast, genes that differ significantly in their behaviour between the two conditions will appear as outliers and can therefore be easily identified in the plot. Of course, comparisons of array datasets from different research groups tend to deviate more or less significantly from this ideal situation. Hence, the differences in experimental conditions need to be kept in mind when comparing array data sets.
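The idea of identifying outliers by their position relative to the identity line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names, the expression values, the function names and the fixed distance threshold are hypothetical, and REACT itself lets the user pick outliers interactively from the plot rather than by a numerical cut-off.

```python
import math


def identity_line_distance(x, y):
    """Perpendicular distance of the point (x, y) from the line y = x."""
    return abs(y - x) / math.sqrt(2)


def find_outliers(array_a, array_b, threshold):
    """Return genes whose point lies further than `threshold` from y = x."""
    shared = array_a.keys() & array_b.keys()  # genes present on both arrays
    return sorted(
        gene
        for gene in shared
        if identity_line_distance(array_a[gene], array_b[gene]) > threshold
    )


# Hypothetical log2 ratios of four genes on two arrays.
array_a = {"geneA": 0.1, "geneB": -0.2, "geneC": 3.0, "geneD": 0.0}
array_b = {"geneA": 0.2, "geneB": -0.1, "geneC": 0.1, "geneD": -2.5}

outliers = find_outliers(array_a, array_b, threshold=1.0)
print(outliers)
```

Genes that respond in only one of the two conditions (here `geneC` and `geneD`) lie far from the identity line, whereas the remaining genes stay close to it.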

Scatter plot analyses can be performed using either signal or ratio array values. This makes it possible to compare the behaviour of genes in the presence of different stimuli (ratio data), but also to compare different time points from one time course experiment (signal data). Such comparisons of expression data from two different microarrays are called two-dimensional scatter plots (see Fig. 4 for an example).

But the user can also compare the data of one array against itself, using the same signal or ratio values as coordinates for the *x* and *y* axes. As this results in the placement of all genes on one line (the identity line), it is called a one-dimensional scatter plot. Such an analysis can be helpful to verify that a group of related genes (e.g. from one operon) behaves in a similar fashion within one experiment.

The input (expression data) for both types of analyses can either be log-transformed or normalized for the arrays or for the genes (array- and gene-centering, respectively). Moreover, the data can be filtered to remove "untrusted" genes prior to the analysis. Here, REACT removes all genes previously flagged as untrusted and unreliable (either automatically during the import or later by the user) in one or both array datasets.
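The pre-processing options described above can be illustrated with a short Python sketch. It assumes that centering means subtracting the mean of a row (gene) or column (array); whether REACT uses the mean, the median or another statistic is not stated here, and all function names and values are hypothetical.

```python
import math
from statistics import mean


def drop_untrusted(matrix, untrusted):
    """Remove all genes that were flagged as untrusted."""
    return {gene: vals for gene, vals in matrix.items() if gene not in untrusted}


def log2_transform(matrix):
    """Replace every ratio by its base-2 logarithm."""
    return {gene: [math.log2(v) for v in vals] for gene, vals in matrix.items()}


def centre_genes(matrix):
    """Gene-centering: subtract each gene's mean across all arrays."""
    return {gene: [v - mean(vals) for v in vals] for gene, vals in matrix.items()}


def centre_arrays(matrix):
    """Array-centering: subtract each array's mean across all genes."""
    genes = list(matrix)
    n_arrays = len(matrix[genes[0]])
    col_means = [mean(matrix[g][j] for g in genes) for j in range(n_arrays)]
    return {g: [matrix[g][j] - col_means[j] for j in range(n_arrays)] for g in genes}


# Hypothetical expression ratios: one row per gene, one column per array.
ratios = {"geneA": [1.0, 4.0], "geneB": [2.0, 2.0], "geneC": [8.0, 0.5]}
untrusted = {"geneC"}

logged = log2_transform(drop_untrusted(ratios, untrusted))
gene_centred = centre_genes(logged)
array_centred = centre_arrays(logged)
print(gene_centred)
print(array_centred)
```

Filtering is applied first, so that untrusted values do not influence the means used for centering.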

The major advantage over using external standard scatter plot tools is the deep integration of the REACT scatter plots with the REACT database. Without pre-selection of genes, the analysis will be carried out with the complete microarray data sets. Genes that specifically respond to only one of the two conditions will appear as outliers and can then be easily selected directly from the plot and thereby added to the list of marked genes for further analyses within REACT. This deep integration and direct connection of array-centric results with gene-centric information is one of the major strengths of the REACT suite, which enables the user to efficiently analyse even complex datasets.


**Figure 4.** Example of a two-dimensional scatter plot analysis with labelled genes. The inset on the right shows the parameter window for choosing the settings for a scatter plot analysis.

But scatter plots can also be performed on only a small group of genes collected in previous analyses, thereby enabling the user to focus on a relevant subset of the data. The second approach is for example useful if these genes are known or suspected to belong to one regulon, in which case they should show a similar behaviour under various conditions. Two-dimensional scatter plots provide an easy way to test this hypothesis, since currently marked genes can be labelled in the plot and thereby easily visualized (Fig. 4).

Images of the scatter plots can be directly retrieved. For presentation or publication purposes, individual genes can be labelled with their names, or specific symbols can be assigned to groups of genes, in order to distinguish them.

#### *4.3.2. The cluster analysis (HeatMap) tool*

To perform more sophisticated expression analyses of multiple microarray datasets, the hierarchical clustering functions of the Cluster 3.0 Software (de Hoon *et al.*, 2004), an enhanced version of the Cluster Software<sup>20</sup>, were integrated into REACT. This analysis assigns sets of genes into groups (the so-called clusters), so that the behaviour of the genes from within the same cluster is more similar to each other than to those in other clusters. Clustering is based on calculating a distance measure, which determines the similarity of two elements. During this calculation, the often *n*-dimensional data sets are reduced to their respective distances (one distance for each pair of objects). This less complex data set is then used as input for the final clustering. The corresponding algorithms achieving this differ significantly in their notion of what constitutes a cluster and in their efficiency of finding them.

<sup>20</sup> URL for the source code of the Cluster software: http://rana.lbl.gov/EisenSoftwareSource.htm


The hierarchical cluster analyses embedded in the REACT suite provide a way to compare the expression behaviour of genes over multiple microarray datasets but also, if needed, to group and cluster arrays. The result is a two-dimensional, colour-coded matrix (or grid) in which each row represents one gene, while each column corresponds to one array dataset. Rows and/or columns are sorted according to their overall distances, and this clustering is illustrated by flanking distance trees, in which the length of the branches serves as a measure for similarity: The shorter the branches, the higher their similarity (Fig. 5).
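To make the principle of hierarchical clustering more concrete, the following Python sketch repeatedly merges the two closest clusters and records the distance at which each merge happens; these distances correspond to the branch lengths of the distance tree. It is a deliberately simple illustration using Euclidean distance and average linkage. It is not the code of Cluster 3.0, and the gene names and expression profiles are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [
        euclidean(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b
    ]
    return sum(dists) / len(dists)


def hierarchical_clustering(profiles):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [(gene,) for gene in profiles]  # start with one cluster per gene
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical log ratios of four genes on three arrays.
profiles = {
    "geneA": [2.0, 2.1, -1.0],
    "geneB": [1.9, 2.0, -1.1],
    "geneC": [-1.5, -1.4, 0.9],
    "geneD": [-1.8, -1.5, 1.0],
}

merges = hierarchical_clustering(profiles)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

Genes with similar profiles are joined early at a small distance (short branches), while the two dissimilar groups are only joined in the last step at a much larger distance.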

**Figure 5.** Example of a cluster analysis performed with the *HeatMap* tool. The inset on the right shows the parameter window for choosing the settings for a cluster analysis.

The complexity of the data is not lost, as all ratio or signal values for each gene within all arrays are visualized by the colour of the individual cells within the heat map grid. When ratio values are displayed, green colour indicates an increase (positive ratio value) and red a decrease (negative ratio value) of the expression in comparison to the control condition of the array, while the intensity of the colour is an indicator for the magnitude of change (Fig. 5). Signal values are coloured according to their percentage from the lowest and highest measured value within the array.
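The colour coding described above can be sketched as a simple mapping from values to colours. The saturation limit `max_abs`, the use of log ratios, the grey scale for signal values and both function names are assumptions for illustration; the exact colour scale used by REACT is not specified here.

```python
def ratio_to_rgb(log_ratio, max_abs=3.0):
    """Map a log ratio to (R, G, B): green for increase, red for decrease."""
    intensity = min(abs(log_ratio) / max_abs, 1.0)  # saturate at max_abs
    level = round(255 * intensity)
    if log_ratio > 0:
        return (0, level, 0)
    if log_ratio < 0:
        return (level, 0, 0)
    return (0, 0, 0)  # unchanged expression stays black


def signal_to_level(value, lowest, highest):
    """Scale a signal to 0-255 by its position between lowest and highest."""
    if highest == lowest:
        return 0
    return round(255 * (value - lowest) / (highest - lowest))


print(ratio_to_rgb(3.0), ratio_to_rgb(-6.0), ratio_to_rgb(0.0))
print(signal_to_level(100.0, 0.0, 100.0))
```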

To run a cluster analysis, the user first has to decide which genes and microarray datasets are to be included. Again, the active list of marked genes/arrays can directly be applied. Since REACT only serves as an interface to the Cluster 3.0 software package, its panel mimics the original input fields, with some modifications (inset to Fig. 5). The choice of parameters includes: (i) clustering of only genes, only arrays, or both; (ii) use of ratio or signal values; (iii) log-transformation of the data; (iv) removal of "untrusted" data (see above). Distance measures such as Euclidean distance, Kendall's tau, Pearson correlation or Spearman's rank correlation are available for both gene- and array-clustering. Moreover, genes and arrays can again be normalized as well as centered, as described for the scatter plot analysis above. For the final linkage, the user can choose between Pairwise Single, Pairwise Complete, Pairwise Centroid and Pairwise Average Linkage clustering methods<sup>21</sup>.
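The choice of the distance measure has a strong influence on the clustering result. The following Python sketch compares three of the measures named above on hypothetical profiles (Kendall's tau is omitted for brevity). Turning a correlation coefficient *r* into a distance as 1 - *r* is a common convention and is assumed here; the function names are illustrative and the code is not taken from Cluster 3.0.

```python
import math


def euclidean(a, b):
    """Euclidean distance: sensitive to the magnitude of the values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient of two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Rank values from 1 to n; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]    # same shape as gene_x, twice the amplitude
gene_z = [1.0, 4.0, 9.0, 16.0]   # monotonic, but not linear in gene_x

print(euclidean(gene_x, gene_y))       # large despite identical shape
print(1 - pearson(gene_x, gene_y))     # correlation distance close to 0
print(1 - pearson(gene_x, gene_z))     # small, but larger than 0
print(1 - spearman(gene_x, gene_z))    # close to 0: identical rank order
```

Correlation-based measures group genes by the shape of their expression profile, whereas the Euclidean distance also separates genes that differ only in amplitude.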


The results of the cluster analysis are displayed in the *HeatMap* window (Fig. 5). This view is vertically split into two subpanels. The left panel displays the complete heat map, including the distance trees, while the right subpanel displays selected areas in more detail, including gene-IDs, names and descriptions. The heat map is interactive and selecting one row will directly open the corresponding *GeneView*. Marked genes are highlighted in red in the heat map.

To further analyse a certain gene cluster, it can directly be selected from the flanking distance trees, which are also interactive: Selecting any branch will mark the corresponding rows or columns. Intersections of selected rows and columns can be obtained and selected parts of the heat map can be displayed in higher resolution in the right subpanel of the *HeatMap* window, as described above.

The content of each subpanel can be exported both as an image file (different file formats can be chosen), as well as in tabulated form. Cluster results can also be stored and reloaded again, e.g. to enable the user to compare the clustering of specific groups of genes between different analyses.

#### *4.3.3. The "Show regulons of marked genes" function*

Co-expression – and therefore co-clustering – of groups of genes is a strong indication that they presumably belong to one regulon, i.e. are under the direct control of a common transcriptional regulator. In case of the two model bacteria currently implemented in the REACT suite, *B. subtilis* and *E. coli*, many of these regulons are already known.

To simplify the identification of known regulons within a marked group of genes derived from one of the above analyses, the "Show regulons of marked genes" function was implemented in the REACT suite, which displays all regulons to which at least one currently marked gene is associated in an additional window. Moreover, the results window will also list all operons and genes of any identified regulon, thereby providing a direct overview of the coverage of a given regulon within the group of marked genes identified by the cluster analysis. As usual, the identifiers of the regulons, operons and genes function as internal links to the corresponding views, enabling a seamless integration with subsequent analyses of the identified transcriptional units.

<sup>21</sup> For details on clustering, see: http://bonsai.hgc.jp/~mdehoon/software/cluster/manual/Hierarchical.html#Hierarchical

This function therefore offers a very straightforward and easy-to-use approach to identify the regulators responsible for an observed co-expression of a group of genes.
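Conceptually, this function is a lookup of the marked genes in the regulon annotation, followed by a coverage calculation. The Python sketch below shows the idea on a hypothetical annotation; the regulon, operon and gene names, the data layout and the function name are assumptions for illustration and do not reflect the REACT database schema.

```python
# Hypothetical regulon annotation: regulon -> operon -> list of genes.
REGULONS = {
    "RegA": {"opA1": ["a1", "a2"], "opA2": ["a3"]},
    "RegB": {"opB1": ["b1", "b2", "b3"]},
    "RegC": {"opC1": ["c1"]},
}


def regulons_of_marked_genes(marked, regulons):
    """Report every regulon that contains at least one marked gene."""
    report = {}
    for regulon, operons in regulons.items():
        members = [gene for genes in operons.values() for gene in genes]
        hits = [gene for gene in members if gene in marked]
        if hits:
            report[regulon] = {
                "operons": operons,                     # all operons and genes
                "marked": hits,                         # marked members only
                "coverage": len(hits) / len(members),   # fraction that is marked
            }
    return report


marked = {"a1", "a3", "b2"}
report = regulons_of_marked_genes(marked, REGULONS)
for regulon, entry in report.items():
    print(regulon, entry["marked"], round(entry["coverage"], 2))
```

A high coverage suggests that the marked group largely corresponds to a known regulon, whereas a regulon hit by a single gene may be coincidental.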

#### **4.4. Motif analysis tools**


If the above-mentioned function did not yield a direct insight into regulatory principles underlying an observed co-expression, the next step of a typical analysis would be to search for putative regulator binding sites in the upstream genomic regions of co-expressed genes and operons. To facilitate these analyses, the MEME/MAST tools from the MEME (Multiple EM for Motif Elicitation) suite<sup>22</sup> (Bailey *et al.*, 2009) were incorporated into REACT. MEME allows the identification of short overrepresented sequence motifs in a group of unaligned sequences of different length. MAST is a sequence similarity search algorithm that utilizes motifs, either provided by the user or obtained from a previous MEME analysis, to search for similar motifs in genome sequences. Starting from the upstream regions of co-clustering genes, these two tools, if applied in combination, often allow the identification of putative regulator binding sites in novel regulons.

#### *4.4.1. The MEME-Analysis tool*

A prerequisite for any motif search is a collection of (upstream) sequences that are supposed to contain a common motif. In the REACT suite, this is facilitated by the "get upstream" function, which can be found in a number of views, including the *Gene-, Operon-* or *MotifView.* The latter also contains the panels for the MEME and MAST analyses. Again, REACT's motif discovery function is just an embedded interface to these freely available and well-established tools, which are components of the MEME suite. MEME is a tool for discovering motifs in a group of related DNA or protein sequences. It represents motifs as position-specific scoring matrices (PSSMs), which describe the probability of each possible letter at each position within the gapless pattern. MEME uses statistical modelling techniques to automatically choose the best width, number of occurrences, and description for each motif to reduce the number of false-positive hits. Nevertheless, false positives can occur by chance, especially if the motifs are very short, and predicted motifs therefore have to be validated experimentally, both *in vivo* and *in vitro* (Cao *et al.*, 2002).
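The concept of a position-specific scoring matrix can be illustrated with a small Python sketch. Note that it starts from binding sites that are already aligned, whereas MEME discovers the motif in unaligned sequences by expectation maximization; the sketch therefore shows only the matrix representation and its use for scanning, not the MEME algorithm. The sites, the pseudocount, the uniform background of 0.25 and the function names are assumptions for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per motif position with log2-odds scores per letter."""
    width = len(sites[0])
    pssm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for letter in ALPHABET:
            prob = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(prob / background)
        pssm.append(scores)
    return pssm


def best_match(sequence, pssm):
    """Slide the matrix along the sequence; return (score, offset) of the best window."""
    width = len(pssm)
    return max(
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical aligned binding sites of one regulator.
sites = ["TGTCAT", "TGTCAA", "TGACAT", "TGTCAT"]
pssm = build_pssm(sites)

score, offset = best_match("CCGATGTCATGG", pssm)
print(offset, round(score, 2))
```

Short matrices like this one reach high scores by chance in long sequences, which illustrates why hits for very short motifs need independent validation.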

Like other analysis panels of REACT, the MEME view is also divided into two areas: in the upper part, the sequences and analysis parameters can be specified, while the results will be displayed in the lower panel (Fig. 6).

To start a MEME analysis, the user has to provide the sequences (in this case: upstream regions of genes) that are believed to share a common motif. This can be done in one of three ways: (i) selecting upstream sequences from the "Upstream sequence" panel, (ii) directly pasting sequences into the respective sequence window of the MEME interface, or (iii) uploading an external file. The latter two options enable the inclusion of sequences that are not derived from the REACT database. Next, the number of allowed (or expected) motifs per sequence needs to be defined. Additional parameters include (i) the minimum and maximum motif width, (ii) the maximum number of motifs to be discovered, (iii) a statistical threshold value, and (iv) limitation to palindromic sequences.

<sup>22</sup> URL for online access to the complete MEME suite at: http://meme.nbcr.net
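For readers who want to reproduce such a search outside the graphical interface, the parameters listed above correspond roughly to options of the MEME command-line program. The sketch below only assembles an argument list and does not run anything. How REACT itself invokes MEME is not described here, so the mapping from panel fields to options is an assumption, as are the function name, default values and file names; the option names should be checked against the documentation of the installed MEME version.

```python
def build_meme_command(fasta_path, out_dir, mode="zoops", nmotifs=3,
                       min_width=6, max_width=20, evalue_threshold=None,
                       palindromes_only=False):
    """Assemble a MEME command line as a list of arguments (nothing is run).

    mode: "oops"  = exactly one motif occurrence per sequence,
          "zoops" = zero or one occurrence per sequence,
          "anr"   = any number of repetitions.
    """
    if mode not in {"oops", "zoops", "anr"}:
        raise ValueError("mode must be 'oops', 'zoops' or 'anr'")
    if min_width > max_width:
        raise ValueError("min_width must not exceed max_width")
    cmd = ["meme", fasta_path, "-dna", "-oc", out_dir,
           "-mod", mode,
           "-nmotifs", str(nmotifs),
           "-minw", str(min_width),
           "-maxw", str(max_width)]
    if evalue_threshold is not None:
        # Stop reporting motifs once their E-value exceeds this threshold.
        cmd += ["-evt", str(evalue_threshold)]
    if palindromes_only:
        # Restrict the search to palindromic DNA motifs.
        cmd.append("-pal")
    return cmd


# Hypothetical file and directory names.
cmd = build_meme_command("upstream.fasta", "meme_out",
                         evalue_threshold=0.01, palindromes_only=True)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where MEME is installed.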


**Figure 6.** The MEME analysis interface embedded in the *MotifView.*

REACT's MEME results consist of a graphical overview of the analysed sequences (Fig. 6) illustrating the occurrence and position of the motifs. Each motif is described by the following information: a motif ID, the length of the motif, a statistical value as a measure of the reliability of the motif, and a corresponding SequenceLogo as a graphical representation of the motif. As computable definitions, the description also includes the Regular Expression, an alignment of the motif occurrences from the analysed sequences, and the PSSM, all of which can be exported. Alternatively, these definitions can be used directly for a MAST analysis to screen genome sequences from the REACT database for additional upstream regions containing this pattern (described in the following section), or they can be stored in the REACT database for later analyses.

#### *4.4.2. The MAST-Analysis tool*

An important strategy to identify regulon members in large datasets, such as (multiple) genome sequences, is to screen them for the presence of sequence motifs, especially in intergenic regions, that are known or postulated to function as regulator binding sites. Such patterns can be derived from known operator sites described in other, closely related organisms (Wecke *et al.*, 2006), or from motifs identified by MEME analyses from collections of co-expressed loci, as described above. One way of testing predicted motifs *in silico* is to apply them to larger data sets. This not only allows the identification of additional putative matches, but it might also help to improve the motif through iterations. In the REACT suite, this can be done with the MAST analysis interface.

MAST is a tool for searching biological sequence databases for sequences that contain one or more copies of a known motif. The quality of a resulting hit is calculated as the strength of the similarity of the particular sequence to all motifs, based on statistical probabilities. MAST works by calculating match scores for each sequence in the database compared with each of the provided motifs. These initial scores are then converted into statistical probability values, which are used to determine the overall match of the sequence to the group of motifs. By this approach, the best fitting sequences in the analysed data set can ultimately be identified.
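The scoring step can be illustrated with a deliberately simplified Python sketch that slides a log-odds matrix along both strands of a sequence and reports the best-scoring window. This is not MAST's actual algorithm: the conversion of scores into probability values and the combination of several motifs are omitted, and the toy motif, the uniform background and all function names are assumptions of this sketch.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds(ppm, background=0.25):
    """Convert column probabilities into log2-odds against a uniform background."""
    return [{base: math.log2(p / background) for base, p in column.items()}
            for column in ppm]


def best_match(sequence, pssm):
    """Return (score, start, strand) of the best-scoring window.

    For strand "-", `start` refers to the reverse-complemented sequence.
    The sequence must consist of upper-case A, C, G and T only.
    """
    width = len(pssm)
    best = None
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            score = sum(pssm[i][seq[start + i]] for i in range(width))
            if best is None or score > best[0]:
                best = (score, start, strand)
    return best


# Toy motif "TGAAAC": 0.85 for the consensus letter, 0.05 for each other letter.
toy_ppm = [{base: (0.85 if base == letter else 0.05) for base in "ACGT"}
           for letter in "TGAAAC"]
pssm = log_odds(toy_ppm)

score, start, strand = best_match("CCGGTGAAACGGCC", pssm)
print(round(score, 2), start, strand)
```

Scanning both strands matters because a regulator binding site may lie in either orientation relative to the gene it controls.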

The MAST interface of the REACT suite is located within the *MotifView* and resembles the one for the MEME analysis in both appearance and overall logic. Two important parameters need to be defined by the user. The first is the motif, which can be imported directly from a MEME result table, taken from the motifs stored in the REACT database (accessed via the motif table), or imported manually from an external motif definition expressed as a PSSM.

The second important parameter is the sequence database to be searched. REACT contains pre-compiled data files containing all upstream regions of the two currently implemented model organisms as well as of all respective reference organisms. These regions are defined as the 200 bases upstream of the start codon of each gene. Other parameters to be defined are the maximum number of sequences to be displayed, a probability threshold, and whether genes overlapping with the upstream regions should be displayed in the results.
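The definition of an upstream region as the 200 bases preceding the start codon can be sketched in Python as follows. The coordinate convention (0-based, end-exclusive, on the forward strand) and the function names are assumptions of this sketch, and the genome is treated as linear, so genes close to the origin of a circular chromosome would need extra handling. How REACT pre-compiles its upstream files is not described in this chapter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length=200):
    """Return up to `length` bases upstream of a gene, read 5' to 3'.

    `start` and `end` are 0-based, end-exclusive forward-strand coordinates.
    On the "+" strand the start codon begins at `start`; on the "-" strand
    it ends at `end`, so the upstream region lies to the right of the gene.
    """
    if strand == "+":
        return genome[max(0, start - length):start]
    if strand == "-":
        return reverse_complement(genome[end:end + length])
    raise ValueError("strand must be '+' or '-'")


# Tiny invented genome: six upstream bases followed by the gene ATGAAATAA.
genome = "TTGACAATGAAATAA"
print(upstream_region(genome, 6, 15, "+", length=6))

# The same locus placed on the reverse strand yields the same upstream sequence.
print(upstream_region(reverse_complement(genome), 0, 9, "-", length=6))
```

Returning the reverse-strand region as its reverse complement keeps all upstream sequences in the same orientation relative to their gene, which is what a subsequent motif search expects.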

After the analysis has been performed, a graphical overview of the results is displayed in the form of a block diagram. It shows the matching regions for each motif within each sequence, the direction of the match (forward or reverse), the gene ID to which the upstream region belongs, and a probability value indicating the match strength. The information can also be displayed in tabulated form. As usual, the diagram is interactive and provides a direct link to the corresponding gene-specific information.

If additional promising matches are identified, they can then be integrated into a new iteration of creating motifs with the MEME tool and re-checking them with MAST. Again, the integrative nature of REACT enables and simplifies such follow-up analyses.

#### **5. Operating the REACT suite**

We will conclude this chapter with a brief summary of how the REACT suite can be navigated and modified. For this purpose, we will first describe a typical workflow through the features of REACT from the perspective of a user (5.1). In the second section, we will specifically address the rights and options of REACT-administrators (5.2). We will then outline how new organism-specific databases can be embedded (5.3). Finally, we will provide a brief summary of the REACT concept and infrastructure (5.4).

#### **5.1. Navigating REACT: The user approach**

The functionality of the REACT suite relies on curated and comprehensive data provided by the organism-specific REACT database. It provides three different types of data: (i) gene-centric data (derived from genome sequences and their annotation), (ii) array-centric data (extracted from microarray databases and individual sources of transcriptome experiments), and (iii) motif data (based on experimental and computational evidence).

While there are many ways to use the REACT suite, it was developed with the goal in mind to enable the user to identify and characterize regulons starting from in-depth analyses of microarray datasets. Here, we will illustrate a typical workflow through the REACT suite (Fig. 7), in order to highlight the concept of REACT by connecting the central features that have so far been primarily described in isolation in the previous sections.

**Figure 7.** Work flow through the REACT suite. Major views are indicated by light grey, analysis tools by the green colour. The numbers refer to the description in the text.

A typical experiment could start with importing new microarray datasets, which are subsequently analysed in detail by scatter plot or cluster analysis. These initial studies will presumably be performed genome-wide, but with a limited number of relevant microarray datasets. As a result, groups of interesting genes will be identified that respond in a condition-specific manner and could potentially be co-expressed and therefore co-regulated. All unknown genes can be subjected to in-depth analyses, primarily using the information stored in the *GeneView* (including the external links), but to some extent also data from the *OperonView* and *RegulonView*. Moreover, the group of selected genes could also be subjected to a second round of cluster analysis, now incorporating a more diverse set of array conditions on this limited number of genes, to refine the clustering. Genes of interest derived from any of these studies can be selected and thereby added to the list of marked genes. For genes of interest that cannot be associated with known regulons, the upstream regions will be retrieved for the subsequent steps of the regulon annotation. To increase the chance of identifying regulator binding sites, internal BLAST analyses can be performed to extract upstream regions of orthologous genes from closely related species, assuming that they are subject to the same regulation. The collection of upstream regions will then be subjected to a MEME analysis to identify common sequence motifs as candidates for putative regulator binding sites. The motif definitions will be incorporated into the *MotifView* and subsequently used to screen genome sequences for additional candidates, using the MAST tool. If new candidate target genes preceded by the conserved motif are identified, they will then be selected and subjected to the comprehensive studies described above, including cluster analysis to compare their behaviour to that of the group of genes initially selected.

Enabling such iterative and interactive processes that rely on both sequence-based and array-based data and analysis tools is a major advantage of the REACT suite. Because of its concept and architecture, the necessary information and data flow can be controlled easily and the analyses can be performed efficiently.

#### **5.2. Modifying REACT: The administrator mode**

In the age of omics, new genome sequences and microarray studies are published with ever-increasing speed. It is therefore important that a REACT database, once established, can be updated regularly to grow with the increase of available data and information. However, as a precautionary measure to avoid data corruption and thereby ensure the integrity of the database, it is advisable that not all users have the right to modify the core data at all times during analyses. REACT has therefore implemented two different user roles. The REACT-user normally works in "read-only" mode, which allows them to browse the data, perform analyses, and export data to external files. In contrast, logging in as a REACT-administrator enables the user to permanently import additional data (such as microarray datasets or new reference genomes), to edit data already implemented in the REACT database, and even to change the main views of REACT by incorporating additional links and features.

When logged in as REACT-administrator, most data displayed in the different views can be edited manually. To prevent unintentional data corruption, data can only be changed after deliberately switching into the edit-mode via the appropriate buttons provided in each view. In the edit-mode, all editable data is displayed in green and all links are disabled. Any changes applied to the data remain transient until they are confirmed by the REACT-administrator, whereupon they are sent to the REACT-server and stored permanently.

However, some data fields are not editable, as REACT uses them as immutable internal references (e.g. as primary and secondary database keys) to identify the complete dataset. This includes the names and IDs of genes, operons, or microarray datasets, as well as DNA and amino acid sequences from the *GeneView*, which are derived from and defined by the respective genomic sequence.
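The editing behaviour described in this section (changes only in edit-mode, transient until confirmed, key fields protected) can be summarised in a small Python sketch. It is a conceptual illustration and not REACT's implementation, which is a Java client/server application; the class name, method names and field names are all hypothetical.

```python
class EditableRecord:
    """Buffers edits until they are confirmed; key fields stay read-only."""

    # Hypothetical field names standing in for internal reference fields.
    IMMUTABLE = frozenset({"gene_id", "name", "dna_sequence"})

    def __init__(self, stored):
        self._stored = dict(stored)   # state as held by the server
        self._pending = {}            # transient, unconfirmed changes
        self.edit_mode = False

    def enter_edit_mode(self):
        self.edit_mode = True

    def set(self, field, value):
        if not self.edit_mode:
            raise PermissionError("switch to edit-mode before changing data")
        if field in self.IMMUTABLE:
            raise ValueError(f"field '{field}' is an immutable reference")
        self._pending[field] = value

    def get(self, field):
        """Return the value as displayed, including unconfirmed edits."""
        return self._pending.get(field, self._stored[field])

    def confirm(self):
        """Make pending changes permanent and leave edit-mode."""
        self._stored.update(self._pending)
        self._pending.clear()
        self.edit_mode = False

    def discard(self):
        """Drop pending changes and leave edit-mode."""
        self._pending.clear()
        self.edit_mode = False

    @property
    def stored(self):
        return dict(self._stored)


record = EditableRecord({"gene_id": "g0001", "product": "unknown"})
record.enter_edit_mode()
record.set("product", "putative sigma factor")
print(record.get("product"), "|", record.stored["product"])  # edit is still transient
record.confirm()
print(record.stored["product"])
```

Keeping unconfirmed edits in a separate buffer means that an abandoned editing session leaves the stored data untouched.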

A REACT-administrator can also define new data fields for the above-listed views according to individual requirements, including plain text fields and numeric fields. Moreover, new external links can be added to the views. While it is quite easy to generate plain text or numeric fields (as just the field name and type have to be defined within the REACT-administrator dialogue), the creation of additional link fields is technically a bit more demanding.

In addition to the aforementioned options, REACT-administrators can import additional array data, create new array sets or change the assignments of arrays to a set. They can also store motifs computed during a MEME analysis permanently within the REACT database or define new regulons. In the *RegulonView*, new operons can be connected to or removed from the regulons.

#### **5.3. Expanding REACT: Embedding new organism-specific databases**

REACT was initially developed for the analysis of two model organisms, *E. coli* and *B. subtilis*. But given the wealth of knowledge already available for these organisms, the potential of REACT may be even higher when applied to genomic and expression data of other, less well-characterized organisms.

Therefore, REACT is equipped with a small set of additional tools that enable researchers with little knowledge of programming languages or database administration to create new organism-specific REACT databases from scratch. Following the instructions provided by the software, the user has to download freely available files from sources such as the NCBI, UniProt or MicrobesOnline databases that contain the data used by REACT. Additional information (e.g. links to Pfam or PDB) will be obtained from the KEGG web service via SOAP/WSDL, again without the need for more than very basic user interaction.

After the creation of an initial, empty REACT database (done by importing a provided SQL file into the SQL database), the information contained in the downloaded files and provided by KEGG is parsed by helper tools provided with the REACT package, again minimizing user interaction.

Users with basic programming knowledge will then be able to extend the new REACT database by parsing data from additional data sources, depending on the organism chosen and the focus of the respective database. Subsequently, additional data needed by REACT (e.g. internal BLAST databases) will be computed automatically. The user will then be able to connect to this newly created REACT database in order to upload the first array sets.

#### **5.4. Developing REACT: Concept, sources and infrastructure**

As mentioned previously, the major aim of the REACT bioinformatics toolkit was the creation of an intuitive and interactive graphical user interface that allows an integrative view on genomic and microarray data and provides combined access to various bioinformatics tools commonly used in comparative genomic and transcriptomic studies. The overall structure of the REACT suite is illustrated in Fig. 8.

In the current release, the tools listed in Table 1 are integrated into the REACT suite. The software was implemented using a client/server architecture, enabling the parallel and locally distributed work of one to multiple users (Fig. 8). The REACT-server is the central computer running the database-managing software (MySQL), as well as all internal and integrated third-party analysis tools. The users work solely with the corresponding client program, which can be installed on their personal computers or laptops. Client and server communicate via intranet or internet using remote method invocation (RMI) techniques. REACT is implemented as a Java Swing application; client and server should therefore run under a variety of operating systems, depending only on the Java Runtime Environment (Version 5 or higher). However, in the case of the server, this is limited by the external tools, as some of them (e.g. the MEME suite) depend on a Linux/Unix environment. To circumvent this limitation, REACT was developed and tested to be executable on Windows using Cygwin (1.5.x or higher), a Linux-like environment for Windows that provides substantial Linux API functionality.

**Figure 8.** Structure, components and data flow of the REACT suite. See text for details.


Bailey, T. L., M. Boden, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li & W. S. Noble, (2009) MEME SUITE: tools for motif discovery and searching. *Nucleic Acids Res* 37: W202-208.

Bairoch, A., (2000) The ENZYME database in 2000. *Nucleic Acids Res* 28: 304-305.

Barrett, T., D. B. Troup, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, R. N. Muertter, M. Holko, O. Ayanbule, A. Yefanov & A. Soboleva, (2011) NCBI GEO: archive for functional genomics data sets

Brazma, A., (2009) Minimum Information About a Microarray Experiment (MIAME)--successes, failures, challenges. *TheScientificWorldJournal* 9: 420-423.

Brazma, A., P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C. Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo & M. Vingron, (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. *Nature Genetics* 29: 365-371.

Cao, M., P. A. Kobel, M. M. Morshedi, M. F. Wu, C. Paddon & J. D. Helmann, (2002) Defining the *Bacillus subtilis* σW regulon: a comparative analysis of promoter consensus search, run-off transcription/macroarray analysis (ROMA), and transcriptional profiling approaches. *J Mol Biol* 316: 443-457.

de Hoon, M. J. L., S. Imoto, J. Nolan & S. Miyano, (2004) Open source clustering software. *Bioinformatics* 20: 1453-1454.

Dehal, P. S., M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, D. Chivian, G. D. Friedland, K. H. Huang, K. Keller, P. S. Novichkov, I. L. Dubchak, E. J. Alm & A. P. Arkin, (2010) MicrobesOnline: an integrated portal for comparative and functional genomics. *Nucleic Acids Res* 38: D396-400.

Demeter, J., C. Beauheim, J. Gollub, T. Hernandez-Boussard, H. Jin, D. Maier, J. C. Matese, M. Nitzberg, F. Wymore, Z. K. Zachariah, P. O. Brown, G. Sherlock & C. A. Ball, (2007) The Stanford Microarray Database: implementation of new analysis tools and open source release of software. *Nucleic Acids Res* 35: D766-770.

Finn, R. D., J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy & A. Bateman, (2010) The Pfam protein families database. *Nucleic Acids Res* 38: D211-222.

Gama-Castro, S., H. Salgado, M. Peralta-Gil, A. Santos-Zavaleta, L. Muniz-Rascado, H. Solano-Lira, V. Jimenez-Jacinto, V. Weiss, J. S. Garcia-Sotelo, A. Lopez-Fuentes, L. Porron-Sotelo, S. Alquicira-Hernandez, A. Medina-Rivera, I. Martinez-Flores, K. Alquicira-Hernandez, R. Martinez-Adame, C. Bonavides-Martinez, J. Miranda-Rios, A. M. Huerta, A. Mendoza-Vargas, L. Collado-Torres, B. Taboada, L. Vega-Alvarado, M. Olvera, L. Olvera, R. Grande, E. Morett & J. Collado-Vides, (2011) RegulonDB version 7.0: transcriptional regulation of *Escherichia coli* K-12 integrated within genetic sensory response units (Gensor Units). *Nucleic Acids Res* 39: D98-105.

Lechat, P., L. Hummel, S. Rousseau & I. Moszer, (2008) GenoList: an integrated environment for comparative analysis of microbial genomes. *Nucleic Acids Res* 36: D469-474.


**Table 1.** Third party software tools implemented in the REACT suite. "n.a.", not applicable.

#### **6. Conclusion**

This chapter aimed at providing a thorough overview of the concept and functions of the REACT suite, a bioinformatics toolkit that was developed to simplify regulon predictions and comparative transcriptomic analyses for biologists with little to no background in bioinformatics. REACT was written in the believe that it will provide a powerful, yet simple-to-use platform that will hopefully also support the work of other research groups in extracting meaningful data from transcriptome studies with the help of comparative genomics. The complete REACT suite, including the databases for *B. subtilis* and *E. coli*, are available from the corresponding author upon request.

### **Author details**

Peter Ricke and Thorsten Mascher\* *Ludwig-Maximilians-University Munich, Germany* 

### **Acknowledgement**

The authors would like to thank Tina Wecke for beta-testing of the REACT suite, providing the figures and critical reading of the manuscript. Work in the Mascher lab is financially supported by grants from the Deutsche Forschungsgemeinschaft (DFG). Development of the REACT suite was enabled by funding from the 'Concept for the future' of the Karlsruhe Institute of Technology (KIT) within the framework of the German Excellence Initiative.

#### **7. References**

Altschul, S. F., W. Gish, W. Miller, E. W. Myers & D. J. Lipman, (1990) Basic local alignment search tool. *J Mol Biol* 215: 403-410.

<sup>\*</sup> Corresponding Author

The REACT Suite: A Software Toolkit for Microbial *RE*gulon *A*nnotation and *C*omparative *T*ranscriptomics 49


48 Functional Genomics

**6. Conclusion** 

**Author details** 

**Acknowledgement** 

**7. References** 

Corresponding Author

 \*

Peter Ricke and Thorsten Mascher\*

**Name Version Link Reference**  Blast 2.2.x ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ (Altschul *et al.*,

Cluster 3 3.0.x http://bonsai.hgc.jp/~mdehoon/software/cluster/ (de Hoon *et al.*,

MEME suite 4.0.0 http://meme.sdsc.edu/meme/meme-download.html (Bailey *et al.*,

This chapter aimed at providing a thorough overview of the concept and functions of the REACT suite, a bioinformatics toolkit that was developed to simplify regulon predictions and comparative transcriptomic analyses for biologists with little to no background in bioinformatics. REACT was written in the believe that it will provide a powerful, yet simple-to-use platform that will hopefully also support the work of other research groups in extracting meaningful data from transcriptome studies with the help of comparative genomics. The complete REACT suite, including the databases for *B. subtilis* and *E. coli*, are

The authors would like to thank Tina Wecke for beta-testing of the REACT suite, providing the figures and critical reading of the manuscript. Work in the Mascher lab is financially supported by grants from the Deutsche Forschungsgemeinschaft (DFG). Development of the REACT suite was enabled by funding from the 'Concept for the future' of the Karlsruhe Institute of Technology (KIT) within the framework of the German Excellence Initiative.

Altschul, S. F., W. Gish, W. Miller, E. W. Myers & D. J. Lipman, (1990) Basic local alignment

Cygwin 1.7.5.x http://www.cygwin.com/install.html n.a. MySQL 5.5.x http://www.mysql.de/downloads/mysql/ n.a. **Table 1.** Third party software tools implemented in the REACT suite. "n.a.", not applicable.

available from the corresponding author upon request.

*Ludwig-Maximilians-University Munich, Germany* 

search tool. *J Mol Biol* 215: 403-410.

1990)

2004)

2009)


Letunic, I., T. Doerks & P. Bork, (2009) SMART 6: recent updates and new developments. *Nucleic Acids Res* 37: D229-232.

Liolios, K., I. M. Chen, K. Mavromatis, N. Tavernarakis, P. Hugenholtz, V. M. Markowitz & N. C. Kyrpides, (2010) The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. *Nucleic Acids Res* 38: D346-354.

Rose, P. W., B. Beran, C. Bi, W. F. Bluhm, D. Dimitropoulos, D. S. Goodsell, A. Prlic, M. Quesada, G. B. Quinn, J. D. Westbrook, J. Young, B. Yukich, C. Zardecki, H. M. Berman & P. E. Bourne, (2011) The RCSB Protein Data Bank: redesigned web site and web services. *Nucleic Acids Res* 39: D392-401.

Sierro, N., Y. Makita, M. de Hoon & K. Nakai, (2008) DBTBS: a database of transcriptional regulation in *Bacillus subtilis* containing upstream intergenic conservation information. *Nucleic Acids Res* 36: D93-96.

Sigrist, C. J., L. Cerutti, E. de Castro, P. S. Langendijk-Genevaux, V. Bulliard, A. Bairoch & N. Hulo, (2010) PROSITE, a protein domain database for functional characterization and annotation. *Nucleic Acids Res* 38: D161-166.

Tatusov, R. L., E. V. Koonin & D. J. Lipman, (1997) A genomic perspective on protein families. *Science* 278: 631-637.

Utts, J. M., (2005) *Seeing through statistics*. Thomson Brooks.

Wecke, T., B. Veith, A. Ehrenreich & T. Mascher, (2006) Cell envelope stress response in *Bacillus licheniformis*: Integrating comparative genomics, transcriptional profiling, and regulon mining to decipher a complex regulatory network. *J. Bacteriol.* 188: 7500-7511.

**Chapter 3**

## **Analysis of Gene Expression Data Using Biclustering Algorithms**

Fadhl M. Al-Akwaa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48150

© 2012 Al-Akwaa, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

One of the main research areas of bioinformatics is functional genomics, which focuses on the interactions and functions of each gene and its products (mRNA, protein) across the whole genome (the entire genetic sequence encoded in the DNA and responsible for the hereditary information). In order to identify the function of a given gene, we should be able to capture its expression, which describes how genetic information is converted into a functional gene product through transcription and translation. Functional genomics uses microarray technology to measure gene expression levels under certain conditions and environmental limitations. In the last few years, the microarray has become a central tool in biological research. Consequently, the corresponding data analysis has become one of the important disciplines in bioinformatics. The analysis of microarray data involves a large number of exploratory statistical techniques, including **clustering** and **biclustering** algorithms, which help to identify similar patterns in gene expression data and to group genes and conditions into subsets that share biological significance.

#### **1.1. What is Clustering?**

A large number of clustering definitions can be found in the literature. The simplest definition is shared among all and includes one fundamental concept: the grouping together of similar data items into clusters[1].

Clustering is an important exploratory statistical analysis of gene expression data. It aims to identify and group genes that exhibit similar expression patterns over several conditions, and also to group the conditions based on their expression profiles across a set of genes. A successful clustering approach should satisfy two criteria: homogeneity (high similarity between elements in the same cluster) and separation (low similarity between elements from different clusters). When homogeneity and separation are precisely defined, they are two opposing objectives: the better the homogeneity, the poorer the separation, and vice versa [2]. Several algorithmic techniques have been used for clustering gene expression data, including hierarchical clustering [3], self-organizing maps [4], and graph-theoretic approaches [5].


#### **1.1.1. K-means**

K-means is a classical clustering algorithm [6], invented in 1956, that groups objects (genes) based on their attributes or features (experimental conditions) into K groups (clusters). K is a positive integer and is assumed to be known.

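The K-means computation starts by placing K points into the space represented by the objects that are being clustered. These points represent the initial group centroids. Any K random objects, or simply the first K objects in sequence, can be used as the initial centroids. The K-means algorithm then repeats the four steps below until convergence:

1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids using the Euclidean distance.
3. Group the objects based on minimum distance.
4. Iterate the above steps until no object changes its assigned group.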


Each iteration of k-means modifies the current partition by checking all possible modifications of the solution in which one element is moved to another cluster, with the aim of reducing the sum of distances between objects and the centers of their clusters. This procedure is repeated until no further improvement is achieved (no object changes its group) and all the objects are grouped into the final required number of clusters.

A disadvantage of the K-means algorithm is the need to specify the number of clusters K as a parameter prior to running the algorithm. In cases where there is no prior expectation about K, the user has to make trials with several values of K or use external techniques to estimate the number of clusters that may exist.
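
The K-means iteration described above can be illustrated with a minimal sketch in plain Python. It is not taken from any particular package: the function names `euclidean` and `kmeans`, the `max_iter` and `seed` parameters, and the toy expression profiles are assumptions made for illustration only.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(profiles, k, max_iter=100, seed=0):
    """Group expression profiles (one list of floats per gene) into k clusters."""
    rng = random.Random(seed)
    # Initial centroids: k randomly chosen objects (one of the options in the text).
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [None] * len(profiles)
    for _ in range(max_iter):
        # Measure the distance of each object to every centroid and
        # assign the object to the nearest one.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Convergence: no object changed its assigned group.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: four genes measured under three conditions.
genes = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
labels, centroids = kmeans(genes, k=2)
```

On this toy input the first two genes end up in one cluster and the last two in the other; which cluster receives which index depends on the random initial centroids.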

#### *1.1.2. Hierarchical clustering (HCL)*

Hierarchical clustering does not partition the genes into subsets. Instead, it builds a bottom-up hierarchy of clusters using agglomerative methods, or a top-down hierarchy of clusters using divisive methods. The traditional graphical representation of this hierarchy is a tree called a dendrogram. The divisive method begins at the root and breaks up clusters that have low similarity, whereas the agglomerative method begins at the leaves of the tree: it starts with an initial partition into single-element clusters and successively merges clusters until all elements belong to the same cluster [3] (see Figure 1). The agglomerative method is more widely used than the divisive one, which is not generally available and has rarely been applied. The idea of the agglomerative method can be summarized as follows. Given a set of N items (genes in our case) to be clustered and an N×N distance (or similarity) matrix [7]:

1. Assign each item to a cluster, so that you have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.


In Step 3, distance or similarity measurements between the merged clusters and all the other clusters can be calculated in one of three schemes: single-linkage, complete linkage and average-linkage.
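
The agglomerative procedure and the three linkage schemes can be sketched as follows. This is a deliberately naive illustration under stated assumptions: the names `cluster_distance` and `agglomerate` are hypothetical, Euclidean distance is assumed, and cluster distances are recomputed from the raw profiles at every pass instead of updating a stored N×N matrix, which keeps the code short but is slower than a production implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(a, b, items, linkage):
    """Distance between clusters a and b, given as lists of item indices."""
    pairwise = [euclidean(items[i], items[j]) for i in a for j in b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(items, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(items))]  # every item starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y], items, linkage)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        # Replace the two clusters by their union and continue.
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical toy data: four genes measured under three conditions.
genes = [[1.0, 1.1, 0.9], [0.9, 1.0, 1.0], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
merges = agglomerate(genes, linkage="single")
```

The merge history is the information a dendrogram displays: each tuple records which two clusters were joined and at what distance.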

**Figure 1.** HCL: Agglomerative and Divisive Methods.

#### **1.2. Biclustering**


Traditional clustering approaches such as k-means and hierarchical clustering put each gene in exactly one cluster, based on the assumption that all genes behave similarly in all conditions. However, recent understanding of cellular processes shows that a subset of genes can be co-expressed under certain experimental conditions and, at the same time, behave almost independently under other conditions. In this context, a new two-mode clustering approach called biclustering or co-clustering has been introduced to group the genes and conditions in both dimensions simultaneously.

This allows finding subgroups of genes that show the same response under a subset of conditions, not all conditions. Also, genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another.

For example, if a cellular process is only active under specific conditions and a gene participates in multiple pathways that are differentially regulated, one would expect this gene to be included in more than one cluster; this cannot be achieved by traditional clustering techniques.

Many biclustering methods exist in the literature [8]. Table 1 summarizes some of the promising biclustering algorithms developed during the last ten years. In brief, we describe some of these algorithms, selected according to their prediction strength, their promising results, how widely they are used in the community, whether an implementation was available, and the feedback from their authors to explain some ambiguous issues.

#### *1.2.1. Cheng and Church (CC)*

The CC algorithm [18] is considered to be the first real biclustering implementation, after the basic idea was introduced by Hartigan [19] in 1972.



CC defines a bicluster as a subset of rows and a subset of columns with high similarity. The proposed similarity score is called the mean squared residue (H), and it is used to measure the coherence of the rows and columns in a single bicluster. Given the gene expression data matrix A = (X;Y), a bicluster is defined as a uniform submatrix (I;J) having a low mean squared residue score, as follows:

The CC Mean Squared Residue:

$$H\left(I,J\right) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2$$

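where $a_{ij}$ is the gene expression level at row i and column j, $a_{iJ}$ is the mean of row i, $a_{Ij}$ is the mean of column j, and $a_{IJ}$ is the overall mean of the submatrix. The CC algorithm identifies a submatrix as a bicluster if its score is below a level alpha, a user-supplied parameter that controls the quality of the output biclusters. In general, the CC algorithm performs the following major steps:

1. Delete rows and columns with a score larger than alpha.
2. Add rows or columns until the alpha level is reached.
3. Iterate these steps until a maximum number of biclusters is reached or no bicluster is found [18].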
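
To make the score concrete, the following sketch computes the mean squared residue H(I, J) of a candidate submatrix directly from the definition above. It covers only the score, not the row and column deletion and addition steps of the CC algorithm; the function name `mean_squared_residue` and the two small matrices are assumptions chosen for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the submatrix given by rows I and columns J."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]          # a_iJ
    col_means = [sum(c) / n_rows for c in zip(*sub)]    # a_Ij
    overall = sum(row_means) / n_rows                   # a_IJ
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical example 1: every row is the first row plus a constant shift,
# so the submatrix is perfectly coherent and H is zero up to rounding.
coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
h_coherent = mean_squared_residue(coherent, rows=[0, 1, 2], cols=[0, 1, 2])

# Hypothetical example 2: the third row breaks the pattern, so H is clearly positive.
noisy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [9.0, 0.0, 5.0]]
h_noisy = mean_squared_residue(noisy, rows=[0, 1, 2], cols=[0, 1, 2])

# Dropping the offending row restores a coherent submatrix.
h_trimmed = mean_squared_residue(noisy, rows=[0, 1], cols=[0, 1, 2])
```

A submatrix would be accepted as a bicluster when its score falls below the user-chosen alpha; in the second example, removing the third row is the kind of deletion that lowers the score.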


#### *1.2.2. Iterative Signature Algorithm (ISA)*

The ISA algorithm [17, 20] is a novel method for the biclustering analysis of large-scale expression data. It is an efficient algorithm based on the iterative application of the signature algorithm presented in [17]. ISA considers a bicluster to be a transcription module, defined as a set of co-expressed genes together with the associated set of regulating conditions (Figure 2). Starting with an initial set of genes, all samples (conditions) are scored with respect to this gene set, and those samples whose score exceeds a certain threshold (usually defined by the user) are chosen. In the same way, all genes are scored with respect to the selected samples, and a new set of genes is selected based on another user-defined threshold. The entire procedure is repeated until the set of genes and the set of samples converge and no longer change.

Multiple biclusters can be discovered by running the ISA algorithm on several initial gene sets. This approach requires the identification of a reference gene set, which needs to be carefully selected to obtain good-quality results. In the absence of a pre-specified reference gene set, a random set of genes is selected, at the cost of result quality [17].
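
The alternating scoring scheme can be sketched in a strongly simplified form. The sketch below is not the published ISA: it assumes that a score is simply the mean expression over the currently selected genes or conditions and that both thresholds are absolute values, whereas the original method works on normalized data. The function name `isa_sketch`, its parameters and the toy matrix are hypothetical.

```python
def isa_sketch(matrix, seed_genes, cond_threshold, gene_threshold, max_iter=50):
    """Alternate between selecting conditions and genes until both sets are stable."""
    n_genes, n_conds = len(matrix), len(matrix[0])
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        # Score every condition against the current gene set (mean expression)
        # and keep the conditions whose score exceeds the condition threshold.
        new_conds = {
            c for c in range(n_conds)
            if genes
            and sum(matrix[g][c] for g in genes) / len(genes) > cond_threshold
        }
        # Score every gene against the selected conditions and keep the genes
        # whose score exceeds the gene threshold.
        new_genes = {
            g for g in range(n_genes)
            if new_conds
            and sum(matrix[g][c] for c in new_conds) / len(new_conds) > gene_threshold
        }
        # Converged: neither set changed in this round.
        if new_genes == genes and new_conds == conds:
            break
        genes, conds = new_genes, new_conds
    return sorted(genes), sorted(conds)


# Hypothetical toy matrix (rows: genes, columns: conditions).
# Genes 0 and 1 are jointly induced under conditions 0 and 1.
expression = [
    [2.0, 2.1, 0.1, 0.0],
    [1.9, 2.0, 0.0, 0.2],
    [0.1, 0.0, 0.1, 0.1],
    [0.0, 0.2, 1.8, 0.1],
]
module_genes, module_conds = isa_sketch(
    expression, seed_genes=[0], cond_threshold=1.0, gene_threshold=1.0
)
```

Starting from gene 0 alone, the iteration recruits gene 1 and settles on conditions 0 and 1. Running the same function from several different seed sets corresponds to the search for multiple modules mentioned above.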

**Figure 2.** The recurrence signature method. a, The signature algorithm. b, Recurrence as a reliability measure. The signature algorithm is applied to distinct input sets containing different subsets of the postulated transcription module. If the different input sets give rise to the same module, it is considered reliable. c, General application of the recurrent signature method. Copyright © [17].

| Algorithm | Approach | Time Complexity | Prediction ability |
|---|---|---|---|
| Bivisu [9] | Exhaustive Bicluster Enumeration | O(m<sup>2</sup>n log m)<sup>a</sup> | Coherent values |
| MSBE [10] | Greedy Iterative Search | O((n + m)<sup>2</sup>) | Coherent values |
| Bimax [11] | Divide-and-Conquer | O(nmβ log β) | Coherent values |
| ROBA [12] | Matrix algebra | O(nmLN) | Coherent Evolution |
| x-motif [13] | Greedy Iterative Search | nmO(log(1/α)/log(1/β)) | Coherent Evolution |
| SAMBA [14] | Exhaustive Bicluster Enumeration | O(n<sup>2</sup>) | Coherent Evolution |
| OPSM [15] | Greedy Iterative Search | O(nm<sup>3</sup>I) | Coherent Evolution |
| Plaid [16] | Distribution Parameter Identification | XXX<sup>b</sup> | Coherent values |
| ISA [17] | Iterative Signature Algorithm | XXX | Coherent values |
| CC [18] | Greedy Iterative Search | O((n + m)nm) | Coherent values |

<sup>a</sup> n and m are the row and column sizes of the expression matrix. <sup>b</sup> Not available.

**Table 1.** Biclustering Algorithms Comparison.

#### *1.2.3. Biclusters Inclusion Maximal (Bimax)*

Bimax [11] is a simple binary model combined with a fast divide-and-conquer algorithm for clustering gene expression data. It was presented in 2006 by the Computer Engineering and Networks Laboratory, ETH Zurich, Switzerland. Bimax discretizes the gene expression data matrix, converting it into a binary matrix by applying a threshold, so that transcription levels (gene expression values) above this threshold become ones and transcription levels below it become zeros (or vice versa). It then searches for all possible biclusters that contain only ones. This can be done by iterating these steps:

1. Rearrange the rows and columns to concentrate ones in the upper right of the matrix.


Recently, Kevin *et al.* [25] proposed a semantic web algorithm to recommend the best algorithm based on user inputs such as: does the dataset contain outliers, is it allowed to get

Generally, comparing different biclustering algorithms is not straightforward, as they differ in strategies, approaches, time complexity, number of parameters and prediction ability. In addition, they are strongly influenced by user-selected parameter values. For these reasons, the quality of biclustering results is often considered more important than the required computation time. Although there are some analytical comparative studies that evaluate the traditional clustering algorithms [21-23], no such extensive comparison exists for biclustering, even after initial trials have been made [11]. In the end, biological merit is the main criterion

In this chapter we attempt to develop a comparative tool (BicAT-Plus), shown in Figure 3, that includes the biological comparative methodology and serves as an extension to

The goal of BicAT-Plus is to enable researchers and biologists to compare the different biclustering methods based on a set of biological merits and to draw conclusions on the biological meaning of the results. In addition, BicAT-Plus helps researchers to compare and evaluate the algorithms' results multiple times according to the user-selected parameter

Algorithms required to be compared could be selected from the biclustering list (left list) to the compared list (right list). External biclustering results for other algorithms could be included in the comparison process. In addition, the organism model, selectable significance level, and GO category should be selected. Finally, Comparison criteria have to be selected

1. User could perform biclusters functional analysis using the three Gene Ontology (GO) categories (biological process, molecular function and cellular component) (Figure3

2. User could evaluate the quality of each biclustering algorithm results after applying the GO functional analysis and display the percentage of the enriched biclusters at different

3. User could compare between the different biclustering algorithms according to the percentage of the functionally enriched biclusters at the required significance levels, the selected GO category and with certain filtration criteria for the GO terms. (Figure3 with

4. User could evaluate and compare the results of external biclustering algorithms. This gives the BicAT-plus the advantage to be a generic tool that does not depend on the employed methods only. For example, it can be used to evaluate the quality of the new

overlapped clusters and the time to retrieve the biclusters.

the BicAT program[26].

based on the user biological metric.

P-values (Figure3 with label number 2).

with label number 1).

label number 3).

for evaluation and comparison between the various biclustering methods.

values as well as the required biological perspective on various datasets.

BicAT-Plus has many features, which could be summarized in the following:-


#### *1.2.4. Order Preserving Submatrix(OPSM)*

The order-preserving submatrix (OPSM) algorithm [15] is a probabilistic model introduced to discover a subset of genes identically ordered among a subset of conditions. It focuses on the coherence of the relative order of the conditions rather than the coherence of actual expression levels. In other words, the expression values of the genes within a bicluster induce an identical linear ordering across the selected conditions. Accordingly, the authors define a bicluster as a subset of rows whose values induce a linear order across a subset of the columns. The time complexity of this model is O(nm3I) where n andmare the number of rows and columns of the input gene expression matrix respectively and I is the number of biclusters. A disadvantage of OPSM algorithm is that it takes long time for high dimensional datasets. And this is because its time complexity is cubic with regards to the number of columns (dimensions) of the input matrix [15].

#### *1.2.5. Maximum Similarity Bicluster(MSBE)*

MSBE Biclustering algorithm [10] is a novel polynomial time algorithm to find an optimal biclusters with the maximum similarity. The idea behind this algorithm is to find subset of genes that are related to a reference gene. The reference gene is known in advance. MSBE algorithm uses the similarity score for a sub-matrix to find the similar expressions in the microarray datasets. And the threshold of the average similarity score is a user input parameter in order to allow the user to control the quality of the biclustering results.

#### **1.3. Clustering or biclustering**

Clustering algorithms [21-23] have been used to analyze gene expression data, on the basis that genes showing similar expression patterns can be assumed to be co-regulated or part of the same regulatory pathway. Unfortunately, this is not always true. Two limitations obstruct the use of clustering algorithms with microarray data. First, all conditions are given equal weights in the computation of gene similarity; in fact, most conditions do not contribute information but instead increase the amount of background noise. Second, each gene is assigned to a single cluster, whereas in fact genes may participate in several functions and should thus be included in several clusters[24].

A new modified clustering approach to uncovering processes that are active over some but not all samples has emerged, which is called biclustering. A bicluster is defined as a subset of genes that exhibit compatible expression patterns over a subset of conditions [11].

During the last ten years, many biclustering algorithms have been proposed (see [8] for a survey), but the important questions are: which algorithm is better? And do some algorithms have advantages over others?

Recently Kevin *et al.*[25]proposed a semantic web algorithm to recommend the best algorithm based on user inputs like: is the dataset contain outliers, is it allowed to get overlapped clusters and the time to retrieve the biclusters.

56 Functional Genomics

2. Divide the matrix into two sub matrices.

*1.2.4. Order Preserving Submatrix(OPSM)* 

columns (dimensions) of the input matrix [15].

*1.2.5. Maximum Similarity Bicluster(MSBE)* 

**1.3. Clustering or biclustering** 

algorithms have advantages over others?

1. Rearrange the rows and columns to concentrate ones in the upper right of the matrix.

3. Whenever in one of the submatrices only ones are found, this sub matrix is returned.

The order-preserving submatrix (OPSM) algorithm [15] is a probabilistic model introduced to discover a subset of genes identically ordered among a subset of conditions. It focuses on the coherence of the relative order of the conditions rather than the coherence of actual expression levels. In other words, the expression values of the genes within a bicluster induce an identical linear ordering across the selected conditions. Accordingly, the authors define a bicluster as a subset of rows whose values induce a linear order across a subset of the columns. The time complexity of this model is O(nm3I) where n andmare the number of rows and columns of the input gene expression matrix respectively and I is the number of biclusters. A disadvantage of OPSM algorithm is that it takes long time for high dimensional datasets. And this is because its time complexity is cubic with regards to the number of

MSBE Biclustering algorithm [10] is a novel polynomial time algorithm to find an optimal biclusters with the maximum similarity. The idea behind this algorithm is to find subset of genes that are related to a reference gene. The reference gene is known in advance. MSBE algorithm uses the similarity score for a sub-matrix to find the similar expressions in the microarray datasets. And the threshold of the average similarity score is a user input

Clustering algorithms [21-23] have been used to analyze gene expression data, on the basis that genes showing similar expression patterns can be assumed to be co-regulated or part of the same regulatory pathway. Unfortunately, this is not always true. Two limitations obstruct the use of clustering algorithms with microarray data. First, all conditions are given equal weights in the computation of gene similarity; in fact, most conditions do not contribute information but instead increase the amount of background noise. Second, each gene is assigned to a single cluster, whereas in fact genes may participate in several

A new modified clustering approach to uncovering processes that are active over some but not all samples has emerged, which is called biclustering. A bicluster is defined as a subset

During the last ten years, many biclustering algorithms have been proposed (see [8] for a survey), but the important questions are: which algorithm is better? And do some

of genes that exhibit compatible expression patterns over a subset of conditions [11].

parameter in order to allow the user to control the quality of the biclustering results.

functions and should thus be included in several clusters[24].

Generally, comparing different biclustering algorithms is not straightforward as they differ in strategies, approaches, time complicity, number of parameters and prediction ability. In addition, they are strongly influenced by user selected parameter values. For these reasons, the quality of biclustering results is often considered more important than the required computation time. Although there are some analytical comparative studies to evaluate the traditional clustering algorithms[21-23], for biclustering; no such extensive comparison exist even after initial trails have been taken [11]. In the end, Biological merit is the main criterion for evaluation and comparison between the various biclustering methods.

In this chapter we attempt to develope a comparative tool (Bicat-Plus) which is showen in Figure 3 that includes the biological comparative methodology and to be as an extension to the BicAT program[26].

The Goal of BicAT-Plus is to enable researchers and biologists to compare between the different biclustering methods based on set of biological merits and draw conclusion on the biological meaning of the results. In addition, BicAT-Plus help researchers in comparing and evaluating the algorithms results multiple times according to the user selected parameter values as well as the required biological perspective on various datasets.

BicAT-Plus has many features, which could be summarized in the following:-

Algorithms required to be compared could be selected from the biclustering list (left list) to the compared list (right list). External biclustering results for other algorithms could be included in the comparison process. In addition, the organism model, selectable significance level, and GO category should be selected. Finally, Comparison criteria have to be selected based on the user biological metric.


algorithms introduced to the field and compare against the existing ones. (Figure 3 with label number 4).


5. The user can display the results as graphical and statistical charts in multiple modes (2D and 3D).


**Figure 3.** BicAT-Plus Comparison Panel.

#### **2. Materials and methods**

Before using BicAT-Plus, ActivePerl version 5.10 and Java Runtime Environment (JRE) version 6 must be installed on your machine. BicAT-Plus has been tested and shows good performance on a PC with the following configuration: CPU: Pentium 4, 1.5 GHz; RAM: 2.0 GB; platform: Windows XP Professional with SP2.


**Figure 4.** Functional analysis results of the selected bicluster. Each column represents an enriched GO functional class, and the height of the column is proportional to the significance of this enrichment (i.e. height = -log(p-value)).

#### **2.1. GO overrepresentation programs**


Many programs, such as BINGO [27], FUNCAT [28], GeneMerge [29] and FuncAssociate [30], have been used to investigate whether the set of genes discovered by biclustering methods presents significant enrichment with respect to a specific GO annotation provided by the Gene Ontology Consortium [31]. BicAT-Plus uses the GeneMerge program, a widely used GO program. GeneMerge provides a statistical test for assessing the enrichment of each GO term in the sample test. The basic question answered by this test is, as described by Steven *et al.* [27]: "When sampling *X* genes (test set) out of *N* genes (reference set, either a graph or an annotation), what is the probability that *x* or more of these genes belong to a functional category *C* shared by *n* of the *N* genes in the reference set? The hypergeometric test, in which sampling occurs without replacement, answers this question in the form of a P-value. Its counterpart with replacement, the binomial test, provides only an approximate P-value, but requires less calculation time."
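The two tests named in the quotation can be written down directly. The sketch below is an illustration of the statistics only, not GeneMerge's own code; the function names are assumptions, and the symbols *N*, *n*, *X* and *x* follow the quotation. The bicluster size of 40 in the usage lines is a hypothetical value.

```python
from math import comb


def hypergeom_pvalue(N, n, X, x):
    """P(at least x of the X sampled genes fall in the category).

    N: genes in the reference set, n: reference genes in the category,
    X: genes in the test set (bicluster), x: test-set genes in the category.
    Sampling is without replacement.
    """
    total = comb(N, X)
    upper = min(n, X)
    hits = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, upper + 1))
    return hits / total


def binomial_pvalue(N, n, X, x):
    """Approximation of the same tail probability, sampling with replacement."""
    p = n / N
    return sum(comb(X, k) * p**k * (1 - p)**(X - k) for k in range(x, X + 1))


# Reference set of 2993 genes, a category of 5 annotated genes,
# and a hypothetical bicluster of 40 genes containing 4 of them.
print(hypergeom_pvalue(2993, 5, 40, 4))
print(binomial_pvalue(2993, 5, 40, 4))
```

A bicluster is then called enriched for the category when the returned P-value falls below the chosen significance level.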

#### **2.2. Comparative methodologies**

BicAT-Plus provides reasonable methods for comparing the results of different biclustering algorithms by:

- Identifying the percentage of enriched or overrepresented biclusters with one or more GO terms at multiple significance levels for each algorithm. A bicluster is said to be significantly overrepresented (enriched) with a functional category if the P-value of this functional category is lower than the preset threshold. The results are displayed using a histogram for all the algorithms compared at the different preset significance levels, and the algorithm that gives the highest proportion of enriched biclusters for all significance levels is considered the optimum because it effectively groups the genes sharing similar functions in the same bicluster.

- Estimating the predictive power of the algorithms to recover interesting patterns. Genes whose transcription is responsive to a variety of stresses have been implicated in a general yeast response to stress. Other gene expression responses appear to be specific to particular environmental conditions. BicAT-Plus compares biclustering methods on the basis of their capacity to recover known patterns in experimental data sets. For example, Gasch *et al.* [32] measured changes in transcript levels over time in response to a panel of environmental changes, so biclusters were expected to be enriched with Gene Ontology categories such as response to stress (GO:0006950), response to heat (GO:0009408), response to cold (GO:0009409) and response to glucose starvation (GO:0042149). The details of this comparison strategy are described in the results and in Table 3.

- Identifying the percentage of annotated genes per each enriched bicluster.

#### **2.3. Comparison Process Steps**

The process diagram shown in Fig 5 summarizes the steps required by the user to compare the different algorithms using BicAT-Plus:

1. Download BicAT-Plus from (*www.bioinformatics.org/bicat-plus/*).

2. Load the gene expression data into BicAT-Plus, then run the five selected prominent biclustering methods with the parameter settings shown in Table 2.

3. Run the GO comparison tool in BicAT-Plus and add the available biclustering algorithms to the compared list as shown in Fig 1.

4. Select the available GO category, e.g. biological process, molecular function and cellular components.

5. Select the P-values, e.g. 0.00001, 0.0001, 0.01, 0.005, and 0.05.

6. Press the compare button.

7. Open the comparison menu, choose Functional enrichment, and select 2D or 3D charts.

| Bi/clustering algorithm | Parameter settings |
|---|---|
| ISA | *tg* = 2.0, *tc* = 2.0, seeds = 500 |
| CC | *δ* = 0.5, *α* = 1.2, *M* = 100 |
| OPSM | *l* = 100 |
| BiVisu | *Ε* = 0.82, *Nr* = 10, *Nc* = 5, *Po* = 25 |
| K-means | K = 100 |

**Table 2.** Default parameter settings of the compared bi/clustering methods. The definitions of these parameters are listed in their original publications [9, 15, 17-18, 20] respectively.

#### **3. Results & discussion**

The above comparison steps were performed on the gene expression data of *S. cerevisiae* provided by Gasch [32]. The dataset contains 2993 genes and 173 conditions of diverse environmental transitions such as temperature shocks, amino acid starvation, and nitrogen source depletion. This dataset is freely available from the Stanford University website [33]. For each biclustering algorithm, we used the default parameters recommended by the authors in their corresponding publications (see Table 2).

**Figure 5.** BicAT-Plus Comparison process steps

#### **3.1. The percentage of enriched function**

After applying the above steps to the Gasch data [32], BicAT-Plus produced the histogram shown in Fig 6. From this figure, we observed that the OPSM algorithm gave a high proportion of functionally enriched biclusters at all significance levels (from 85% to 100%). After OPSM, ISA showed a relatively high proportion of enriched biclusters.
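The quantity plotted in such a histogram can be sketched as follows, assuming that each bicluster has already been reduced to the smallest P-value among its GO terms. The function name, the algorithm labels and the P-values below are hypothetical and are not results from the Gasch data; the significance levels are those listed in the comparison steps.

```python
def percent_enriched(best_pvalues, levels):
    """Percentage of biclusters whose best GO-term P-value is below each level."""
    if not best_pvalues:
        raise ValueError("at least one bicluster is required")
    total = len(best_pvalues)
    return {
        alpha: 100.0 * sum(1 for p in best_pvalues if p < alpha) / total
        for alpha in levels
    }


LEVELS = [0.00001, 0.0001, 0.005, 0.01, 0.05]

# Hypothetical best P-values per bicluster for two unnamed algorithms.
hypothetical_results = {
    "algorithm_A": [1e-7, 3e-5, 0.002, 0.03],
    "algorithm_B": [0.004, 0.02, 0.2, 0.6],
}
for name, pvalues in hypothetical_results.items():
    print(name, percent_enriched(pvalues, LEVELS))
```

Comparing the resulting dictionaries level by level gives the same reading as the bars of the histogram: an algorithm with higher percentages at every level groups functionally related genes more consistently.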

In order to evaluate the ability of the algorithms to group the maximum number of genes whose expression patterns are similar and which share the same GO category, we used the filtration criteria implemented in the comparative tool, neglecting those bi/clusters whose study fraction is less than 25%. The study fraction of a GO term is the fraction of genes in the study set (bicluster) annotated with this term.

$$\text{Study fraction of a GO term} = \frac{\text{No. of genes sharing the GO term in a bicluster}}{\text{Total number of genes in this bicluster}} \times 100$$
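As a minimal worked example of this formula and of the 25% filter, the sketch below uses hypothetical gene counts; the function names and the default cut-off argument are assumptions chosen for illustration.

```python
def study_fraction(genes_with_term, bicluster_size):
    """Study fraction (in percent) of a GO term within one bicluster."""
    if bicluster_size <= 0:
        raise ValueError("bicluster must contain at least one gene")
    return 100.0 * genes_with_term / bicluster_size


def passes_filter(genes_with_term, bicluster_size, min_fraction=25.0):
    """Keep a bicluster only if its study fraction reaches the cut-off."""
    return study_fraction(genes_with_term, bicluster_size) >= min_fraction


# Hypothetical bicluster of 40 genes, 12 of which share the GO term.
print(study_fraction(12, 40))
print(passes_filter(12, 40))
print(passes_filter(6, 40))
```

With these counts the first bicluster has a study fraction of 30% and is kept, whereas a bicluster with only 6 of 40 genes sharing the term (15%) is neglected.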

Figure 7 shows that OPSM and ISA have highly enriched biclusters/clusters with a large number of genes per GO category. On the other hand, the Bivisu biclusters are strongly affected by this filtration and contain a lower number of genes per category. This filtration helps to identify the most powerful and reliable algorithms, which are able to group the maximum number of genes sharing the same functions in one bicluster.

#### **3.2. The predictive power to recover patterns of interest**

The user can compare bi/clustering algorithms according to which of them recover a defined pattern, for example which of them recover bi/clusters that respond to the conditions applied in the Gasch experiments. Table 3 summarizes the differences between the contents of the biclusters/clusters.

| GO term | Category (number of annotated genes) | K-means | CC | ISA | Bivisu | OPSM |
|---|---|---|---|---|---|---|
| GO:0042493 | response to drug (118) | 4 | 5 | 7 | 6 | 0 |
| GO:0006970 | response to osmotic stress (83) | 3 | 5 | 6 | 3 | 0 |
| GO:0006979 | response to oxidative stress (79) | 2 | 7 | 11 | 0 | 0 |
| GO:0046686 | response to cadmium ion (102) | 2 | 3 | 2 | 2 | 0 |
| GO:0043330 | response to exogenous dsRNA (7) | 2 | 3 | 2 | 2 | 0 |
| GO:0046685 | response to arsenic (77) | 2 | 0 | 2 | 2 | 0 |
| GO:0006950 | response to stress (532) | 9 | 11 | 16 | 2 | 0 |
| GO:0009408 | response to heat (24) | 3 | 0 | 2 | 2 | 0 |
| GO:0009409 | response to cold (7) | 0 | 0 | 2 | 0 | 0 |
| GO:0009267 | cellular response to starvation (44) | 0 | 2 | 0 | 0 | 0 |
| GO:0006995 | cellular response to nitrogen starvation (5) | 4 | 4 | 4 | 0 | 0 |
| GO:0042149 | cellular response to glucose starvation (5) | 0 | 2 | 0 | 0 | 0 |
| GO:0009651 | response to salt stress (15) | 2 | 7 | 0 | 0 | 0 |
| GO:0042542 | response to hydrogen peroxide (5) | 0 | 0 | 0 | 2 | 0 |
| GO:0006974 | response to DNA damage stimulus (240) | 0 | 22 | 0 | 3 | 0 |
| GO:0000304 | response to singlet oxygen (4) | 2 | 0 | 0 | 0 | 0 |

**Table 3.** Gene Ontology category per number of annotated genes of the bicluster/cluster algorithm results for the experimental conditions of the Gasch experiments [32].

**Figure 6.** Percentage of biclusters significantly enriched by GO Biological Process category (*S. cerevisiae*) for the five selected biclustering methods and K-means at different significance levels p.

**Figure 7.** Percentage of significantly enriched biclusters by GO Biological Process category when the allowed minimum number of genes per GO category is set to 10 and the study fraction to larger than 50%.

Although OPSM shows a high percentage of enriched biclusters (as shown in Fig 6 and Fig 7), its biclusters do not contain any genes annotated with the GO categories related to the conditions of the Gasch experiments. The K-means and Bivisu cluster/bicluster results each distinguished a unique GO category: GO:0000304 (response to singlet oxygen) and GO:0042542 (response to hydrogen peroxide), respectively. The power of these bi/clustering algorithms is most apparent for GO:0006995 (cellular response to nitrogen starvation), where K-means, CC and ISA were each able to discover 4 out of 5 annotated genes without any prior biological information or laboratory experiments.
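The "4 out of 5" figure is a simple recovery ratio, and the same arithmetic can be applied to any row of Table 3. The sketch below copies three rows from Table 3 and assumes, as the paragraph above does, that each count is the number of annotated genes recovered by an algorithm; the dictionary layout and function name are illustrative choices.

```python
# Three rows copied from Table 3: annotated genes and genes found per algorithm.
TABLE3_ROWS = {
    "GO:0006995": {"annotated": 5, "K-means": 4, "CC": 4, "ISA": 4, "Bivisu": 0, "OPSM": 0},
    "GO:0009408": {"annotated": 24, "K-means": 3, "CC": 0, "ISA": 2, "Bivisu": 2, "OPSM": 0},
    "GO:0000304": {"annotated": 4, "K-means": 2, "CC": 0, "ISA": 0, "Bivisu": 0, "OPSM": 0},
}
ALGORITHMS = ["K-means", "CC", "ISA", "Bivisu", "OPSM"]


def recovered_percent(row, algorithm):
    """Share (in percent) of a category's annotated genes found by one algorithm."""
    return 100.0 * row[algorithm] / row["annotated"]


for go_id, row in TABLE3_ROWS.items():
    summary = {name: round(recovered_percent(row, name), 1) for name in ALGORITHMS}
    print(go_id, summary)
```

For GO:0006995 this gives 80% for K-means, CC and ISA and 0% for Bivisu and OPSM, which is the contrast the discussion draws on.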


#### **4. Conclusion**

We have introduced BicAT-Plus, with a reasonable comparative methodology based on the Gene Ontology. To the best of our knowledge, such an automatic tool for comparing the various biclustering algorithms has not been available in the literature. BicAT-Plus is an open source tool written in Java Swing, and it has a well-structured design that can be extended easily to employ more comparative methodologies, helping biologists to extract the best results of each algorithm and to interpret these results in biologically meaningful terms.

In other words, the algorithms that show good quality of results (per the dataset) can be used to provide a simple means of gaining leads to the functions of many genes for which information is not currently available (unannotated genes).

Using BicAT-Plus, we can identify the highly enriched biclusters across all of the compared algorithms. This might be quite helpful in solving the dimensionality reduction problem of Gene Regulatory Network construction from gene expression data. This problem originates from the relatively few time points (conditions or samples) with respect to the large number of genes in the microarray dataset.

Finally, several aspects of this research are worth further investigation, based on the studies carried out so far and on new ideas for consideration:

1. Enrich BicAT-Plus with more comparative methodologies besides GO, for example KEGG and promoter analysis that identifies the transcription factors for the clustered genes.

2. Extend BicAT-Plus to provide users with multiple export options for the enriched biclusters of interest.

3. Embed BicAT-Plus as a plug-in in the Cytoscape platform [34], which is open source bioinformatics software for visualizing molecular interaction networks and biological pathways and integrating these networks with annotations, gene expression profiles and other state data. A very promising challenge is thus to make use of the highly enriched biclusters identified by BicAT-Plus in solving these integrated networks in Cytoscape.

### **Author details**

Fadhl M. Al-Akwaa *Biomedical Eng. Dept., Univ. of Science & Technology, Sana'a, Yemen*

#### **5. References**

[1] Fung G: A Comprehensive Overview of Basic Clustering Algorithms. *Citeseer* 2001:1-37.

[2] Sharan R, Elkon R, Shamir R: Cluster analysis and its applications to gene expression data. *Ernst Schering Res Found Workshop* 2002:83-108.

[3] Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. *Proceedings of the National Academy of Sciences of the United States of America* 1998, 95:14863-14868.

[4] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. *Proceedings of the National Academy of Sciences of the United States of America* 1999:2907-2912.

[5] Sharan R, Shamir R: Click: a clustering algorithm for gene expression analysis. In: *Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology* 2000:307-316.

[6] Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. *Nature Genetics* 1999, 22:281-285.

[7] Johnson S: Hierarchical clustering schemes. *Psychometrika* 1967, 32(3):241-254.

[8] Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. *IEEE/ACM Trans Comput Biol Bioinform* 2004, 1(1):24-45.

[9] Cheng KO, Law NF, Siu WC, Lau TH: BiVisu: software tool for bicluster detection and visualization. *Bioinformatics* 2007, 23(17):2342-2344.

[10] Liu X, Wang L: Computing the maximum similarity bi-clusters of gene expression data. *Bioinformatics* 2007, 23(1):50-56.

[11] Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. *Bioinformatics* 2006, 22(9):1122-1129.

[12] Tchagang A, Tewfik A: Robust biclustering algorithm (ROBA) for DNA microarray data analysis. In: *IEEE/SP 13th Workshop on Statistical Signal Processing* 2005:984-989.

[13] Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. In: *Pac Symp Biocomput* 2003:77-88.

[14] Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. *Proceedings of the National Academy of Sciences of the United States of America* 2004:2981-2986.

[15] Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. *Journal of Computational Biology* 2003, 10:373-384.

[16] Wang H, Wang W, Yang J, Yu PS: Clustering by Pattern Similarity: the pCluster Algorithm. *SIGMOD* 2002.

[17] Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. *Nature Genetics* 2002, 31:370-377.

[18] Cheng Y, Church GM: Biclustering of expression data. *Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology* 2000:93-103.

[19] Hartigan J: Direct Clustering of a data matrix. *Journal of the American Statistical Association* 1972, 67:123-129.

[20] Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. *Bioinformatics* 2004, 20:1993-2003.


[21] Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture. *Nature Genetics* 1999, 22:281-285.

**Chapter 4** 

© 2012 Tenea and Burlibasa, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

© 2012 Tenea and Burlibasa, licensee InTech. This is a paper distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,

distribution, and reproduction in any medium, provided the original work is properly cited.

**RNAi Towards Functional Genomics Studies** 

RNA interference is an evolutionarily conserved mechanism that uses short antisense RNAs that are generated by 'dicing' dsRNA precursors to target corresponding mRNAs for cleavage. Pioneering observations on RNAi were reported in plants, but later on RNAirelated events were described in almost all eukaryotic organisms, including protozoa, flies, nematodes, insects, parasites, and mouse and human cell lines [1, 2]. Called initial cosupression or PTGS (post-transcriptional gene silencing), RNAi was first discovered in transgenic petunia plants [3]. In order to increase the pigmentation the chalcone synthase (CHS) gene was over-expressed in petunia plants and instead of enhancing in the flower pigmentation an opposite effect was observed. Some of the flowers were completely lacked of pigmentation and others showed different degrees of pigmentation. It was shown that even though an extra copy of the transgene was present, the CHS mRNA levels were strongly reduced in the white sectors. It was suggested that interaction between transgenes and native transcripts triggers a mechanisms that leads to the destruction of both transcripts or to the obstruction of the translation process and to gene silencing. This phenomenon was called co-suppression because the extra copies of CHS transgene determined reduction of its

Later on, other study in the field of virus resistance was being exploited in order to produce virus resistance plants. Using different viral systems it has been shown that the expression of viral genes in the target plant genome was not associated with resistance to that particular virus [4-6]. The virus resistance in the recovered plants correlated with reduction of transgene mRNA in the cytoplasm, these phenomena was also called co-suppression. The finding provided supporting evidence of plant natural response to viral infection that the recovered parts of this plants response to virus would not only be resistant to initially inoculated virus but also cross-protect the plants against other viruses carrying homologous sequences [7]. This phenomenon was later called VIGS (virus-induced gene silencing). Further work found that the transcripts produced from both loci have been degraded in the

Gabriela N. Tenea and Liliana Burlibasa

http://dx.doi.org/10.5772/47762

**1. Introduction** 

Additional information is available at the end of the chapter

own expression but also the endogenous gene expression.


## **RNAi Towards Functional Genomics Studies**

Gabriela N. Tenea and Liliana Burlibasa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47762

#### **1. Introduction**

66 Functional Genomics

7(1):280.

2005, 21(16):3448-3449.

32(18):5539-5545.

[21] Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of

[22] Guthke R, Moller U, Hoffmann M, Thies F, Topfer S: Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection.

[23] D'haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression

[24] Reiss D, Baliga N, Bonneau R: Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks. *BMC Bioinformatics* 2006,

[25] Yip KYaQ, Peishen and Schultz, Martin and Cheung, David W and Cheung, Kei-Hoi: SemBiosphere: A Semantic Web Approach to Recommending Microarray Clustering

[26] Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis

[27] Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. *Bioinformatics* 

[28] Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M *et al*: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. *Nucl Acids Res* 2004,

[29] Castillo-Davis CI, Hartl DL: GeneMerge - post-genomic analysis, data mining, and

[30] Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with

[31] Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J *et al*: Gene ontology: tool for the unification of biology. The Gene

[32] Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic Expression Programs in the Response of Yeast Cells to

[34] Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular

genetic network architecture. *Nature Genetics* 1999, 22:281-285.

clustering to reverse engineering. *Bioinformatics* 2000, 16(8):707-726.

Services. In: *The Pacific Symposium on Biocomputing*. 2006: 188-199.

*Bioinformatics* 2005, 21(8):1626-1634.

toolbox. *Bioinformatics* 2006, 22(10):1282-1283.

hypothesis testing. *Bioinformatics* 2003, 19(7):891 - 892.

FuncAssociate. *Bioinformatics* 2003, 19(18):2502-2504.

Ontology Consortium. *Nat Genet* 2000, 25:25 - 29.

[33] http://genome-www.stanford.edu/yeast/\_stress

Environmental Changes. *Mol Biol Cell* 2000, 11(12):4241-4257.

interaction networks. *Genome Res* 2003, 13(11):2498-2504.

RNA interference (RNAi) is an evolutionarily conserved mechanism that uses short antisense RNAs, generated by 'dicing' dsRNA precursors, to target corresponding mRNAs for cleavage. Pioneering observations on RNAi were reported in plants, but RNAi-related events were later described in almost all eukaryotic organisms, including protozoa, flies, nematodes, insects, parasites, and mouse and human cell lines [1, 2]. Initially called co-suppression or PTGS (post-transcriptional gene silencing), RNAi was first discovered in transgenic petunia plants [3]. To increase pigmentation, the chalcone synthase (CHS) gene was over-expressed in petunia plants, but instead of enhanced flower pigmentation the opposite effect was observed: some of the flowers completely lacked pigmentation and others showed different degrees of pigmentation. It was shown that even though an extra copy of the transgene was present, CHS mRNA levels were strongly reduced in the white sectors. It was suggested that the interaction between transgenes and native transcripts triggers a mechanism that leads to the destruction of both transcripts, or to the obstruction of translation, and thus to gene silencing. This phenomenon was called co-suppression because the extra copies of the CHS transgene reduced not only their own expression but also the expression of the endogenous gene.

Later, studies in the field of virus resistance were exploited in order to produce virus-resistant plants. Using different viral systems, it was shown that the expression of viral genes in the target plant genome was not associated with resistance to that particular virus [4-6]. Virus resistance in the recovered plants correlated with a reduction of transgene mRNA in the cytoplasm; this phenomenon was also called co-suppression. The finding provided supporting evidence of a natural plant response to viral infection: the recovered parts of the plant were not only resistant to the initially inoculated virus but were also cross-protected against other viruses carrying homologous sequences [7]. This phenomenon was later called VIGS (virus-induced gene silencing). Further work found that the transcripts produced from both loci were degraded in the cytoplasm. In this case, activation of PTGS was thought to be due to the production of aberrant dsRNA by the transgene, which results in silencing of the mRNA [8].


In the fungus *Neurospora crassa*, it was shown that an overexpressed transgene could induce gene silencing at the post-transcriptional level, a phenomenon called "quelling" [9]. A few years later, the efficiency of injecting single-stranded antisense RNA as a method of gene silencing in the nematode *Caenorhabditis elegans*, relying on its ability to hybridize with endogenous mRNA and inhibit translation, was investigated [10]. The discovery that introducing dsRNA was more effective at inhibiting the target gene led to the conclusion that both of the original single-stranded sense and antisense samples may have been contaminated with dsRNA. Similarities between plants and nematodes were subsequently recognized, and the term RNAi was adopted in both systems [11]. Initial RNAi experiments, in which dsRNA was introduced into the cytoplasm, were successful in plants and nematodes. In mammalian cells, however, similar techniques resulted in initiation of the interferon response and cell death before processing could occur. This remained the case until Elbashir and co-workers [12] reported an alternative method: siRNAs shorter than 30 base pairs, introduced into mammalian cells, avoided the interferon response and activated the RISC (RNA-induced silencing complex), leading to mRNA destruction.

The importance of the discovery of RNAi by Fire and Mello was acknowledged in 2006 with the Nobel Prize in Physiology or Medicine. Shortly after this discovery, dsRNAs were found to induce similar gene silencing in a variety of other organisms: the fruit fly *Drosophila* [13], the zebrafish *Danio rerio* [14], the cnidarian *Hydra magnipapillata* [15], and some plant species [16-17]. Many experiments have shown that an intermediate in the RNAi process, called short interfering RNA (siRNA), might be effective in degrading mRNA in mammalian cells [18-19]. Nonetheless, it was still not believed that RNAi could work in humans, because long dsRNAs, larger than 30 base pairs in length, induce a cellular response (e.g. the interferon response). The first evidence that RNAi functions in humans came from experiments performed by two groups of researchers. Kreutzer and Limmer (2000) [20] demonstrated that short fragments of dsRNA might mediate the RNAi response triggered by long dsRNAs, as observed by Fire and Mello. Although these findings were not published, a key patent around this discovery and a company focused on the development and commercialization of RNAi therapeutics were established (www.huntington-assoc.com). At the same time, Elbashir and co-workers [21] found that synthetic short dsRNA molecules result in potent RNAi gene silencing in mammalian cells without inducing the interferon response.

The knowledge accumulated from RNAi studies opened up enormous potential for its use as a tool in functional genomics studies in both plant and animal systems. In recent years, numerous strategies have been developed for targeted gene silencing, and combinations of approaches have enhanced the manipulation of gene silencing for functional genomics studies.

#### **2. The molecular mechanism of RNA interference in eukaryote system**

The RNA silencing mechanism was first recognized as an antiviral mechanism that protects organisms from RNA viruses or prevents random integration of transposable elements [22-25, 26]. In the last few years, important insights into the molecular mechanism of RNAi have been gained through the identification and characterization of the central players of the core RNAi pathway. Extensive genetic and biochemical studies in various species have yielded a common model of RNA silencing in which dsRNA acts as the trigger. Using genetic screening analyses performed in several organisms, such as the fungus *Neurospora crassa*, the alga *Chlamydomonas reinhardtii*, the nematode *Caenorhabditis elegans*, and the plant *Arabidopsis thaliana*, several host-encoded proteins involved in gene silencing, as well as the essential enzymes or factors common to this process, have been identified [27-28]. The molecular mechanism of RNA silencing involves several steps, and a key step is the processing of dsRNA precursors into short RNA duplexes [29-30].

#### **2.1. The core RNAi mechanism**


#### *2.1.1. Processing the dsRNA precursors*

In the initiator step, the enzyme Dicer (an RNase III-like enzyme) chops dsRNA into small pieces called short interfering RNAs (siRNAs), which are around 21-24 nucleotides in length [12, 21, 26, 31-33]. Dicer and Drosha, proteins known for their catalytic RNase III and dsRNA-binding domains, catalyze the maturation of small RNAs. miRNAs are transcribed as long primary transcripts, which are processed by Drosha in the nucleus. Nuclear transport occurs through nuclear pore complexes, which are large proteinaceous channels embedded in the nuclear membrane. The miRNA precursor is further transported to the cytoplasm by means of the nuclear export receptor exportin-5. Following their export from the nucleus, pre-miRNAs are processed by the cytoplasmic Dicer, which yields RNA duplexes of 21 nucleotides in length, with 5' phosphates and 2-nucleotide 3' overhangs. Numerous Dicer proteins have been identified in plant as well as animal systems, and each Dicer preferentially processes dsRNAs that come from a different source [26].
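The geometry of the Dicer products described above (21-nucleotide strands, each with a 2-nucleotide 3' overhang) can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not a description of the enzyme: the input is assumed to be a perfectly paired, blunt-ended dsRNA given by its sense strand, cuts are assumed to fall at perfectly regular 21-nucleotide intervals, and the function names `dice` and `reverse_complement` are hypothetical.

```python
# Toy model of dicing; real Dicer cleavage is not perfectly regular.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def dice(sense: str, size: int = 21, overhang: int = 2) -> list[tuple[str, str]]:
    """Cut a perfectly paired dsRNA into siRNA-like duplexes.

    `sense` is the sense strand (5'->3'); the antisense strand is assumed
    to be its exact reverse complement. Each returned pair holds a top
    (sense) and a bottom (antisense) strand of `size` nucleotides, both
    written 5'->3', sharing size - overhang paired positions and carrying
    an `overhang`-nucleotide 3' overhang each.
    """
    duplexes = []
    start = 0  # left edge of the bottom-strand fragment, in sense coordinates
    while start + size + overhang <= len(sense):
        # The top strand is shifted by `overhang` so that its 3' end protrudes.
        top = sense[start + overhang : start + overhang + size]
        # The bottom strand pairs with sense[start : start + size].
        bottom = reverse_complement(sense[start : start + size])
        duplexes.append((top, bottom))
        start += size
    return duplexes


if __name__ == "__main__":
    long_dsrna_sense = "AUGC" * 11  # hypothetical 44-nt sense strand
    for top, bottom in dice(long_dsrna_sense):
        print("top   5'-" + top + "-3'")
        print("bottom 5'-" + bottom + "-3'")
```

In this model the first 19 nucleotides of each top strand pair with the first 19 nucleotides of the corresponding bottom strand, and the last two nucleotides of each strand form the 3' overhang.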

#### *2.1.2. RNA silencing effector complex assembly*

siRNAs are loaded into the effector protein complex to form an RNA-induced silencing complex (RISC). Subsequently, the siRNA duplex within RISC unwinds, exposing the antisense (guide) strand and thus activating the RISC. Usually, effector complexes containing siRNAs are known as RISCs, while those containing miRNAs are known as miRNPs. For example, in *Arabidopsis thaliana* the rasiRNA-containing effector complexes are known as RITSs. All RISCs and miRNPs have a member of the Argonaute (Ago) family of proteins attached to them. RISCs and miRNPs differ in size and composition, depending on the organism of origin. Further studies aimed at identifying more specific and active siRNA duplexes for guiding mRNA cleavage revealed that the sequence of the siRNA duplex has a significant impact on the ratio of sense and antisense siRNAs entering the RISC [26]. Different numbers of Ago proteins have been identified in different organisms: *Arabidopsis thaliana* has ten members, *D. melanogaster* has five members, and humans have eight members of the Ago protein family. Only a small number of these proteins have had their function characterized [26].

#### *2.1.3. mRNA cleavage and repression of translation*

After the formation of the RISC, the siRNAs within it guide sequence-specific degradation of complementary or near-complementary mRNAs [26]. The RISC cleaves the mRNA in the middle of its complementary region. The cleavage does not require ATP; however, multiple cleavages are more efficient in the presence of ATP. RISC and miRNP complexes work by catalyzing hydrolysis of the phosphodiester linkage of the target RNA [26]. The mechanisms by which miRNA (micro-RNA)-guided repression of translation and mRNA cleavage operate are not fully understood. The first evidence of translational repression was described in *C. elegans* mutants, where specifically targeted miRNAs reduced protein synthesis without affecting mRNA levels. It has been suggested that miRNAs affect translation termination or elongation rather than the initiation of the process. In addition, it has been shown that miRNAs can act as siRNAs and vice versa. Further investigations suggested that mRNA degradation and translational regulation guided by miRNAs could be used as simultaneous mechanisms for natural regulation [26]. Notably, siRNAs can also move from cell to cell and systemically spread and deliver the silencing signal to the entire organism [34].
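The statement that the mRNA is cut in the middle of the region complementary to the siRNA can be made concrete with a small Python sketch. Several assumptions are labeled here and in the comments: only perfect complementarity is modeled (near-complementary targets are ignored), the cut is placed between the target nucleotides paired with guide positions 10 and 11 counted from the guide's 5' end, which is the position commonly reported for Argonaute-mediated slicing, and the names `find_cleavage_site` and `cleave` are hypothetical.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def find_cleavage_site(mrna: str, guide: str) -> Optional[int]:
    """Return the index at which the mRNA would be cut, or None.

    Assumption: the guide must pair perfectly with the target site.
    The returned index `cut` splits the mRNA into mrna[:cut] and mrna[cut:].
    """
    target = reverse_complement(guide)  # target site as read on the mRNA, 5'->3'
    pos = mrna.find(target)
    if pos == -1:
        return None
    # Guide position g (1-based, from the guide 5' end) pairs with mRNA index
    # pos + len(guide) - g. Cutting between the partners of g = 11 and g = 10
    # therefore means cutting just before index pos + len(guide) - 10.
    return pos + len(guide) - 10


def cleave(mrna: str, guide: str) -> tuple:
    """Return the two cleavage fragments, or the intact mRNA if no site exists."""
    cut = find_cleavage_site(mrna, guide)
    if cut is None:
        return (mrna,)
    return (mrna[:cut], mrna[cut:])


if __name__ == "__main__":
    guide = "UUCGAUGCAAGCUAGGCUAUC"  # hypothetical 21-nt guide strand
    mrna = "GGGAA" + reverse_complement(guide) + "AAGGG"  # hypothetical target
    print(cleave(mrna, guide))
```

For a 21-nucleotide guide the cut falls 11 nucleotides into the 21-nucleotide target site, that is, close to its center.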




#### **2.2. RNA silencing pathways in mammals**

The 21-nucleotide miRNAs derive from dsRNA-like hairpin regions of about 70 nucleotides within the primary transcript [35]. First, cleavage of the pri-miRNA in the nucleus by the RNase III enzyme Drosha releases the stem-loop (or hairpin), and this precursor (pre-miRNA) is subsequently exported to the cytoplasm. The end of the stem of the pre-miRNA has the same characteristic 5' and 3' termini as siRNAs. In the cytoplasm, a Dicer enzyme makes a pair of cuts that liberates a 21-nucleotide RNA duplex. As with siRNA duplexes, the strand whose 5' end is less stably paired is used as the guide (miRNA) strand [36]. The miRNA and RNAi pathways share the same core machinery, but different specializations exist in various animal species. MicroRNAs and other non-coding RNAs have been a major breakthrough in epigenetics in recent years and have been found to contribute to almost all biological pathways, including gametogenesis, early development and cell signaling. Although this RNA gene silencing pathway is used by both siRNAs and miRNAs, there are some important differences.
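The guide-strand rule mentioned above (the strand whose 5' end is less stably paired becomes the guide) can be sketched as a simple comparison of the two duplex ends. This is a deliberately crude proxy, not a thermodynamic calculation: it assumes a perfectly paired duplex, scores each of the first four base pairs at a strand's 5' end by its number of hydrogen bonds (3 for G-C, 2 for A-U), and ignores nearest-neighbor effects. The window size and the names `end_stability` and `pick_guide` are assumptions made for illustration.

```python
def end_stability(strand: str, window: int = 4) -> int:
    """Crude pairing-stability score for the 5' end of one duplex strand.

    Assumes every nucleotide in the window is Watson-Crick paired, so the
    base identity alone determines the pair: G or C counts 3, A or U counts 2.
    """
    return sum(3 if nt in "GC" else 2 for nt in strand[:window])


def pick_guide(top: str, bottom: str, window: int = 4) -> str:
    """Return which strand ('top', 'bottom' or 'either') has the weaker 5' end."""
    top_score = end_stability(top, window)
    bottom_score = end_stability(bottom, window)
    if top_score < bottom_score:
        return "top"
    if bottom_score < top_score:
        return "bottom"
    return "either"


if __name__ == "__main__":
    # Hypothetical duplex, both strands written 5'->3' with 2-nt 3' overhangs.
    top = "AUAUCGGCGGCCGGCGCGCUU"
    bottom = "GCGCGCCGGCCGCCGAUAUUU"
    print(pick_guide(top, bottom))
```

A more faithful implementation would replace the hydrogen-bond count with nearest-neighbor free energies, but the comparison between the two 5' ends would have the same structure.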

Comparison of *Drosophila, C. elegans* and humans has revealed that homologous Drosha enzymes catalyze the first processing step of their miRNA pathways [35]. Nonetheless, these three species show more variation with respect to the functional roles of Dicer enzymes. In *Drosophila,* two distinct enzymes are responsible for pre-miRNA cleavage and siRNA production [26]. In *C. elegans* and humans, only one Dicer enzyme is present, which performs both cleavage functions [26].

In humans and other vertebrates, the main RNA silencing pathway seems to involve miRNAs, because the existence of an immune response against long dsRNA suggests that the processing pathway for this type of RNAi trigger is less important [37]. In humans, four Ago proteins (Ago1-4) have been identified [38]. A study of the assembly of human RISC revealed that miRNA processing and Ago2-mediated target-RNA cleavage are functionally coupled [39]. The demonstration of physical and functional coupling of pre-miRNA processing and target-RNA cleavage provides an explanation for earlier observations that 27-nucleotide dsRNAs and short hairpin RNAs (shRNAs) are considerably more potent triggers of RNAi than duplex siRNAs [40-41].

Animal miRNAs may act combinatorially, with several miRNAs binding a single transcript [40]. In addition, experiments performed by Doench and co-workers (2003) [42] suggested that multiple miRNAs could act cooperatively, reducing mRNA translation by more than the sum of their individual effects. The high number of putative target genes indicates that miRNAs function in a broad range of biological processes. Until now, the functions of only a few miRNAs have been analyzed in animals, but these studies have already revealed important roles for miRNAs in the control of cell division, differentiation and apoptosis [43], and in several developmental processes such as morphogenesis, neurogenesis and developmental timing [44, 45]. In addition, antisense RNA has been implicated in imprinting and X inactivation.

Studies of miRNA processing have also provided information that could improve scientists' ability to design more efficient RNAi-inducing RNA molecules for experimental and therapeutic applications [46]. Krutzfeldt and co-workers [47] identified a novel class of chemically engineered oligonucleotides named "antagomirs", which silenced miRNAs in a murine model. Antagomirs are cholesterol-conjugated single-stranded RNA molecules 21-23 nt in length and complementary to the mature target miRNA [48]. These oligonucleotides can be designed to bind specifically to the miRNA-RISC complex and inhibit its function. By silencing the miRNA, antagomirs alter the translation of the proteins it regulates [48]. These molecules have been shown to be very stable *in vivo*: after a single intravenous injection they can silence the target miRNA in the liver, lung, intestine, heart, skin and bone marrow for more than a week.
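The complementarity requirement described above can be illustrated with a minimal Python sketch. It only derives the antisense RNA sequence for a mature miRNA and checks the 21-23 nt length window; the cholesterol conjugation and any other chemical modifications are not modelled. The function names and the example sequence are illustrative assumptions and are not taken from the cited studies.

```python
# Minimal sketch: derive the sequence of an antisense oligonucleotide
# that is fully complementary to a mature miRNA (both written 5'->3').
# Chemical modifications of real antagomirs are NOT represented here.

RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(seq.upper()))


def antisense_for_mirna(mirna: str, min_len: int = 21, max_len: int = 23) -> str:
    """Return the antisense sequence if the miRNA length is within the window."""
    if not min_len <= len(mirna) <= max_len:
        raise ValueError(f"expected {min_len}-{max_len} nt, got {len(mirna)}")
    return reverse_complement_rna(mirna)


# Example 22-nt sequence, used here purely for illustration.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
antisense = antisense_for_mirna(example_mirna)
print(antisense)
```

Taking the reverse complement twice returns the original sequence, which is a quick sanity check that the two strands can pair along their full length.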

#### **2.3. RNA silencing pathways in plants**




RNA gene silencing was discovered in plants as a mechanism whereby invading nucleic acids, such as transgenes and viruses, are silenced through the action of small (20-26 nt) homologous RNA molecules [3]. Crucial to understanding the gene silencing mechanism is knowing how to trigger it, both from the theoretical perspective of understanding a remarkable biological response and for the practical use of silencing as an experimental tool.

This process is initially triggered by dsRNA, which can be introduced experimentally or arise from endogenous transposons, replicating RNA viruses, or the transcription of transgenes, as shown in Figure 1. In brief, double-stranded RNAs generated through aberrant gene expression from a foreign gene, virus infection or tandem repeat sequences due to insertion of transposons/retrotransposons are digested into 21-25-nucleotide siRNAs by Dicer. These siRNAs function as templates for the targeted degradation of mRNA in RISC and also act as primers for RdRP to amplify secondary dsRNA [49]. As in mammalian systems, numerous components of the RNAi machinery have been identified and characterized in plants. For example, Argonaute proteins play an important role in RNA silencing in plants because they are components of the silencing effector complexes that bind to siRNAs and miRNAs. Dicer proteins are required for miRNA biogenesis. Unlike in animal systems, miRNAs in plants pair more extensively with their target RNAs and use RNA cleavage rather than translational suppression as the primary silencing mechanism [35].


**Figure 1.** RNA-mediated gene-silencing pathway. Double-stranded RNA is digested into siRNAs by Dicer; these siRNAs function as templates for the targeted degradation of mRNA in RISC. siRNA also acts as the primer for RdRP to amplify secondary dsRNA. RISC (RNA-induced silencing complex); Dicer (RNase III-like RNase); RdRP (RNA-dependent RNA polymerase)

In plants, the RNAi process involves several pathways [32]. Diverse biological roles of these pathways have been established, including viral defense, regulation of gene expression and the condensation of chromatin into heterochromatin. The first pathway of RNA silencing, called cytoplasmic siRNA silencing, is a mechanism in which the dsRNA trigger can be a replication intermediate or a secondary-structure feature of single-stranded viral RNA, and it may be important in virus-infected plant cells. The sources of dsRNA include replication intermediates of plant RNA viruses, transgenic inverted repeats, and products of RNA-dependent RNA polymerases (RdRPs). The dsRNA may also be formed by annealing of overlapping complementary transcripts [30].

Silencing of endogenous messenger RNAs by miRNAs is the second pathway of silencing in plants. These miRNAs negatively regulate gene expression by base pairing to specific mRNAs, resulting in RNA cleavage, arrest of protein translation, or both. Like siRNAs, the miRNAs are short 21-24-nucleotide RNAs derived by Dicer cleavage of a precursor [32]. In plants, the prototype miRNAs were identified as a subset of the short RNA population; they are derived from an inverted repeat precursor RNA with partially double-stranded regions and target a complementary single-stranded mRNA.

The third pathway of RNA silencing in plants is associated with DNA methylation and suppression of transcription. This type of silencing was revealed in plants by the discovery that transgene and viral RNAs guide DNA methylation to specific nucleotide sequences. More recently, these findings have been extended by the observation that siRNA-directed DNA methylation in plants is linked to histone modification [30]. An important role of RNA silencing at the chromatin level is probably to protect the genome against damage caused by transposons.

#### **2.4. Types of small RNAs in eukaryotes**

#### *2.4.1. siRNAs*


Small interfering RNAs (siRNAs) are 22-nt fragments that bind to the complementary portion of their target mRNA and tag it for degradation. siRNAs confer viral resistance and secure genome stability by preventing transposon hopping. RNAi mechanisms in which siRNAs are involved keep chromatin condensed, suppress transcription, repress protein synthesis and regulate the development of organisms.

#### *2.4.2. miRNAs*

miRNAs are integral components of the genetic program and account for approximately 5% of the predicted genes in plants, worms and vertebrates. Loci encoding these miRNAs are located in the introns of protein-coding genes or in noncoding regions of the genome [50]. miRNAs help regulate gene expression, particularly during development [51]. The phenomenon of RNA interference, broadly defined, includes the endogenously induced gene silencing effects of miRNAs as well as silencing triggered by foreign dsRNA. Mature miRNAs are structurally similar to siRNAs produced from exogenous dsRNA, but before reaching maturity, miRNAs must first undergo extensive post-transcriptional modification.

A miRNA is expressed from a longer RNA-coding locus as a primary transcript called pri-miRNA, which is processed in the nucleus to a 70-nucleotide hairpin structure known as pre-miRNA. The small miRNAs are processed from these larger hairpin precursors by an RNAi-like machinery.


The first miRNA, lin-4, was discovered in *C. elegans* five years prior to the demonstration of dsRNA as an inducer of RNAi [52]. A short non-coding transcript from lin-4 represses the expression of lin-14, a gene encoding a nuclear protein, as part of the control of developmental timing. The existence of partial complementarity between the small lin-4 RNA and several elements in the 3' untranslated region (UTR) of the lin-14 mRNA suggested a mechanism of translational inhibition via an antisense RNA-RNA interaction. In this context, miRNAs were shown to compose a large class of ribo-regulators [36, 53-54]. At the same time, it was demonstrated that Dicer converts pre-miRNAs into mature miRNAs of approximately the same length as single-stranded siRNAs, establishing a formal connection between miRNAs and siRNAs [55-56]. Other studies have revealed the complete pathway of miRNA processing in animals, which is based on two steps catalysed by the RNase III enzymes Drosha and Dicer. The mature miRNAs of animals generally regulate their target genes by translational repression, but some cases of target mRNA cleavage have also been reported [35, 57]. This is in contrast with the situation in plants, in which target mRNA cleavage appears to be the main mechanism [58].

With the discovery of the first miRNA, lin-4, interest in the role of miRNAs in the regulation of fundamental biological processes has rapidly emerged. Now, more than 18226 entries representing hairpin precursor miRNAs, expressing 21643 mature miRNA products, in 168 species are tabulated in the miRNA registry (http://microrna.sanger.ac.uk). Among them, more than 300 miRNAs have been discovered in humans [46]. In mammals, about one-half of the known miRNAs are located within the transcription units of other genes and share a single primary transcript [59-60]. These miRNAs generally reside in introns or in exon sequences that are not protein coding. The expression patterns of miRNAs vary. While some *C. elegans* and *Drosophila* miRNAs are expressed in all cells and at all developmental stages, others have a more restricted spatial and temporal expression pattern. This suggests that such miRNAs might be involved in post-transcriptional regulation of developmental genes [18].

In plants, as in animal systems, miRNAs are generated as single-stranded 20-24-nucleotide species by several proteins, such as Dicer and Argonaute (Ago). Ago proteins are components of the silencing effector complexes that bind siRNAs and miRNAs. miRNAs act in *trans* on cellular target transcripts to induce their degradation via cleavage or to attenuate protein production. Based on a computational genome analysis in *Arabidopsis*, it has been estimated that there are about 100 miRNA loci; some of them are conserved between *Arabidopsis* and *Lotus, Medicago* or *Populus* but are not found in rice [30]. Currently, there are numerous known plant miRNAs and, in several cases, the target mRNA has been experimentally validated by expression of a miRNA-resistant target gene with silent mutations in the putative miRNA complementary region [30]. In *Arabidopsis*, many miRNAs have been identified that correspond to the mRNAs for transcription factors and other proteins involved in gene regulation [61-63]. For example, *miR159* and its putative target, the mRNA of the transcription factor MYB33, are regulated by the hormone gibberellic acid [27]. A gibberellic acid stimulus could lead to an increase in MYB33 mRNA that would initiate flowering and, directly or indirectly, to an increase in *miR159*. A similar mechanism has been identified for miR177, which targets the mRNA of a GRAS transcription factor [62].

#### *2.4.3. Other molecules involved in RNAi processing*

In addition to endogenous miRNAs and exogenous siRNAs, several other classes of small RNAs have been identified, such as *trans*-acting siRNAs (tasiRNAs), repeat-associated siRNAs (rasiRNAs), small-scan (scn)RNAs and Piwi-interacting (pi)RNAs. tasiRNAs are small (~21-nt) RNAs that have been reported in plants and are encoded in intergenic regions that correspond to both the sense and antisense strands [64-65]. In *Arabidopsis thaliana*, tasiRNAs require components of the miRNA machinery and cleave their target mRNAs in *trans* [64-65]. These siRNAs repress gene expression through post-transcriptional gene silencing in plants and are transcribed from the genome to form a polyadenylated, double-stranded precursor. rasiRNAs that match sense and antisense sequences could be involved in transcriptional gene silencing in *Schizosaccharomyces pombe* and *A. thaliana* [66-68]. scnRNAs are ~28-nt RNAs that have been found in *Tetrahymena thermophila* and might be involved in scanning DNA sequences in order to induce genome rearrangement [69]. piRNAs are different from miRNAs and are possibly important in mammalian gametogenesis [70]. They are small (~26-31-nt) RNAs that bind to MILI and MIWI proteins, a subgroup of Argonaute proteins that belong to the Piwi family and are essential for spermatogenesis in mice.

#### **2.5. Factors and proteins involved in silencing**

#### *2.5.1. Dicers*


Members of the Dicer family, which show specificity for cleavage of dsRNAs, play a central role in processing the dsRNA precursors of miRNAs and siRNAs. Processing of dsRNAs by Dicer yields RNA duplexes of 21 nucleotides with 3' overhangs of 2 to 3 nucleotides and 5'-phosphate and 3'-hydroxyl termini [12]. Dicer, named DCR in *Drosophila* and DCL in *Arabidopsis*, has four distinct domains: an amino-terminal helicase domain, dual RNase III motifs, a dsRNA-binding domain, and a PAZ domain (a 110-amino-acid domain present in proteins like Piwi, Argo, and Zwille/Pinhead), which it shares with the RDE1/QDE2/Argonaute family of proteins that has been genetically linked to RNAi by independent studies [71-72]. Cleavage by Dicer is thought to be catalyzed by its tandem RNase III domains. Some DCR proteins, including the one from *D. melanogaster*, contain an ATP-binding motif along with the DEAD box RNA helicase domain. In *Arabidopsis thaliana*, four Dicer-like proteins (DCL1, DCL2, DCL3 and DCL4) have been identified and are involved in the processing of dsRNAs coming from different sources [73]. For example, DCL2 is required for production of siRNAs from plant viruses, while DCL3 is involved in production of rasiRNAs [73]. In contrast, in *C. elegans* and mammals a single Dicer gene, DCR-1, has been identified.
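The geometry of the Dicer products described above (21-nucleotide strands with 3' overhangs) can be made concrete with a short Python sketch. It is a deliberately simplified model under stated assumptions: the dsRNA is perfectly paired, cuts are strictly phased every 21 nt, the overhang is fixed at 2 nt (the text gives 2 to 3), and the 5'-phosphate and 3'-hydroxyl chemistry is not represented. The function names and the input sequence are hypothetical.

```python
# Simplified model of Dicer processing: a long, perfectly paired dsRNA is
# cut into 21-nt duplexes whose strands each carry a 2-nt 3' overhang.
# All sequences are written 5'->3'.

RNA_COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return seq.upper().translate(RNA_COMPLEMENT)[::-1]


def dice(sense: str, size: int = 21, overhang: int = 2):
    """Return a list of (sense_fragment, antisense_fragment) pairs.

    The antisense fragment is shifted by `overhang` nucleotides relative to
    the sense fragment, so each duplex has `size - overhang` paired positions
    and an unpaired 3' overhang on both strands.
    """
    duplexes = []
    start = overhang  # leave room for the first antisense 3' overhang
    while start + size <= len(sense):
        sense_fragment = sense[start:start + size]
        antisense_fragment = reverse_complement_rna(
            sense[start - overhang:start - overhang + size]
        )
        duplexes.append((sense_fragment, antisense_fragment))
        start += size
    return duplexes


# Hypothetical 65-nt sense strand, used only to exercise the function.
long_sense = ("AUGGCUACGUUAGC" * 5)[:65]
duplexes = dice(long_sense)
for sense_fragment, antisense_fragment in duplexes:
    print(sense_fragment, antisense_fragment)
```

With the default parameters each duplex has 19 paired positions, and the last two nucleotides at the 3' end of each strand remain unpaired.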

#### *2.5.2. RISC complex*

The RISC complex, the effector of RNAi silencing, is a multi-protein complex of which several components have been identified. One of the proteins identified in almost all organisms is the AGO protein, which is essential for mRNA silencing activity [74]. In plants, 10 AGO family members have been identified. For example, AGO1 mutant plants develop distinctive developmental defects: miRNAs accumulate in these mutants, but cleavage of the target mRNA no longer occurs [75]. AGO4 has a role in the production of long siRNAs of 24 bp, and it was reported early on that AGO4 is involved in long-siRNA-mediated chromatin modifications (histone methylation and non-CpG DNA methylation) [76]. In addition to AGO family members, several other proteins associated with the RISC complex have been identified in vertebrate and invertebrate models. Examples include the *Drosophila* homologue of the fragile X mental retardation protein (FMRP); R2D2, found in *Drosophila* and thought to facilitate the passage of the Dicer substrate to the RISC; and members of the mammalian Gemin family, some of which are thought to have helicase activity [44].


In plants, many efforts have concentrated on improving nutritional content using classical breeding approaches, such as selection of natural or induced genetic variation, or by genetic engineering of transgenic plants [83]. Genetic engineering technologies have advantages over classical breeding not only because they increase the scope of genes and the types of mutation that can be manipulated, but also because they make it possible to control the spatial and temporal expression patterns of the genes of interest [84]. The delivery of siRNAs in plants has always been achieved by expressing hairpin RNAs that fold back to create a double-stranded region that is recognized by the Dicer-like enzyme. Figure 2 depicts a typical RNAi construct in plants, with the promoter region, the inverted repeats of the target gene in the appropriate orientation, the spacer region that separates the two inverted repeat sequences, and the terminator region.

**Figure 2.** A schematic representation of an RNAi vector cassette with the promoter and terminator regions, the target gene inverted repeats and the intron spacer region; the arrows represent the direction of transcription
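The layout of such a hairpin cassette can be sketched in a few lines of Python. This is only an illustration of the element order (promoter, sense arm, spacer, antisense arm, terminator) under simplifying assumptions: the sequences are short hypothetical placeholders, the promoter and terminator are represented by labels rather than real sequences, and no cloning sites or vector backbone are considered. The class and function names are not taken from any published tool.

```python
from dataclasses import dataclass

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_dna(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class HairpinCassette:
    promoter: str
    sense_arm: str
    spacer: str
    antisense_arm: str
    terminator: str

    def transcribed_region(self) -> str:
        """Sequence between promoter and terminator (coding-strand view)."""
        return self.sense_arm + self.spacer + self.antisense_arm


def build_hairpin_cassette(target_fragment: str, spacer: str,
                           promoter: str, terminator: str) -> HairpinCassette:
    """Arrange a target fragment as inverted repeats around a spacer."""
    sense = target_fragment.upper()
    return HairpinCassette(
        promoter=promoter,
        sense_arm=sense,
        spacer=spacer.upper(),
        # The second repeat is the reverse complement of the first, so the
        # transcript can fold back on itself into a double-stranded stem.
        antisense_arm=reverse_complement_dna(sense),
        terminator=terminator,
    )


cassette = build_hairpin_cassette(
    target_fragment="ATGGCTTCAGGATCCAAGTTGACC",  # hypothetical target fragment
    spacer="GTAAGTTTCTGCAG",                     # hypothetical spacer stand-in
    promoter="<promoter>",                       # label only, not a sequence
    terminator="<terminator>",                   # label only, not a sequence
)
print(cassette.transcribed_region())
```

Because the second arm is the reverse complement of the first, the two arms of the transcript can pair with each other while the spacer forms the loop, which gives the double-stranded region mentioned in the text.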

The double-stranded RNAs generated through aberrant gene expression from a foreign gene, viral infection, or tandem repeat sequences formed by insertion of a transposon/retrotransposon are recognized by Dicer and digested into small interfering RNAs, which function as templates for the targeted degradation of mRNA in RISC. These siRNAs also function as primers for RdRP to amplify secondary dsRNA. In plants, moreover, RNAi is both systemic and heritable: siRNAs move between cells through channels in cell walls, thus enabling communication and transport throughout the plant. In addition, methylation of promoters targeted by RNAi confers heritability, as the new methylation pattern is copied in each new generation of the cell [85].

Genetic transformation via *Agrobacterium* or by particle bombardment, infection of plants with viruses that express the dsRNAs, and infiltration of *Agrobacterium* harboring the hairpin cassette for transient gene silencing are the common methods for inducing gene silencing in plant systems. Transgene expression should be evaluated as soon as possible for each transgenic event, and over multiple generations, to ensure that each line is stably silencing its target. Many transgenic events should be generated and analyzed, and the lines with active transgenes that effectively induce silencing can be selected and maintained.


#### *2.5.3. RNA-directed RNA polymerase (RdRP)*

In both plants and *C. elegans*, RNAi/PTGS requires proteins similar in sequence to a tomato RNA-directed RNA polymerase [77]. In *Arabidopsis*, the RdRP homologue SDE1/SGS2 is required for transgene silencing, but not for virally induced gene silencing (VIGS) [78]. This may suggest that SDE1/SGS2 acts as an RdRP, since viral replicases could substitute for this function in VIGS. In *Neurospora*, the RdRP homologue QDE-1 is required for efficient quelling [79]. EGO-1, one of the *C. elegans* RdRPs, is essential for RNAi in the germline of the worm [80], and another RdRP homologue, RRF-1/RDE-9, is required for silencing in the soma. All RdRP proteins could be involved in amplifying the RNAi signal. However, only the tomato and *Neurospora* enzymes have been demonstrated to possess RNA polymerase activity, and biochemical studies are required to establish definitively the role of these proteins in RNAi [81].

#### *2.5.4. Putative helicase*

Other proteins, helicases have been identified in plants (e.g *sde3* in *Arabidopsis*) [78]. Although possible roles in RNAi for some of these proteins were proposed, e.g. MUT6 might involved in the degradation of misprocessed aberrant RNAs [81], their functions are mostly unknown and further biochemical experiments are needed to reveal their exact roles in RNAi. The quelling-defective mutant in *Neurospora*, *qde3*, was cloned and the sequence encodes a 1,955 amino acid protein. This protein shows homology with the family of RecQ DNA helicases, which includes the human proteins for Bloom syndrome and Werner syndrome.

#### **3. Applications of RNAi in plant systems**

RNAi has been used as new tool to reduce the expression of a particular gene in mammalian and plant cell systems, to analyze the effect that gene has on cellular function, and also it has the potential to be exploited therapeutically and clinical trials. However, by using RNAi, scientists can quickly and easily reduce the expression of a particular gene in mammalian and plant cell systems, often by 90% or greater, to analyze the effect that gene has on cellular function [18, 49, 82].

#### **3.1. Development of efficient RNAi vector cassettes**

76 Functional Genomics

*2.5.2. RISC complex* 

*2.5.4. Putative helicase* 

RICS complex, the effector of RNAi silencing is a multi-protein complex of which several components were identified. One of the proteins identified in almost all organisms is AGO protein that is essential for mRNA silencing activity [74]. In plants 10 AGO member proteins have been identified. For example, AGO1 mutant plants have been found to develop distinctive developmental defects. miRNAs are accumulated in these mutants but the cleavage of target mRNA not longer occur [75]. AGO4 has role in the production of long siRNAs of 24bp and it was early reported that AGO4 is involved in long siRNA mediated chromatin modifications (histone methylation and non-CpG DNA methylation) [76]. In addition to AGO family members, several other proteins associated with RISC complex have been identified in vertebrate and invertebrate models. For example, the *Drosophila*  homologue of the fragile X mental retardation protein (FMRP); R2D2, found in *Drosophila*  and thought to facilitate the passage of the Dicer substrate to the RISC; members of the

mammalian Gemin family, some of which are thought to have helicase activity [44].

required to establish definitively the role of these proteins in RNAi [81].

which includes the human proteins for Bloom syndrome and Werner syndrome.

**3. Applications of RNAi in plant systems** 

In both plants and *C. elegans*, RNAi/PTGS requires proteins similar in sequence to a tomato RNA-directed RNA polymerase [77]. In *Arabidopsis*, RdRP homologue SDE1/SGS2 is required for transgene silencing, but not for virally induced gene silencing [78]. This may suggest that SDE1/SGS2 act as an RdRP, since viral replicases could substitute for this function in VIGS. In *Neurospora*, RdRP homologue QDE-1 is required for efficient quelling [79]. EGO-1, one of the *C. elegans* RdRP, is essential for RNAi in the germline of the worm [80], and another RdRP homologue, RRF-1/RDE-9, is required for silencing in the soma. All RdRP proteins could be involved in amplifying the RNAi signal. However, only the tomato and *Neurospora* enzymes have been demonstrated to posses RNA polymerase activity, and biochemical studies are

Other proteins, helicases have been identified in plants (e.g *sde3* in *Arabidopsis*) [78]. Although possible roles in RNAi for some of these proteins were proposed, e.g. MUT6 might involved in the degradation of misprocessed aberrant RNAs [81], their functions are mostly unknown and further biochemical experiments are needed to reveal their exact roles in RNAi. The quelling-defective mutant in *Neurospora*, *qde3*, was cloned and the sequence encodes a 1,955 amino acid protein. This protein shows homology with the family of RecQ DNA helicases,

RNAi has been used as new tool to reduce the expression of a particular gene in mammalian and plant cell systems, to analyze the effect that gene has on cellular function, and also it has the potential to be exploited therapeutically and clinical trials. However, by using RNAi,

*2.5.3. RNA-directed RNA polymerase (RdRP)* 

In plants, many efforts have concentrated on improving nutritional content using classical breeding approaches, such as selection of natural or induced genetic variation, or by genetic engineering of transgenic plants [83]. Genetic engineering technologies have advantages over classical breeding not only because they increase the scope of genes and the types of mutation that can be manipulated, but also because they can control the spatial and temporal expression patterns of the genes of interest [84]. The delivery of siRNAs in plants has always been achieved by expressing hairpin RNAs that fold back to create a double-stranded region that is recognized by the Dicer-like enzyme. Figure 2 depicts a typical RNAi construct in plants, with the promoter region, the inverted repeats of the target gene in the appropriate orientation, the spacer region that separates the two inverted repeat sequences, and the terminator region.

**Figure 2.** A schematic representation of an RNAi vector cassette with the promoter and terminator regions, the target inverted repeats, and the intron spacer region; the arrows represent the direction of transcription
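The arrangement of the transcribed part of such a cassette (sense arm, spacer, antisense arm) can be illustrated with a minimal Python sketch. The function names and both sequences are hypothetical placeholders chosen for illustration, not real gene fragments or a published design tool, and the promoter and terminator are left out.

```python
# Hypothetical illustration: the sequences below are made up, not real gene fragments.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_hairpin_insert(target_fragment, spacer):
    """Join sense arm, spacer, and antisense arm into one inverted-repeat insert."""
    sense = target_fragment.upper()
    antisense = reverse_complement(sense)
    # The spacer is written in lower case so the loop stands out from the two arms.
    return sense + spacer.lower() + antisense


def arms_are_inverted_repeat(insert, arm_length):
    """Check that the last arm_length bases are the reverse complement of the first."""
    left = insert[:arm_length].upper()
    right = insert[-arm_length:].upper()
    return reverse_complement(right) == left


if __name__ == "__main__":
    fragment = "ATGGCTAGCTTGACCGGATC"  # placeholder target-gene fragment
    spacer = "GTAAGTTTCTGCTTCTACAG"    # placeholder intron-like spacer
    insert = build_hairpin_insert(fragment, spacer)
    print(insert)
    print(arms_are_inverted_repeat(insert, len(fragment)))
```

When such an insert is transcribed, the two arms can base-pair with each other while the spacer forms the loop, which is the fold-back structure described above.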

The double-stranded RNA generated through aberrant gene expression from a foreign gene, viral infection, or tandem repeat sequences formed by insertion of a transposon/retrotransposon is recognized by Dicer and digested into small interfering RNAs, which function as templates for the targeted degradation of mRNA in RISC. These siRNAs also function as primers for RdRP to amplify secondary dsRNA. In plants, moreover, RNAi is both systemic and heritable: siRNAs move between cells through channels in cell walls, thus enabling communication and transport throughout the plant. In addition, methylation of promoters targeted by RNAi confers heritability, as the new methylation pattern is copied in each new generation of the cell [85].

Genetic transformation via *Agrobacterium* or by particle bombardment, infection of plants with viruses that can express the dsRNAs, and infiltration of *Agrobacterium* harboring the hairpin cassette for transient gene silencing are the common methods for inducing gene silencing in plant systems. Transgene expression should be evaluated as soon as possible for each transgenic event, and over multiple generations, to ensure that each line is stably silencing its target. Many transgenic events should be generated and analyzed so that the lines with active transgenes that effectively induce silencing can be selected and maintained.

Currently, several vectors for RNAi silencing that make use of *Agrobacterium*-mediated delivery, artificially introduced dsRNA and/or VIGS in plants have been reported. For example, in 2007, an *Arabidopsis* genomic RNAi knock-out line analysis consortium (AGRIKOLA) was launched, which uses PCR products to generate gene-specific RNAi constructs for each *Arabidopsis* gene for use in large-scale gene silencing studies [86-87]; another consortium, CATMA (Complete *Arabidopsis* Transcriptome MicroArray), is generating gene sequence tags (GSTs) representing each *Arabidopsis* gene, designed so that they hybridize on *Arabidopsis* cDNA microarrays in a gene-specific manner; the *Medicago truncatula* RNAi database (https://mtrnai.msi.umn.edu/) is an NSF-funded project planning to silence 1500 genes involved in symbiosis in this model legume; and amiRNAi Central (http://www.agrikola.org) is an NSF-funded project that provides a comprehensive resource for knockdown of *Arabidopsis* genes.

Moreover, a set of binary vectors, called ChromDB's RNAi vectors, was designed for producing dominant negative RNAi mutants using a target sequence cloning strategy based on the inclusion of two restriction enzyme cleavage sites in each of the two primers used to amplify gene-specific fragments from cDNA. This design minimizes the number of PCR primers and places unique restriction enzyme recognition sites so as to allow flexibility in future manipulations of the plasmid, *e*.*g*., moving the inverted repeat target sequence to a different vector (Chrom database). These vectors are based on pCAMBIA binary vectors, a set of plasmids developed by the Center for Application of Molecular Biology to International Agriculture (CAMBIA).

The pHELLSGATE high-throughput gene silencing vector and a high-throughput tobacco rattle virus (TRV)-based VIGS vector are binary vectors that have been used for expression of GUS and GFP proteins. These vectors are based on the Gateway recombination technology developed by Invitrogen, which replaces the conventional cloning strategy and is itself based on the phage lambda system of recombination. It enables segments of DNA to be transferred between different vectors while orientation and reading frame are maintained, and it can also be used to transfer PCR products. It saves valuable time because, once the DNA has been cloned into a Gateway vector, it can be used in as many genome function analysis systems as required. In this way, the use of vectors in plant functional genomics has become both easier and faster, which allows higher-throughput analysis [88].

#### **3.2. RNAi and functional genomic studies**

An important application of RNAi in functional genomics studies is to generate lines that are deficient in the activity of a subset of genes and then to test the knockdown lines for a specific phenotype. Assessing a specific phenotype requires the presence of a specific allele of marker genes, and several generations of crosses are necessary to select a specific mutant allele for a specific genotype. RNAi technology has the advantage that a specific gene can be silenced if the target sequence is well chosen. Because RNAi is a homology-dependent process, careful selection of a unique or conserved region of the target gene ensures that a specific member, or several members, of a multigene family can be silenced. For example, RNAi can down-regulate specific target sequences when the 3'UTR region is used as a trigger sequence [89-90]. It has also been shown that RNAi facilitates the generation of dominant loss-of-function mutations in polyploid plants, even with short dsRNAs 37 nucleotides long [91].
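Because silencing depends on sequence homology, one simple way to screen a candidate trigger region is to count the short exact matches it shares with each member of the gene family. The Python sketch below is a deliberately simplified illustration: the window length of 21 nt is an assumed default rather than a value taken from this chapter, the function names are hypothetical, and only exact matches on one strand are counted, whereas real off-target silencing can tolerate mismatches.

```python
def kmers(seq, k=21):
    """Return the set of all length-k substrings of a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(trigger, family, k=21):
    """Count k-mers shared between the trigger and each family member.

    family maps a gene name to its transcript sequence.
    """
    trigger_kmers = kmers(trigger, k)
    return {name: len(trigger_kmers & kmers(seq, k)) for name, seq in family.items()}


def is_member_specific(trigger, family, intended, k=21):
    """True if the trigger shares k-mers with the intended gene and with no other."""
    counts = shared_kmer_counts(trigger, family, k)
    hits_intended = counts[intended] > 0
    hits_others = any(n > 0 for name, n in counts.items() if name != intended)
    return hits_intended and not hits_others


if __name__ == "__main__":
    # Made-up toy sequences; k is shortened to 4 only so the example stays readable.
    family = {"geneA": "TTACGTACGGTT", "geneB": "GGGGCCCCGGGG"}
    print(shared_kmer_counts("ACGTACGG", family, k=4))
    print(is_member_specific("ACGTACGG", family, "geneA", k=4))
```

Under these assumptions, a trigger taken from a unique region such as a 3'UTR would share windows only with its intended gene, whereas a trigger from a conserved region would report matches against several family members.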

Nowadays, numerous projects have been launched to produce siRNAs that will silence essential genes in insects, nematodes and pathogens using an approach called hdRNAi (host-delivered RNAi) [92-94], based on partial sequence similarities between plant and animal genes. A limitation of this approach is that some unexpected genes can be silenced, with consequences for the organism itself and also for the environment.

RNAi has also been used to improve the nutritional value of some important crops. For example, to decrease the levels of natural toxins in food plants, a stable, heritable and distinct siRNA against a gene required for toxin production could be used. Cottonseeds are rich in dietary protein but unpalatable to humans because they contain a naturally toxic terpenoid called gossypol. RNAi has been used to minimize the levels of delta-cadinene synthase, an enzyme crucial for the production of gossypol [95].

RNAi technology has also been applied to barley to develop varieties resistant to BYDV (barley yellow dwarf virus) [96]. In rice, RNAi has been used to reduce the level of glutelin and produce rice varieties with low glutelin content [97]. Soybeans can be engineered to produce oil with low levels of polyunsaturated fatty acids through a reduction of FAD2, a fatty acyl Δ12 desaturase. This enzyme converts the monounsaturated fatty acid oleic acid (18:1Δ9) to linoleic acid (18:2Δ9, Δ12), which can subsequently be desaturated to α-linolenic acid (18:3Δ9, Δ12, Δ15) by FAD3 [98]. A reduction in polyunsaturated fatty acid levels from >65% of the total oil content in normal soybean oil to less than 5% was observed [99]. In an attempt to target FAD2-1 specifically, and not related family members, the soybean FAD2-1A intron was tested as an RNAi trigger, resulting in a reduction in polyunsaturated fatty acids in the seeds to about 20% [100]. This result was surprising, given that intron sequences are removed from precursor mRNAs (pre-mRNAs) by splicing in the nucleus and are thus spatially separated from the cytoplasm, where mature mRNAs are presumed to be targeted by the PTGS machinery [101].

There are also some limitations to using RNAi in functional genomics studies. First, unlike insertional mutagenesis, RNAi requires the exact sequence of the target gene. Second, the delivery method is very important, since some species are easily transformable and others are not; further improvement of delivery methods and of vectors that can be used safely and reliably is needed. There have also been reports of difficulty in detecting mutants with subtle changes in gene expression. However, in plants, numerous marker genes are being developed that will indicate whether a change in gene expression has occurred [102].

#### **3.3. RNAi and viral infections**

RNA silencing in plants prevents viral accumulation and, accordingly, viruses have evolved several strategies to counteract this defense mechanism. A viral protein, HC-Pro (helper component proteinase), was shown to mediate one class of viral synergistic disease [103], and expression of this protein in transgenic plants allows the accumulation of heterologous viruses beyond the normal level, suggesting that HC-Pro blocks the plant defence mechanism [104]. Several methods are known for identifying viral suppressor proteins, such as the transient expression assay, the reversal of silencing assay and the stable expression assay.

A well-known method for studying transient expression is the co-infiltration method using *Agrobacterium* strains: one strain induces RNA silencing of a reporter gene such as GFP (green fluorescent protein), and another expresses the candidate suppressor gene. Both strains are infiltrated into plant tissue such as tobacco leaves, which are suited to producing large amounts of protein in response to agro-infiltration. If local silencing is triggered, the effect can be evaluated under UV light three days after infiltration. If the candidate suppressor expressed from the co-infiltrated *Agrobacterium* interferes with RNA silencing, the tissue remains bright green; if not, the tissue turns red as GFP is silenced and chlorophyll autofluorescence dominates [105]. In the reversal of silencing approach, candidates that may suppress silencing are identified. Several studies have shown that viral suppressor proteins play an important role in this defense mechanism [106]. For two suppressor proteins, p21 encoded by beet western yellow virus [107] and p19 encoded by the tomato bushy stunt virus (TBSV) group [108], the molecular mechanism has been identified.

In the stable expression assay, a stable line expressing a suppressor candidate is crossed with several lines silenced for a reporter gene [109-110]. This method is advantageous because it provides information about the molecular mechanisms of suppression and is also suited to investigating the role of suppression in systemic silencing using grafting [111].

The finding that certain viral proteins suppress RNA silencing also provides a new tool for biotechnological applications. With silencing under control, many transgenic plants can be generated to produce desired plant traits or very high levels of expression, so that the plant can be used as a factory for producing pharmaceutical compounds, vaccines and other gene products.

#### **4. Application of RNAi in animal systems**

#### **4.1. RNAi and medicine**

The ability to trigger RNAi in mammals was first demonstrated by microinjection of long dsRNA into mouse oocytes and one-cell-stage embryos [19]. This work demonstrated that the antiviral interferon response to long dsRNAs is not yet functional in early mouse embryos. It was then discovered rather quickly that chemically synthesized siRNAs could trigger sequence-specific silencing in cultured mammalian cells without inducing the interferon response [21]. Starting from this important breakthrough, RNAi has emerged as a powerful experimental tool for analyzing mammalian systems.

The ability of RNAi to ablate gene expression has opened up the possibility of using collections of siRNAs to analyze the role of hundreds or thousands of different genes whose expression is known to be up-regulated in a disease, given an appropriate tissue culture model of that disease. Libraries of RNAi reagents can be used in one of two ways. One is in a high-throughput manner, in which each gene in the genome is knocked down one at a time. The other is to use large pools of RNA interference viral vectors and apply a selective pressure such that only cells with the desired change in behavior can survive [112]. Rapid progress in the application of RNAi to mammalian cells, including neurons and muscle cells, offers new approaches to drug target identification. Advances in targeted delivery of RNAi-inducing molecules have raised the possibility of using RNAi directly as a therapy for a variety of human genetic disorders.

#### **4.2. RNAi and therapy**

Considering the gene-specific features of RNAi, it is conceivable that this method will be very useful for therapeutic applications. The methods currently used for RNAi-based therapy are direct transfection of siRNAs into cells; creation of an expression construct in which a promoter drives the production of both the sense and antisense siRNA strands, which then hybridize in the cell to produce the double-stranded siRNA; and the use of viral vectors to infect cells with an expression construct.

Nonetheless, this hypothesis rests on the assumption that the effect of exogenous siRNA applications will remain gene-specific and will not show nonspecific side effects arising from mismatched off-target hybridization or protein binding to nucleic acids. Several research groups have explored the use of RNAi to limit infection by viruses in cultured cells. There is a huge potential for using RNAi for the treatment of viral diseases such as those caused by the human immunodeficiency virus (HIV) and the hepatitis C virus.

RNAi strategies against HIV can address multiple targets. For example, siRNAs have been directed against several regions of the HIV-1 genome, including the viral long terminal repeat (LTR) and the accessory genes *vif* and *nef* [113-115]. Using Magi cells (CD4-positive HeLa cells) as a model system, the authors demonstrated a sequence-specific reduction of >95% in viral infection after cotransfection of siRNAs with HIV-1 proviral DNA. When the same assay was performed in primary peripheral blood lymphocytes, which are natural targets for HIV-1, the frequency of infected cells was also substantially reduced. Other possible targets are those that block entry of the virus into the cell or disrupt the viral replication cycle inside the cell. This technology will help researchers dissect the biology of HIV infection and design drugs based on this molecular information [116]. Researchers from the City of Hope Cancer Center in Duarte have developed a DNA-based delivery system in which human cells are engineered to produce siRNA against the viral Rev protein, which is important in causing AIDS [117].

The delivery of siRNA to HIV-infected T lymphocytes, monocytes and macrophages is a challenge. As synthetic siRNAs do not persist for long periods in cells, they would have to be delivered repeatedly for years to treat the infection. Systemic delivery of siRNAs to lymphocytes is not feasible owing to the huge number of these cells. Therefore, the preferred method is to isolate T cells from patients. In a clinical trial, T cells from HIV-infected individuals were transduced ex vivo with a lentiviral vector encoding an anti-HIV antisense RNA and then reinfused into the patients [118]. In another study, it was reported that GFP siRNA-induced silencing of transiently or stably expressed GFP mRNA was highly specific in the human embryonic kidney (HEK) 293 cell background [119]. Further studies in the human non-small cell lung carcinoma cell line H1299 have shown that siRNAs corresponding to *akt1*, *rb1* and *plk1* can serve as highly specific tools for targeted gene knockdown, and can be used in high-throughput approaches and drug target validation.


RNAi is also being explored as an antiviral therapy against diseases caused by herpes simplex virus type 2, hepatitis A and hepatitis B. Early RNAi studies noted that RNA silencing was prominent in the liver, which made this organ an attractive target for therapeutic approaches. The vaccine against HBV is used only for prevention, and there is no vaccine for HCV. McCaffrey and co-workers (2003) [120] demonstrated that a significant knockdown of the HBV core antigen in liver hepatocytes could be achieved by siRNA, providing an important proof of principle for future antiviral applications of RNAi. They developed a transient model of HBV infection in which a plasmid containing approximately 1-3 copies of the HBV genome (pTHBV2) was introduced into the livers of mice by hydrodynamic transfection. This results in production of all four families of viral mRNAs, including the pregenomic RNA. The pregenomic RNA is the template for the viral reverse transcriptase, which synthesizes new viral DNA. All four viral proteins are also made. Transient viral replication occurred in the mouse liver for about 1 week. In addition, RNAi has achieved regression of clinical traits in a neurodegenerative disease model [121], but its potential for use in pharmaceutical target validation and as a therapeutic tool is still being evaluated.

The ability to induce RNAi across mucosal surfaces has also been investigated as a means of treating sexually transmitted diseases [122]. In another study, siRNA targeting tumor necrosis factor alpha was injected into the joints of mice with collagen-induced arthritis (CIA), and the development of arthritis was scored by assessing joint inflammation in the mouse paw; in the treated mice, joint inflammation was successfully inhibited [123]. Antiviral RNAi therapeutics have already entered human clinical trials and will hopefully prove to be safe and efficacious.

#### **4.3. RNAi and cancer**

The discovery of RNAi led to the realization that the RNAi machinery is also involved in normal gene regulation through the action of a class of small RNAs known as microRNAs (miRNAs). There is experimental evidence that miRNAs regulate cell division, differentiation, cell fate decisions, development, oncogenesis, apoptosis and many other processes [35]. miRNA levels are also dramatically shifted in various cancers, and miRNAs can act as oncogenes [35]. It is now clear that miRNAs represent a gene regulatory network of enormous significance. The miRNA expression profile is highly specific for a particular tissue type and stage of cell differentiation [124]. Impaired miRNA function, which occurs during tumor transformation, could be interpreted as a consequence rather than a cause of the loss of cell identity. However, the detection of chromosomal rearrangements such as deletions, local amplifications and chromosomal breakpoints in the region of miRNA genes (causing impaired miRNA expression during carcinogenesis) is a good demonstration of the direct role of miRNAs in these processes.

Defects in miRNA expression could cause the development of cancer through impaired regulation of the oncoproteins or tumor suppressors controlled by these miRNAs. One of the first identified miRNAs, let-7, found in *C. elegans* as well as in humans, is a tumor suppressor that decreases formation of the oncoproteins Ras and HMGA2 (High Mobility Group protein A2). In patients with lung cancer, an inverse correlation was found between expression of let-7 and Ras/HMGA2 [125]. miR-155 is an example of a miRNA exhibiting oncogenic properties. This miRNA is required for B- and T-cell function [126]. Increased expression of miR-155 has been observed in paediatric Burkitt's lymphoma, diffuse large cell lymphoma, Hodgkin's disease, and lung, breast and pancreatic cancers [124]. In some tumor cells miR-21 and miR-24 act as oncogenes, and in others they act as tumor suppressors. In HeLa cells, inhibition of miR-21 or miR-24 activity by modified anti-miR oligonucleotides accelerated proliferation [127]. Analysis of bioinformatic predictions of putative targets suggests that both oncogenic and anti-oncogenic activity is typical of various miRNAs [128].


Defects in miRNA expression associated with carcinogenesis are caused not only by chromosomal rearrangements but also by impairments in the machinery responsible for miRNA formation and processing. Inhibition of expression of the ribonucleases Dicer and Drosha by complementary siRNA accelerated the growth of lung adenocarcinoma cells in a murine model [129]. Tumor transformation may also be triggered by a primary impairment in the regulation of a single miRNA, which is then accompanied by an imbalance in the entire miRNA network [124]. Because miRNA function is altered in carcinogenesis, miRNA expression can be measured to diagnose tumor origin: each type of cancer is characterized by a certain pattern of miRNA expression [124]. Moreover, the evaluation of miRNA profiles can be used for prognosis of tumor development [130].
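The idea that each tumor type carries a characteristic miRNA expression pattern can be sketched as a simple profile-matching step. The Python code below is a minimal illustration under stated assumptions, not the procedure used in the cited work: a sample profile is assigned to whichever reference profile it correlates with best (Pearson correlation). The tumor type labels and all expression values are invented toy numbers, and the function names are hypothetical.

```python
# Minimal sketch: assign a sample to the reference miRNA profile
# it correlates with best. All names and values are illustrative.
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def closest_profile(sample, references):
    """Return the name of the reference profile most correlated with sample."""
    return max(references, key=lambda name: pearson(sample, references[name]))


# Toy expression levels for four miRNAs, listed in the same order everywhere.
references = {
    "tumor_type_A": [8.0, 2.0, 5.0, 1.0],
    "tumor_type_B": [1.0, 7.0, 2.0, 6.0],
}
sample = [7.5, 2.5, 4.0, 1.5]

print(closest_profile(sample, references))
```

Real diagnostic classifiers are trained and validated on large profiling datasets; the sketch only shows why a tissue-specific expression pattern is informative about tumor origin.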

The importance of epigenomic modifications of chromatin structure for the development of tumors has been recognized [13]. Methylation of cytosine in DNA, which yields 5-methylcytosine, occurs in the dinucleotide sequence CpG. Methylation can suppress transcription by attracting proteins that recognize methylated CpG. Excessive methylation of the CpG islands of miRNA genes has been found in cancer cells [131]. The authors attribute the effect of this methylation to the silencing of miRNA genes that act as tumor suppressors. DNA methylation may be closely associated with the modification of chromatin histones. Histone modifications such as acetylation, methylation and phosphorylation of specific amino acid residues are involved in the regulation of gene expression. Impaired histone modification in tumor cells has been found in many studies [132]. Epigenomic silencing may also involve another type of non-coding RNA, short siRNAs. This observation points to the existence of nuclear RNA interference acting at the level of chromatin, in addition to the cytoplasmic pathway, which acts on mRNA stability and translation. Some experimental data suggest that siRNA plays a role in gene silencing at the level of chromatin in human cancer cells. There are examples of short RNAs participating in epigenomic silencing that is coupled to DNA methylation, histone modification of target genes and recruitment of the heterochromatin protein HP1 to them. All these chromatin modifications typically occur in the cancer epigenome [124]. The role of RNAi is recognized not only in the silencing of proto-oncogenes or tumor suppressors but also in the maintenance of the heterochromatin structure of the centromeric region in mammalian cells [133].
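To make the notion of a CpG island more tangible, the following Python sketch computes two quantities commonly used to describe CpG-rich regions: the GC fraction and the observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G). This is an illustration added here rather than an analysis from the cited studies; the function name and the toy sequence are assumptions. A frequently used rule of thumb, also not taken from this chapter, treats a region of at least 200 bp with a GC fraction of at least 0.5 and a ratio of at least 0.6 as a candidate CpG island.

```python
# Minimal sketch: GC fraction and observed/expected CpG ratio of a DNA sequence.
# The function name and example sequence are illustrative.

def cpg_stats(seq: str) -> dict:
    """Return length, GC fraction and observed/expected CpG ratio."""
    s = seq.upper()
    length = len(s)
    c_count = s.count("C")
    g_count = s.count("G")
    cpg_count = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c_count + g_count) / length if length else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * length) / (c_count * g_count)
    else:
        obs_exp = 0.0  # ratio is undefined without both C and G; report 0
    return {"length": length, "gc_fraction": gc_fraction, "cpg_obs_exp": obs_exp}


# Toy sequence, far shorter than a real CpG island.
print(cpg_stats("CGCGCGAT"))
```

Methylation status itself cannot be read from the sequence; it has to be measured experimentally. The sketch only identifies regions where CpG methylation would be expected to matter.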


#### **4.4. Challenges for RNAi as a tool for disease investigations**

One of the advantages of RNAi over gene knockout is the ability to restrict gene knockdown to specific tissues or even cell types. This is important when a disease results from a mutation in an essential gene. The versatility of the technique has led to many applications. RNAi can be used in drug target validation, and it can target specific spliced exons, enabling investigation of the functional roles of alternatively spliced forms of a gene [134]. An important opportunity is the use of genome-wide RNAi screening to identify all candidate genes involved in a given physiological process [135].

RNAi can be applied to genetic model organisms such as *Drosophila*, *C. elegans* and mouse in order to investigate, and potentially to treat, human disorders. Several models of human neural and neuromuscular disorders are available in these three experimental systems. In *C. elegans*, RNAi knockdown of genes is used to mimic loss of function in order to elucidate the mechanisms of a number of neuromuscular and neurological diseases, such as Duchenne muscular dystrophy [136], the X-linked form of Emery-Dreifuss muscular dystrophy [137], spinal muscular atrophy [137], fragile X syndrome [138] and Alzheimer's disease [139]. RNAi has been used in combination with overexpression studies to examine the role of Parkin, an E3 ubiquitin ligase, in dopamine neuron degeneration in *Drosophila*, in order to investigate the molecular mechanisms underlying Parkinson's disease. Overexpression of Parkin was shown to degrade its substrate (Pael-R) and suppress its toxicity, whereas interfering with endogenous Parkin promoted substrate accumulation and augmented its neurotoxicity [140].

Other experiments have used *Drosophila* to investigate human neurodegenerative disorders caused by expansion of a CAG trinucleotide repeat. RNAi has been applied to two such diseases: Huntington's disease and spinobulbar muscular atrophy. *Drosophila* S2 cells were engineered to express a portion of the human androgen receptor (*AR*) gene with CAG tracts of 26, 43 or 106 repeats, tagged with green fluorescent protein (GFP) [141]. Cells carrying CAG tracts of 43 and 106 repeats developed GFP aggregates. When RNAi was directed against AR, an 80% loss of AR-GFP aggregates was observed in co-transfected S2 cells. Therefore, RNAi could have considerable therapeutic potential in neurodegenerative disorders [134]. A murine model has also been used to investigate the potential of RNAi as a therapeutic tool in neurodegenerative disorders: spinocerebellar ataxia type 1 (SCA1) has been successfully suppressed by RNAi in a mouse model of this disease [121].
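The repeat lengths quoted above (26, 43 and 106 CAG units) can be checked on a sequence with a few lines of code. The following Python sketch is illustrative only: it reports the number of units in the longest uninterrupted CAG tract. The function name and the flanking placeholder sequences are assumptions and do not correspond to the actual reporter constructs.

```python
# Minimal sketch: size of the longest uninterrupted CAG tract in a DNA sequence.
import re


def longest_cag_tract(seq: str) -> int:
    """Return the number of CAG units in the longest uninterrupted tract."""
    tracts = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(tract) // 3 for tract in tracts), default=0)


# Placeholder constructs: arbitrary flanks around tracts of the sizes
# mentioned in the text.
for repeats in (26, 43, 106):
    construct = "ATGGAAGTG" + "CAG" * repeats + "GAGACTAGC"
    print(repeats, longest_cag_tract(construct))
```

Interrupted tracts (for example a CAA codon inside a run of CAG) are counted as separate tracts by this sketch, which may or may not be the desired behavior depending on the question asked.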

RNAi might also allow future treatment of human diseases such as Lou Gehrig's disease (amyotrophic lateral sclerosis, ALS), which can be inherited as a genetically dominant disease. This pattern of inheritance allowed identification of a specific gene that is linked to the death of Betz cells in some families: the gene for an enzyme called superoxide dismutase. Superoxide dismutase can protect cells from molecular damage caused by free radicals of oxygen. Mutant forms of superoxide dismutase can lead to cell death [142].

The Nobel Prize-winning research on RNA interference might lead to new treatments for patients whose disease is due to dominant mutations in the superoxide dismutase gene. Reduction in the level of the superoxide dismutase enzyme encoded by the mutant gene has been studied in animal models [143]. Working with laboratory mice as an experimental model of human ALS, Miller and coworkers showed that loss of muscle function could be slowed using RNA interference [144]. This result was obtained by using a virus to induce RNA interference in neurons. These laboratory results suggest that if RNA-induced inhibition of mutant superoxide dismutase can be achieved in the correct cells of the brain and spinal cord, it might be possible to slow the progression of Lou Gehrig's disease in humans.


A large number of pharmaceutical and biotechnology companies have declared an interest in, or already have an active drug development program for, RNAi-based therapeutics to silence disease-associated genes. The website www.rnaiweb.com collects the latest research on the development of RNAi-based tools for drug target and gene function analysis. Companies active in the field include Sirna Therapeutics (Colorado) for macular degeneration; Avocel (Sunnyvale, California) for hepatitis C; Alnylam Pharmaceuticals (Cambridge) for Parkinson's disease; and CytRx (Los Angeles, California) for obesity, type II diabetes and ALS. However, the major challenge in turning RNAi into an effective therapeutic strategy is the delivery of the RNAi agents, whether they are synthetic short double-stranded RNAs or viral vectors directing production of double-stranded RNA.

During disease, the pattern of microRNA expression changes, and some of these changes are indicative of treatment outcome and disease progression. More exciting than the diagnostic value is the evidence that directly implicates miRNAs in a number of diseases (cancer, imprinting disorders, etc.). In this context there is increased interest in manipulating miRNAs for therapeutic purposes. Where miRNA function has been lost, one approach is to restore miRNA activity by introducing microRNA "mimics" with the same sequence as the natural miRNA. For example, by adding more of a microRNA named let-7, it has been possible to halt the multiplication of cancer cells [145].

A complementary approach to miRNA-based therapy is to inhibit the activity of disease-associated miRNAs. This can be achieved with antisense oligonucleotides that bind the miRNA through sequence complementarity and thereby inactivate it. Esau and colleagues demonstrated that inhibition of a miRNA may be a potential therapeutic approach to the treatment of disease [146]. They inhibited miR-122 expression with antagomirs, which resulted in reduced plasma cholesterol levels and a decrease in hepatic fatty acid and cholesterol synthesis rates in normal mice and in diet-induced obese mice. Targeting miR-122 with antagomirs thus inhibited disease development.

RNA interference has shown much promise in the laboratory. In principle, RNAi might be used to treat any disease that is linked to expression of an identified gene [112]. The most important challenge in turning RNA interference into an effective therapeutic strategy is the delivery of the RNA interference agents. Given sufficient research into delivery methods, some of these diseases will probably be treated effectively by RNAi-based therapeutics.

#### **5. Conclusions**

The study of RNAi has led to a revolution in the understanding of gene expression, and the examples from plant, animal and mammalian systems reviewed here show the diversity and potential of RNAi as a new approach to replace classical genetic technologies and manipulation.

RNAi Towards Functional Genomics Studies 87

[6] van der Vlugt RA, Ruiter RK, Goldbach R (1992) Evidence for sense RNA-mediated protection to PVYN in tobacco plants transformed with the viral coat protein cistron.

[7] Ratcliff FG, MacFarlane SA, Baulcombe DC (1999) Gene silencing without DNA. RNA-

[8] Zamore PD (2002) Ancient pathways programmed by small RNAs. Science 296: 1265-

[10] Fire A, Xu S, Montgomery M, Kostas S, Driver S, Mello C (1998) Potent and specific genetic interference by double-stranded RNA in *Caenorhabditis elegans. Nature* 391: 806-811. [11] Ketting RF, Plasterk RH, (2000) A genetic link between suppression and RNA

[12] Elbashir SM, Lendeckel W, Tuschl T (2001a) RNA interference is mediated by 21- and

[13] Bernstein BE, Meissner A, Lander ES (2007) The mammalian epigenome. *Cell* 128: 669-681. [14] Wargelius A, Ellingsen S, Jose FA (1999) Double stranded RNA induces specific developmental defects in zebrafish embryos. *Biochemical and Biophysical Research* 

[15] Lohmann JU, Endl I, Bosch TC (1999) Silencing of developmental genes in

[16] Akashi H, Miyagishi M, Taira K (2001) Suppression of gene expression by RNA interference in cultured plant cells. *Antisense Nucleic Acid Drug Development* 11(6):359-367. [17] Elmayan T, Balzergue S, Beon F, Bourdon V, Daubremet J, Guenet Y, Mourrain P, Palauqui JC, Vernhettes S, Vialle T, Wostrikoff K, Vaucheret H (1998) *Arabidopsis*

[18] Thakur A (2003) RNA interference revolution. *Electronic Journal Of Biotechnology* 6 (1). [19] Wianny F, Zernicka-Goetz M (2000) Specific interference with gene function by double-

[21] Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T (2001b) Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. *Nature*

[22] Watson JM, Fusaro FM, Wang M, Waterhouse PM (2005) RNA silencing platforms in

[23] Matzke MA, Matzke AJ, Pruss GJ, Vance VB (2001) RNA-based silencing strategy in

[24] Tabara H, Grishok A, Mello CC (1998) RNAi in *C. elegans*: soaking in the genome

[25] Waterhouse PM, Wang MB, Lough T (2001) Gene silencing as an adaptive defence

[26] Meister G, Tuschl T (2004) Mechanisms of gene silencing by double-stranded RNA.

stranded RNA in early mouse development. *Nature Cell Biology* 2: 70-75. [20] Kreutzer R, Limmer S (2000) Kreutzer-Limmer Patent(1999/2000), Germany.

mediated cross-protection between viruses. *Plant Cell* 11: 1207-1216.

[9] Ruvkun G (2001) Glimpses of a tiny RNA world. *Science* 294 (5543): 797-799.

*Plant Molecular Biology* 20: 631-639.

interference in *C. elegans. Nature* 404: 296-298.

Hydra. *Developmental Biology* 214(1): 211-214.

*Communications* 263(1):156-161.

plants. *Febs Letters* 579: 5982-5987.

sequence. *Science* 282 (5388): 430–431.

against viruses. *Nature* 411: 834-842.

*Nature* 431: 343-349.

411: 494-498.

22-nucleotide RNAs. *Genes and Development* 15(2): 188–200.

mutants impaired in cosuppression. *Plant Cell* 10(10):1747-1758.

plants. *Current Opinion In Genetic Development* 11: 221-227.

1269.

#### **5. Conclusions**

The study of RNAi has revolutionized our understanding of gene expression. The examples from plant, animal and mammalian systems reviewed here show the diversity and potential of RNAi as a new approach that can replace classical genetic technologies and manipulation.

Today, RNAi strategies serve as a new tool for inexpensive screening of gene function in organisms for which a genetic approach has not yet been developed. Moreover, after 11 years of extensive research, RNAi has been demonstrated to function in mammalian cells to alter gene expression, and it is used as a means of genetic discovery as well as a possible strategy for genetic correction and genetic therapy in cancer and other diseases.

Finally, RNAi represents a significant tool for accomplishing these goals and will undoubtedly be used to address many other challenges in eukaryotic functional genomics.

In the future, combining RNAi with whole-genome sequencing could improve drug development success rates through better targets and more efficient RNAi platforms, and could keep waste to a minimum by treating only people who are genetically predicted to respond to the therapeutic.

#### **Author details**

Gabriela N. Tenea and Liliana Burlibasa *Department of Genetics, University of Bucharest, Romania*

#### **Acknowledgement**

GNT and LB were supported by The National Council of the Higher Education's Scientific Research (CNCSIS), IDEAS project: PNII 1958/2009.


[27] Kamath RS, Martinez-Campos M, Zipperlen P, Fraser AG, Ahringer J (2001) Effectiveness of specific RNA-mediated interference through ingested double-stranded RNA in *Caenorhabditis elegans*. *Genome Biology* 2(1): 2.1–2.10.

RNAi Towards Functional Genomics Studies 89

[48] Rodriguez A, Griffiths-Jones S, Ashurst JL (2004) Identification of mammalian microRNA host genes and transcription units. *Genome Research* 14: 1902-1910. [49] Tang G, Galili G (2004) Using RNAi to improve plant nutritional value: from

[50] Ying SY, Lin SL (2004) Intron derived microRNAs-fine tunning of gene functions. *Gene*

[51] Zhao Y, Srivastava D (2007) A developmental view of microRNA function. *Trends* 

[52] Lee RC, Feinbaum RL, Ambros V (1993) The *C. elegans* heterochronic gene lin-4 encodes

[53] Lagos-Quintana M, Rauhut R, Lendeckel W (2001) Identification of novel genes coding

[54] Lee RC, Ambros V (2001) An extensive class of small RNAs in *Caenorhabditis elegans.* 

[55] Ketting RF, Fischer SEJ, Bernstein E, Sijen T, Hannon GJ, Plasterk RHA (2001) Dicer functions in RNA interference and in synthesis of small RNA involved in

[56] Knight SW, Bass BL (2001) A role for the RNAse III enzyme DCR-1 in RNA interference

[57] Yekta S, Shih IH, Bartel DP (2004) MicroRNA-directed cleavage of HOXB8 mRNA*.* 

[58] Yang M, Mattes J (2008) Discovery biology and therapeutic potential of RNA interference microRNA and antagomirs. *Pharmacology and Therapeutics* 117: 94-104. [59] Baskerville S, Bartel DP (2005) Microarray profiling of microRNAs reveals frequent

[60] Du TT, Zamore PD (2005) microPrimer: the biogenesis and function of microRNA.

[61] Jones-Rhoades MW, Bartel DP (2004) Computational identification of plant microRNAs and their targets including a stress-induced miRNA. *Molecular Cell* 14: 787-799. [62] Llave C, Kasschau KD, Rector MA, Carrington JC (2002) Endogenous and silencing-

[63] Park W, Li J, Song R, Messing J, Chen X (2002) CARPEL FACTORY: a Dicer homolog and HEN1 a novel protein act in microRNA metabolism in *Arabidopsis thaliana. Current* 

[64] Peragine A, Yoshikawa M, Wu G, Albrecht HL, Poethig RS (2004) *SGS3* and *SGS2*/*SDE1*/*RDR6* are required for juvenile development and the production of *trans*-

[65] Vazquez F, Vaucheret H, Rajagopalan R, Lepers C, Gasciolli V, Mallory AC, Hilbert JL, Bartel DP, Crete P (2004) Endogenous trans-acting siRNAs regulate the accumulation of

[66] Aravin AA, Naumova NM, Tulin AV, Vagin VV, Rozovsky YM, Gvozdev VA (2001) Double-stranded RNA-mediated silencing of genomic tandem repeats and transposable

mechanisms to application. *Trends In Biotechnology* 22(9): 463-469.

small RNAs with antisense complementary to lin-14*. Cell* 75: 843-854.

developmental timing in *C. elegans. Genes and Development* 15: 2654-2659.

and germ line development in *Caenorhabditis elegans. Science* 293: 2269-2271.

coexpression with neighboring miRNAs and host genes. *RNA* 11: 241-247.

associated small RNAs in plants*. Plant Cell* 14: 1605-1619.

*Arabidopsis* mRNAs. *Molecular Cell* 16: 69–79.

acting siRNAs in *Arabidopsis*. *Genes and Development* 18: 2368–2379.

elements in the *D. melanogaster* germline. *Current Biololy* 11: 1017–1027.

342: 25-28.

*Science* 294: 862-864.

*Science* 304: 594-596.

*Development* 132: 4645-4652.

*Biology* 12: 1484-1495.

*Biochemistry Science* 32(4): 189-97.

for small express RNAs. *Science* 294: 853-858.


[48] Rodriguez A, Griffiths-Jones S, Ashurst JL (2004) Identification of mammalian microRNA host genes and transcription units. *Genome Research* 14: 1902-1910.

88 Functional Genomics

[27] Kamath RS, Martinez-Campos M, Zipperlen P, Fraser AG, Ahringer J (2001) Effectiveness of specific RNA-mediated interference through ingested double-stranded RNA in

[28] Waterhouse PM, Helliwell CA, (2003) Exploring plant genomes by RNA induced gene

[29] Agrawal NP, Dasaradhi VN, Asif M, Pawan M, Bhatnagar RK, Mukherjee SK, (2003) RNA interference: biology mechanism and applications. *Microbiology and Molecular* 

[31] Hammond SM, Bernstein E, Beach D, Hannon GJ (2000) An RNA-directed nuclease mediates post-transcriptional gene silencing in *Drosophila* cells. *Nature* 404 (6775): 293–296. [32] Hamilton AJ, Baulcombe DC (1999) A species of small antisense RNAi

[33] Zamore PD, Tuschl T, Sharp PA, Bartel DP (2000) RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. *Cell* 101(1): 25–33. [34] Voinnet O, Lederer C, Baulcombe DC (2000) Aviral movement protein prevents spread

[35] Rhoades MW, Reinhart BJ, Lim LP, Burge CB, Bartel B, Bartel DP (2002) Prediction of

[36] Fjose A, Drivenes O (2006) RNAi and microRNAs: from animal models to disease

[37] Okamura K, Ishizuka A, Siomi H, Siomi MC (2004) Distinct roles for argonaute proteins in small RNA-directed cleavage pathways. *Genes and Development* 18: 1655-1666. [38] Gregory RI, Chendrimada TP, Cooch N, Shiekhattar R (2005) Human RISC couples microRNA biogenesis and posttranscriptional gene silencing. *Cell* 123: 631-640. [39] Kim DH, Behlke MA, Rose SD, Chang MS, Choi S, Ross J (2005) Synthetic dsRNA Dicer substrates enhance RNAi potency and efficacy. *Nature Biotechnology* 23: 222-226. [40] Siolas D, Lerner C, Burchard J, Ge W, Linsley PS, Paddison PJ, Hannon GJ, Cleary MA (2005) Synthetic shRNAs as potent RNAi triggers. *Nature Biotechnology* 23: 227-231. [41] Doench JS, Petersen CP, Sharp PA (2003) siRNAs can function as miRNAs. *Genes* 

[42] Miska EA (2005) How microRNAs control cell division differentiation and death.

[43] Alvarez-Garcia I, Miska EA, (2005) MicroRNA functions in animal development and

[44] Schiebel W, Pelissier T, Riedel L, Thalmeir S, Schiebel R, Kempe D, Lottspeich F, Sanger HL, Wassenegger M (1998) Isolation of an RNA-directed RNA polymerase-specific

[46] Krutzfeldt J, Rajewsky N, Braich R, Rajeev KG, Tuschl T, Manoharan M, Stoffel M (2005) Silencing of microRNAs in vivo with "antagomirs". *Nature* 438: 685-689. [47] Metzalapff M (2005) Applications of RNAi in crop improvement *Pflanzenschutz-*

*Caenorhabditis elegans*. *Genome Biology* 2(1): 2.1–2.10.

[30] Baulcombe D (2004) RNA silencing in plants. *Nature* 43: 356-363.

posttranscriptional gene silencing in plants. *Science* 286: 950-952.

of the gene silencing signal in *Nicotiana benthamiana Cell* 103: 157-167.

silencing. *Nature Reviews: Genetics* 4: 29-38.

plant microRNA targets*. Cell* 110: 513-520.

*Development* 17: 438-442.

*Nachrichten Bayer* 58(1): 51-59.

therapy. *Birth Defects Research (part C)* 78: 150-171.

*Current Opinion Genetic Development* 15: 563-568.

cDNA clone from tomato. *Plant Cell* 10: 2087-2101.

[45] Cullen BR (2005) RNAi the natural way. *Nature Genetics* 37:1163-1165.

human disease. *Development* 132: 4653-4662.

*Biology Reviews* 67: 4 657–685.


[67] Reinhart BJ, Bartel DP (2002) Small RNAs correspond to centromere heterochromatic repeats. *Science* 297: 1831.

RNAi Towards Functional Genomics Studies 91

[85] Jones L, Ratcliff F, Baulcombe DC, (2001) RNA-directed transcriptional gene silencing in plants can be inherited independently of the RNA trigger and requires Met1 for

[86] Matthew L (2004) RNAi for plant functional genomics. *Comparative And Functional* 

[87] Hilson P, Allemeersch J, Altmann T, Aubourg S, Avon A, Beynon J, Bhalerao RP, Bitto F, Caboche M, Cannoot B, Chardakov V, Cognet-Holliger C, Colot V, Crowe M, Darimont C, Durinck S, Eickhoff H, Falcon De Longevialle A, Farmer EE, Grant M, Kuiper MTR, Lehrach H, Léon C, Leyva A, Lundeberg J, Lurin C, Moreau Y, Nietfeld W, Paz-Ares J, Reymond P, Rouzé P, Sandberg G, Dolores Segura M, Serizet C, Tabrett A, Taconnat L, Thareau V, Van Hummelen P, Vercruysse S, Vuylsteke M, Weingartner M, Weisbeek PJ, Wirta V, Wittink FRA, Zabeau M, Small I (2004) Versatile gene-specific sequence tags for *Arabidopsis* functional genomics: Transcript profiling and reverse

[89] Ifuku K, Yamamoto Y, Sato F (2003) Specific RNA interference in psbP genes encoded by multigene family in *Nicotiana tabacum* with a short 3'-untranslated sequence.

[90] Miki D, Itoh R, Shimamoto K (2005) RNA silencing of single and multiple members in a

[91] Yamamoto Y, Ifuku K, Sato F (2005) Suppression of psbP and psbQ genes in *Nicotiana tabacum* by RNA interference technique. In: van der Est A. Bruce D. (eds) *Photosynthesis: Fundamental Aspects to Global Perspectives*. Kluwers Academic Publisher The Netherlands

[92] Auer C, Frederick R (2009) Crop improvement using small RNAs: applications and

[93] Gheysen G, Vanholme B (2007) RNAi from plants to nematodes. *Trends Biotechnology* 25:

[94] Gordon KHJ, Waterhouse PM (2007) RNAi for insect-proof plants. *Nature Biotechnology*

[95] Sunilkumar G, Campbell LM, Puckhaber L, Stipanovic RD, Rathore KS (2006) Engineering cottonseed for use in human nutrition by tissue-specific reduction of toxic

[96] Wang M, Abbott D, Waterhouse PM (2000) A single copy of a virus derived transgene encoding hairpin RNA gives immunity to barley yellow dwarf virus. *Molecular Plant* 

[97] Kusaba M, Miyahara K, Lida S, Fukuoka H, Takario T, Sassa H, Mischimura M, Nischio T (2003) Low-glutein content 1: a dominant mutation that suppresses the glutein

[98] Voelker T, Kinney AJ (2001) Variations in the biosynthesis of seed storage lipids. *Annual*

[99] Knowlton S (1999) Soybean oil having high oxidative stability. US Patent 5981781.

predictive ecological risk assessments. *Trends Biotechnology* 27: 644-651.

gossypol. *Proceedings of the National Academy of Sciences* 103: 18054-18059.

multigene family via RNA silencing in rice. *Plant Cell* 15: 1455-1467.

*Review Plant Physiolology. Plant Molecular Biology* 52: 335–361.

maintenance. *Current Biology* 11: 747–757.

genetics applications. *Genome Research* 14: 2176-2189. [88] Invitrogen Gatewaytm Technology. (2004) *Quest* 1(2): 32-33.

*Bioscience Biotechnology Biochemistry* 67: 107-113.

gene family of rice. *Plant Physiology* 138: 1903-1913.

*Genomics* 5: 240-244.

pp 798-799.

25: 1231-1232.

*Pathology* 1: 401-410.

89-92.


1833–1837.

123–132.

repeats. *Science* 297: 1831.

testes. *Nature* 442: 203–207.

plants. *Plos Biology* 2: E104.

development. *Genes and Development* 18: 1187-1197.

mediated by a transgene but not by a virus. *Cell 101*: 543-553.

RNA interference in *C. elegans*. *Current Biology* 10: 169-178.

pathway. *FEBS Letters* 579: 5822-5829.

4360-4364.

homologous to RNAdependent RNA polymerase. *Nature 399*: 166-169.

[83] Sato F (2005) RNAi and functional genomics. *Plant Biotechnology* 22: 431-442.

fungi. *Nature* 404: 245.

[67] Reinhart BJ, Bartel DP (2002) Small RNAs correspond to centromere heterochromatic

[68] Volpe TA, Kidner C, Hall IM, Teng G, Grewal SIS, Martienssed RA (2002) Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi. *Science* 297:

[69] Mochizuki K, Gorovsky MA (2004) Small RNAs in genome rearrangement in

[70] Aravin A, Pfeffer GD, Lagos-Quintana M, Landgraf P, Iovino N, Morris P, Brownstein MJ, Kuramochi-Miyagawa S, Nakano T, Chien M, Russo JJ, Sheridan R, Sander C, Zavolan M, Tuschl T (2006) A novel class of small RNAs bind to MILI protein in mouse

[71] Catalanotto C, Azzalin G, Macino G, Cogoni C (2000) Gene silencing in worms and

[72] Tabara H, Sarkissian M, Kelly WG, Fleenor J, Grishok A, Timmons L, Fire A, Mello CC (1999) The RDE-1 gene RNA interference and transposon silencing in *C. elegans. Cell* 99:

[73] Xie Z, Johansen LK, Gustafson AM, Kasschau KD, Lellis AD, Zilberman D, Jacobsen SE, Carrington JC (2004) Genetic and functional diversification of small RNA pathways in

[74] Bohmert K, Camus I, Bellini C, Bouchez D, Caboche M, Benning C (1998) AGO1 defines a novel locus of *Arabidopsis* controlling leaf development*. EMBO Journal* 17: 170-180. [75] Vaucheret H, Vasquez F, Crete P, Bartel DP (2004) The action of ARGONAUTE1 in the miRNA pathway and its regulation by the miRNA pathway are crucial for plant

[76] Zilberman D, Cao X, Jacobsen SE (2003) ARGONAUTE 4 control of locus-specific siRNA accumulation and DNA and histone methylation. *Science* 299: 716-719. [77] Kosik KS (2006) The neuronal microRNA system *Nature Reviews. Neurosciences* 7: 911-920. [78] Dalmay T, Hamilton A, Rudd S, Angell S, Baulcombe DC, (2000) An RNA-dependent RNA polymerase gene in *Arabidopsis* is required for posttranscriptional gene silencing

[79] Cogoni C, Macino G (1999) Gene silencing in *Neurospora crassa* requires a protein

[80] Smardon A, Spoerke JM, Stacey SC, Klein ME, Mackin N, Maine EM (2000) EGO-1 is related to RNA-directed RNA polymerase and functions in germ-line development and

[81] Wu-Scharf D, Jeong B, Zhang C, Cerutti H (2000) Transgene and transposon silencing in *Chlamydomonas reinhardtii* by a DEAH-box RNA helicase. *Science 290*: 1159-1162. [82] Hammond SM (2005) Dicing and slicing: the core machinery of the RNA interference

[84] Tenea GN (2009) Exploring the world of RNA interference in plant functional genomics: a research tool for many biology phenomena. *Roumanian Biotechnological Letters* 14(3)

*Tetrahymena*. *Current Opinion in Genetic Development* 14: 181–187.


[100] Mroczka A, Roberts PD, Fillatti JJ, Wiggins BE, Ulmasov T, Voelker T (2010) An intron sense suppression construct targeting soybean FAD2-1 requires a double-stranded RNA-producing inverted repeat T-DNA insert. *Plant Physiology* 153: 882–891.

RNAi Towards Functional Genomics Studies 93

[117] Yu JY, Deruiter SL, Turner DL (2002) RNA interference by expression of short interfering RNAs and hairpin RNAs in mammalian cells *Proceedings of the National*

[118] Hannon GJ, Rossi JJ (2004) Unlocking the potential of the human genome with RNA

[119] Chi JT, Chang HY, Wang NN, Chang DS, Dunthy N, Brown PO (2003) Genome wide view of gene silencing by small interfering RNAs. *Proceedings of the National Academy of* 

[120] McCaffrey AP, Nakai H, Pandey K, Huang Z, Salazar FH, Xu H, Wieland SF, Marion PL, Kay MA (2003) Inhibition of hepatitis B virus in mice by RNA interference. *Nature* 

[121] Xia H, Mao Q, Eliason S, Harper SQ, Martins IH, Orr HT, Laulson HL, Yang L, Kotin RM, Davidson BL (2004) RNAi suppresses polyglutamine-induced neurodegeneration

[122] Palliser D, Chowdhury D, Wang Q, Lee SJ, Bronson RT, Knipe DM, Lieberman J (2006) An siRNA-based microbicide protects mice from lethal Herpex simplex virus 2 infection.

[123] Schiffelers RM, Xu J, Storm G, Woodle MC, Scaria PV (2005) Effects of treatment with small interfering RNA on joint inflammation in mice with collagen-induced arthritis.

[124] Ryazansky SS, Gvozdev VA (2008) Small RNAs and cencerogenesis. *Biokhimia* 73(5):

[125] Johnson SM, Grosshans H, Shingara J, Byrom M, Jarvis R, Cheng A, Labourier E, Reinert KL, Brown D, Slack FJ (2005) RAS is regulated by the let-7 microRNA family.

[126] O'Connell RM, Taganov KD, Boldin MP, Cheng G, Baltimore D (2007) MicroRNA-155 is induced during the macrophage inflammatory response. *Proceedings of the National*

[127] Cheng AM, Byrom MW, Shelton J, Ford L (2005) Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis.

[128] Janot G, Simard MJ (2006) Tumour-related microRNAs functions in *Caenorhabditis* 

[129] Kumar MS, Lu J, Mercer KL, Golub TR, Jacks T (2007) Impaired microRNA processing enhances cellular transformation and tumorigenesis. *Nature Genetics* 39: 673-677. [130] Calin GA, Ferracin M, Cimmino A, Di LG, Shimizu M, Wojcik SE, Iorio MV, Visone R, Sever NI, Fabbri M, Iuliano R, Palumbo T, Picchiorri F, Roldo C, Garzon R, Sevignani C, Rassenti L, Alder H, Volinia S, Liu CG, Kipps TJ, Negrini M, Croce CM (2005) A MicroRNA signature associated with prognosis and progression in chronic lymphocytic

[131] Saito Y, Liang G, Egger G, Friedman JM, Chuang JC, Coetzee GA, Jones PA (2006) Specific activation of microRNA-127 with downregulation of the proto-oncogene BCL6

by chromatin-modifying drugs in human cancer cells. *Cancer Cell* 9: 435-443.

leukemia. *The New England Journal of Medicine* 353: 1793-1801.

in a model of spinocerebellar ataxia. *Nature Medicine* 10: 816-820.

*Academy of Sciences* USA 99: 6047-6052.

interference. *Nature* 431: 371-378.

*Arthritis Rheumatology* 52: 1314-1318.

*Academy of Sciences* USA 104: 1604-1609.

*Nucleic Acids Research* 33: 1290-1297.

*elegans. Oncogene* 25: 6197-6201.

*Sciences* USA 100: 6343–6346.

*Biotechnology* 21: 639–644.

*Nature* 439: 89-94.

*Cell* 120: 635-647.

640-655.


[117] Yu JY, Deruiter SL, Turner DL (2002) RNA interference by expression of short interfering RNAs and hairpin RNAs in mammalian cells *Proceedings of the National Academy of Sciences* USA 99: 6047-6052.

92 Functional Genomics

*of Sciences* USA 108(1): 409-414.

virus X/potyviral synergism. *Virology* 231: 35-42.

replication of heterologous viruses*. Plant Cell* 9: 859-868.

*Genomics* 5: 240-244.

*Research* 102: 97-108.

*Plant Phatology* 5: 71-82.

*Development* 18: 1179-1186.

*Sciences* USA 95: 13079-13084.

RNA silencing. *Plant Cell* 16: 1235-1250.

transgene locus. *Plant Journal* 35: 82-92.

interference. *Nature* 418: 435–438.

*Biotechnology* 20: 500–505.

*Nature Medicine* 8: 681–686.

[100] Mroczka A, Roberts PD, Fillatti JJ, Wiggins BE, Ulmasov T, Voelker T (2010) An intron sense suppression construct targeting soybean FAD2-1 requires a double-stranded

[102] Matthew L (2004) RNAi for plant functional genomics. *Comparative And Functional* 

[103] Shi XM, Miller H, Verchot J, Carrington JC, Vance VB (1997) Mutation in the region encoding the central domain of helper component-proteinase (HC-Pro) eliminate potato

[104] Pruss G, Ge X, Shi XM, Carrington JC, Bowman Vance V (1997) Plant viral synergism: the potyviral genome encodes broad–range pathogenicity enhancer that transactivates

[105] Roth BM, Pruss GJ, Vance VB (2004) Plant viral suppressors of RNA silencing. *Virus* 

[106] Moissiard G, Voinnet O (2004) Viral suppression of RNA silencing in plants. *Molecular* 

[107] Chapman EJ, Prokhnevsky AI, Gopinath K, Dolja V, Carrington JC (2004) Viral RNA silencing suppressors inhibits the microRNA pathways at an intermediate step. *Genes* 

[108] Dunoyer P, Lecellier CH, Parizotto EA, Himber C, Voinnet O (2004) Probing the microRNA and small interfering RNA pathways with virus-encoded suppressors of

[109] Anandalakshmi R, Pruss GJ, Ge X, Marathe R, Mallory AC, Smith TH, Vance VB (1998) A viral suppressor of gene silencing in plants. *Proceedings of the National Academy of* 

[110] Kasschau KD, Carrington JC (1998) A counterdefensive strategy of plant viruses:

[111] Mallory AC, Mlotshwa S, Bowman LH, Vance VB (2003) The capacity of transgenic tobacco to send a systemic RNA silencing signal depends on the nature of the inducing

[113] Jacque JM, Triques K, Stevenson M (2002) Modulation of HIV-1 replication by RNA

[114] Lee NS, Dohjima T, Bauer G, Li H, Ehsani A, Salvaterra P, Rossi J (2002) Expression of small interfering RNAs targeted against HIV-1 *rev* transcripts in human cells. *Nature* 

[115] Novina CD, Murray MF, Dykxhoorn DM, Beresford PJ, Riess J, Lee SK, Collman RG, Lieberman J, Shankar P, Sharp PA (2002) siRNA-directed inhibition of HIV-1 infection.

[116] Ananthalakshmi P, Sutton R (2008) Titers of HIV-based vectors encoding shRNAs are

reduced by a Dicer-dependent mechanism. *Molecular Therapy* 16: 378-386.

suppression of posttranscriptional gene silencing. *Cell* 95: 461-470.

[112] Downward J (2004) RNA interference. *British Medical Journal* 328: 1245-1248.

RNA-producing inverted repeat T-DNA insert. *Plant Physiology* 153: 882–891. [101] Hoffer P, Ivashuta S, Pontesc O, Vitins A, Pikaard C, Mroczka A, Wagnera N, Voelker T (2011) Posttranscriptional gene silencing in nuclei. *Proceedings of the National Academy*


[132] Esteller M (2007) Cancer epigenomics: DNA methylomes and histone-modification maps. *Nature Review Genet*ics 8: 286-298.

**Chapter 5** 



## **Genome-Wide RNAi Screen for the Discovery of Gene Function, Novel Therapeutical Targets and Agricultural Applications**

Hua Bai

94 Functional Genomics

maps. *Nature Review Genet*ics 8: 286-298.

*Molecular Genetics* 13: R275-R288.

*Methods* 30: 313-321.

*Biology* 12: R852-R854.

USA 101: 6403-6408.

37: 911-924.

6: 443-453.

[132] Esteller M (2007) Cancer epigenomics: DNA methylomes and histone-modification

[133] Kanellopoulou C, Muljo SA, Kung AL, Ganesan S, Drapkin R, Jenuwein T, Livingston DM, Rajewsky K (2005) Dicer deficient mouse embryonic stem cell are defective in

[134] Buckingham SD, Esmaeili B, Wood M, Sattelle DB (2004) RNA interference: from model organisms towards therapy for neural and neuromuscular disorders. *Human* 

[135] Kamath RS, Ahringer J (2003) Genome-wide RNAi screening in *Caenorhabditis elegans.* 

[136] Eagle M, Boudouin SV, Chandler C, Giddings DR, Bullock R, Bushby K (2002) Survival in Duchenne muscular dystrophy: improvements in life expectancy since 1967 and the impact of home nocturnal ventilation. *Neuromuscular Disorders* 12: 926-929. [137] Goldman RD, Gruenbaum Y, Moir RD, Shumaker DK, Spann TP (2002) Nuclear lamins: building blocks of nuclear arhitecture. *Genes and Development* 16: 533-547. [138] Carthew RW (2002) RNA interference: the fragile X syndrome connection. *Current* 

[139] Nollen EA, Garcia SM, van Haaften G, Kim S, Chavez A, Morimoto RI, Plasterk RH (2004) Genome-wide RNA interference screen identifies previously undescribed regulators of polyglutamine aggregation. *Proceedings of the National Academy of Sciences* 

[140] Yang Y, Nishimura I, Imai Y, Takahashi R, Lu B (2003) Parkin suppresses dopaminergic neuron-selective neutotoxicuty induced by Pael-R in *Drosophila. Neuron*

[141] Caplen NJ, Taylor JP, Statham VS, Tanaka F, Fire A, Morgan RA (2002) Rescue of polyglutamine-mediated cytotoxicity by double-stranded RNA-mediated RNA

[142] Kononenko NI, Shao LR, Dudek FE (2004) Riluzole-sensitive slowly inactivating sodium current in rat suprachiasmatic nucleus neurons. *Journal of Neurophysiology* 91: 710-718. [143] Smith RA, Miller TM, Yamanaka K, Monia BP, Condon TP, Hung G, Lobsiger CS, Ward CM, Wei H, Wancewicz Bennett CF, Cleveland DW (2006) Antisense oligonucleotide therapy for neurodegenerative disease*The Journal of Clinical Investigation* 116: 2290-2296. [144] Miller M, Kaspar KP, Kops GJ, Yamanaka K, Christian LJ, Gage FH, Cleveland DW (2006) Virus–delivered small RNA silencing sustain strength in amyotrophic lateral

[145] de Fougerolles A, Vornlocher HP, Maraganore J, Lieberman J (2007) Interfering with disease: a progress report on siRNA-based therapeutics. *Nature Reviews Drug Discovery*

[146] Esau C, Davis S, Murray SF, Yu XX, Pandey SK, Pear M, Watts L, Booten S.L, Graham M (2006) miR-122 regulation of lipid metabolism revealed by in vivo antisense

interference. *Human Molecular Genetics* 11:175-184.

sclerosis. *Annual Neurology* 57: 773-776.

targeting. *Cell Metabolism* 3: 87-98

differentiation and centromeric silencing. *Genes and Development* 19: 489-501.

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49945

#### **1. Introduction**

The phenomenon of double-stranded RNA (dsRNA)-mediated gene silencing, or RNA interference (RNAi), was first discovered in the nematode *Caenorhabditis elegans* by Andrew Fire and Craig Mello in 1998 [1]. This discovery gave rise to a fast-growing field and led to the identification of novel RNAi pathways by which small interfering RNAs (siRNAs) regulate gene expression and gene function. Collective evidence suggests that the RNAi pathway is conserved in many eukaryotes and can be triggered by either exogenous or endogenous small interfering RNAs. Exogenous dsRNAs (e.g. from a virus with an RNA genome) typically require a membrane transporter for uptake into the cytoplasm, while endogenous small interfering RNAs (e.g. microRNAs) are encoded in the genome. The precursors of both dsRNAs and microRNAs are first cleaved into short interfering RNAs by Dicer, a ribonuclease III (RNase III) enzyme. These short interfering RNAs then initiate the RNAi process by interacting with Argonaute proteins in the RNA-induced silencing complex (RISC). The small interfering RNAs normally consist of 20 to 30 nucleotides. They can repress the expression of messenger RNAs containing homologous sequences through either post-transcriptional gene silencing (PTGS) or transcriptional gene silencing (TGS) [2].
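The sequence logic of this pathway can be illustrated with a small sketch. It is a toy model rather than anything taken from the studies cited here: it assumes that Dicer cuts one strand of a long dsRNA into fixed 21-nucleotide pieces and that a guide recognizes its target only through perfect base pairing. The function names (`reverse_complement`, `dice`, `find_targets`) and the 54-nucleotide transcript are hypothetical and chosen only for illustration.

```python
from typing import List

# Toy model only: fixed-size cuts and perfect base pairing stand in for
# Dicer processing and RISC target recognition, which are far more complex.
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (A-U and G-C pairing)."""
    return rna.translate(_COMPLEMENT)[::-1]


def dice(strand: str, size: int = 21) -> List[str]:
    """Cut a strand into consecutive, non-overlapping fragments of `size` nt.

    Assumption: a trailing piece shorter than `size` is simply discarded.
    """
    return [strand[i:i + size] for i in range(0, len(strand) - size + 1, size)]


def find_targets(guide: str, mrna: str) -> List[int]:
    """Return 0-based positions in `mrna` that pair perfectly with `guide`."""
    site = reverse_complement(guide)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# Hypothetical transcript: 6-nt flank + 42-nt targeted region + 6-nt flank.
region = "AUGGCUACCGUUAGCAUCGGAUUCAAGCUGACCUAGGUACCA"
mrna = "GGAUCC" + region + "AAUAAA"

# The antisense strand of the dsRNA trigger is complementary to the region.
antisense = reverse_complement(region)
guides = dice(antisense)

for guide in guides:
    print(guide, find_targets(guide, mrna))
```

Each diced guide maps back to a single site inside the targeted region, which is the sense in which silencing is sequence-specific. Real small RNAs vary in length (20 to 30 nucleotides, as noted above) and, in the case of microRNAs, often tolerate mismatches, so this sketch should be read as an intuition aid only.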

In their *Nature* paper, Fire and Mello wrote: 'Whatever their target, the mechanisms underlying RNA interference probably exist for a biological purpose'. Indeed, it has been shown that numerous cellular and physiological functions are linked to RNAi [1]. For example, viral defense has been proposed to be the primary function of RNAi in both plants and flies [3]. In plants, virus infection can trigger sequence-specific gene silencing [4].

© 2012 Bai, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Plant RNAi forms the basis of virus-induced gene silencing (VIGS), as shown by the genetic links between virulence and RNAi pathways [5-6]. On the other hand, endogenous microRNAs (about 1000 in the human genome) [7] play essential roles in controlling cellular functions. For example, early-discovered microRNAs such as *lin-4* and *let-7* of *C. elegans* were found to regulate developmental timing [8-9]. Following the identification of *let-7* in *C. elegans* and later in the fruit fly *Drosophila melanogaster* (hereafter referred to as '*Drosophila*'), it was soon realized that *let-7* belongs to a microRNA family that is conserved in many species. Besides regulating development, many microRNAs have been found to control key physiological processes, such as lipid metabolism [10] and insulin sensitivity [11]. Dysregulation of microRNAs may result in many human diseases. For example, a mutation in the seed region of miR-96 causes hereditary progressive hearing loss [12]. Some microRNAs have also been linked to cancer [13].

Genome-Wide RNAi Screen for the Discovery of

Gene Function, Novel Therapeutical Targets and Agricultural Applications 97


Soon after the discovery of dsRNA-mediated RNAi in 1998, gene silencing through RNAi was quickly developed into a powerful tool for functional genomics studies. Compared with forward genetics tools (e.g. EMS-induced mutagenesis screens), RNAi is an effective reverse genetics tool, especially for non-model organisms and mammalian systems in which genetics is difficult. The advantage of applying RNAi in functional studies became even more apparent when the whole-genome sequences of model organisms (e.g. *C. elegans* and *Drosophila*) were completed in the late 1990s and early 2000s [14-15]. In the post-genome era, with high-throughput platforms and innovative bioinformatics tools, many large-scale RNAi screens have been successfully applied to discover novel gene functions associated with many important aspects of biology, such as signal transduction [16], cell proliferation [17], metabolism [18], host-pathogen interactions [19] and aging [20-21]. Through these genome-wide RNAi screens, we have gained new insights into novel players in many key biological processes and into the complexity of cellular signaling networks. Cell-based and *in vivo* RNAi screens have been extensively reviewed in the past [22-23]. In this chapter, I will focus on the recent development of high-throughput RNAi screens for functional analysis in cultured cells and *in vivo* systems, as well as their applications in functional genomics and in the discovery of novel therapeutic drug and agricultural targets.

#### **2. RNAi screen methods**

Despite the challenges of off-target effects and false discovery during data analysis, genome-wide RNAi screens have benefitted from improved RNAi delivery methods, automated high-content imaging systems and robust statistical analysis [23]. In the following section, I will compare the different reagent delivery methods, read-out assays, and off-target effects in various systems and platforms. I will also provide several examples from recent studies using genome-wide RNAi screens in cultured cells.

#### **2.1. Cell-based RNAi screen**

Genome-wide RNAi screens in cultured cells or primary cells provide an opportunity to systematically interrogate gene function. Large-scale RNAi screens are now routinely performed in *Drosophila* and mammalian cultured cells, as well as in primary cells. RNAi screens with *Drosophila* and mammalian cells have already led to important discoveries in a wide variety of topics, including signal transduction, metabolism and cancer [22]. In general, a cell-based RNAi screen involves four major steps: (1) RNAi library selection; (2) incubation of appropriate cell lines with RNAi reagents that are pooled or individually arrayed into 96- or 384-well plates; (3) after additional treatments (if applicable), quantification of the specific read-out with an automated plate reader (e.g. changes in cell morphology or fluorescence and luminescence signals from study targets); (4) high-content image data analysis.
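The arraying step of a cell-based screen, in which reagents are individually placed into 96- or 384-well plates, can be illustrated with a short sketch. The reagent identifiers, the row-by-row filling order and the function names below are assumptions made for illustration only; real screens also reserve wells for controls, which this sketch ignores.

```python
import string


def well_id(index, n_cols=24):
    """Map a 0-based position on a plate to a well label such as 'A01'."""
    row, col = divmod(index, n_cols)
    return f"{string.ascii_uppercase[row]}{col + 1:02d}"


def array_library(reagents, n_rows=16, n_cols=24):
    """Assign each reagent to (plate number, well label), filling row by row.

    The defaults describe a 384-well plate (16 rows x 24 columns);
    use n_rows=8, n_cols=12 for a 96-well plate.
    """
    per_plate = n_rows * n_cols
    layout = {}
    for i, reagent in enumerate(reagents):
        plate, position = divmod(i, per_plate)
        layout[reagent] = (plate + 1, well_id(position, n_cols))
    return layout


# Hypothetical reagent identifiers, not taken from any real library.
reagents = [f"dsRNA_{i:05d}" for i in range(1000)]
layout = array_library(reagents)

print(layout["dsRNA_00000"])  # (1, 'A01')
print(layout["dsRNA_00384"])  # (2, 'A01'): first well of the second plate
```

Keeping an explicit reagent-to-well map of this kind is what later allows a plate-reader value to be traced back to the gene that was silenced in that well.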

96 Functional Genomics


**RNAi library and reagent delivery methods.** Since the completion of *C. elegans* genome sequencing in 1998 [14], more and more eukaryotic genomes have been sequenced, making it possible to produce whole-genome RNAi libraries for functional genomics studies. Typically, long dsRNAs are used for RNAi screens in *Drosophila* cells, while synthetic siRNAs or vector-based short hairpin RNAs (shRNAs) are commonly used for mammalian cells [24]. Several *Drosophila* cell lines (e.g. S2 and Kc167) can take up dsRNA directly without transfection reagents, which is a great advantage in high-throughput RNAi screens [25-26]. In mammalian cells, RNAi reagents are transiently transfected. Therefore, variation in transfection efficiency among cell lines and experimental replicates affects the knockdown efficiency, and the doubling time of the cells affects the duration of gene silencing. Compared with synthetic siRNAs, vector-based shRNA technology combined with viral delivery provides robust gene silencing for a longer period of time. In addition, vector-based shRNA approaches make it possible to build renewable and cost-effective RNAi libraries.

RNAi libraries are now available for many model organisms, especially *Drosophila* and mammalian cells (Table 1). These libraries normally contain RNAi reagents (dsRNAs, siRNAs and shRNAs) for all annotated genes in the genome, typically with several different dsRNAs or siRNAs for each gene. For example, the GeneNet™ Human 50K siRNA library from System Biosciences contains 200,000 siRNA templates targeting 47,400 human transcripts (~4 different siRNA sequences per transcript) (http://www.systembio.com/rnai-libraries). The *Drosophila* DRSC 2.0 RNAi collection contains 1-2 dsRNAs per gene, covering genes that encode proteins and non-coding RNAs (http://www.flyrnai.org). Besides genome-wide libraries, many pathway libraries or sub-libraries are available for silencing specific signaling pathways or multigene families (e.g. kinase and phosphatase libraries, G protein-coupled receptor libraries, apoptosis and cell cycle libraries).

**Read-out assays.** Together with luminescent and fluorescent reporter-based image analysis and improved high-content image processing packages (e.g. CellProfiler [27]), various read-outs are used in RNAi screens, including expression changes in target genes or proteins, post-translational modifications, metabolic processes, and changes in sub-cellular localization patterns. Most of these read-outs are based on measuring the intensity of luminescent or fluorescent reporters. Although RNAi screens for cell morphology (e.g. cytoskeletal organization and simple cell shape) have been reported [28-30], read-outs based on complex cell shape and structure remain problematic. Frequently, additional treatments are applied before the read-out assays in RNAi screens. These treatments can be drugs, pathogens, or stress inducers [31-33].
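Because most read-outs are reporter intensities measured per well, a simple way to see how candidate hits can be picked from one plate is to normalize each well against the plate as a whole. The sketch below uses a robust z-score (median and median absolute deviation), which is one common choice and an assumption here rather than the method of any particular screen cited in this chapter; the well values and the threshold of 3 are invented for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Return (x - median) / (1.4826 * MAD) for each value on a plate."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero: the plate shows no spread")
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [(v - med) / scale for v in values]


def call_hits(readout_by_well, threshold=3.0):
    """Return wells whose robust z-score is at least `threshold` from zero."""
    wells = list(readout_by_well)
    scores = robust_z_scores([readout_by_well[w] for w in wells])
    return {w: z for w, z in zip(wells, scores) if abs(z) >= threshold}


# Invented reporter intensities for six wells of one plate.
plate = {"A01": 100.0, "A02": 102.0, "A03": 98.0,
         "A04": 101.0, "A05": 99.0, "A06": 40.0}

hits = call_hits(plate)
print(sorted(hits))  # only well A06 passes the threshold
```

The median-based statistics are used here because a few strong hits on a plate would otherwise inflate the mean and standard deviation and mask themselves.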




**Off-target effects.** False-positive and false-negative results are common in high-throughput studies, including genome-wide RNAi screens [34]. False discovery in RNAi screens can be caused by instrument errors, statistical noise, low knockdown efficiency and off-target effects of RNAi reagents. Off-target effects usually include (1) general interference with the endogenous RNAi pathway, or (2) sequence-dependent effects on the expression of non-target genes. To minimize off-target effects, it is advisable to perform sufficient replicate experiments and to choose two or more RNAi reagents that target different regions of the coding sequence. Sequence-dependent off-target effects can be avoided by selecting sequences that do not share 19 or more base pairs of contiguous nucleotide identity with other genes in the genome [34-35].
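The 19-base-pair rule for sequence-dependent off-target effects can be expressed as a k-mer lookup: a reagent is flagged if any stretch of 19 contiguous nucleotides also occurs in a transcript other than its intended target. The following is a minimal sketch under stated assumptions: sequences are given as DNA strings, both strands of the reagent are checked because dsRNA is double-stranded, and the toy sequences and gene names are invented. It is not the procedure of any specific design tool.

```python
def kmers(seq, k=19):
    """Return the set of all contiguous k-nucleotide windows in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G and T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def off_target_genes(reagent, transcripts, target, k=19):
    """List non-target genes sharing at least k contiguous nt with the reagent."""
    probe = kmers(reagent, k) | kmers(reverse_complement(reagent), k)
    flagged = []
    for gene, seq in transcripts.items():
        if gene != target and probe & kmers(seq, k):
            flagged.append(gene)
    return sorted(flagged)


# Invented toy sequences; geneB shares the first 19 nt of the reagent.
shared = "ATGGCTAGCTAGGATCCGATCGTA"
transcripts = {
    "geneA": "CCCC" + shared + "GGGGTTTT",          # intended target
    "geneB": "TTTTAAAA" + shared[:19] + "CACACACA",
    "geneC": "ACGT" * 10,
}
reagent = shared

print(off_target_genes(reagent, transcripts, target="geneA"))        # ['geneB']
print(off_target_genes(reagent, transcripts, target="geneA", k=20))  # []
```

A set of k-mers is used so that each transcript is compared by set intersection instead of repeated substring searches; for a whole genome, an index built once over all transcripts would serve the same purpose.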

**Recent cell-based RNAi screen studies.** Genome-wide RNAi screens have been conducted primarily in *Drosophila* and mammalian cultured cells (reviewed in [22]). These screens span a variety of biological processes, such as signal transduction, metabolism, cancer and stem cells. Cell-based RNAi screens have yielded a tremendous number of novel discoveries and greatly advanced our understanding of many basic biological processes, molecular functions and the complexity of cellular networks. Findings from genome-wide RNAi screens include novel components of canonical signal transduction pathways, such as new players in the ERK pathway [16] and phosphorylation networks regulating the JUN NH2-terminal kinase (JNK) pathway [36]; the role of the *S1pr2* gene (sphingosine-1-phosphate receptor 2) in insulin signaling [37]; novel modulators of the p53 pathway [38]; and key genes that are essential for the proliferation of cancer cells [17]. Recently, an integrative approach combining an RNAi screen with whole-genome structural analysis identified the IKBKE kinase as a breast cancer oncogene [39]. Besides their application in studying signaling pathways, cell-based RNAi screens are also performed to understand cellular responses to pathogens. For example, recent RNAi screens identified novel host factors required for dengue virus propagation [40] and influenza virus replication [41]. I will discuss some of these genome-wide RNAi screens in more detail in section 3.

| **Name** | **Species** | **Type** | **Link** |
|---|---|---|---|
| *Genome-wide RNAi libraries* | | | |
| Open Biosystems | Human, mouse | siRNA, shRNA | www.openbiosystems.com/RNAi |
| Sigma | Human, mouse | siRNA, shRNA | www.sigmaaldrich.com |
| SBI | Human, mouse | siRNA, shRNA | http://www.systembio.com/rnai-libraries |
| Ambion | Human, mouse | siRNA, shRNA | www.invitrogen.com/sirna |
| The Netherlands Cancer Institute (NKI) | Human | shRNA | www.lifesciences.sourcebioscience.com/ |
| Drosophila RNAi Screen Center | Fruit fly | dsRNA | www.flyrnai.org/ |
| DKFZ Genome RNAi | Fruit fly | dsRNA | www.genomernai.org |
| *Pre-defined or custom RNAi libraries* | | | |
| Qiagen | Human, mouse | siRNA, shRNA | www.qiagen.com |
| Dharmacon | Human, mouse | siRNA, shRNA | www.dharmacon.com |

**Table 1.** List of RNAi libraries used in cell cultures

#### **2.2.** *In vivo* **RNAi screen**

Many complex phenotypes, such as aging, cannot be tested in a cell-based assay, so *in vivo* functional analysis is required. *In vivo* RNAi screening is one approach to studying gene function at the organism level. *C. elegans* and *Drosophila* are two model organisms that are commonly used in *in vivo* RNAi screens. Although *ex vivo* RNAi screens have been done by introducing shRNA-transfected cells into mice [42-43], direct *in vivo* RNAi screening in mice is still under development. Most importantly, *in vivo* RNAi makes gene function studies possible in species lacking classical genetic tools, even in species without whole-genome sequences (in such cases, next-generation sequencing is usually used to identify gene coding sequences and to facilitate RNAi reagent design and production).

*In vivo* genome-wide RNAi screening was first reported in *C. elegans* [44]. In *C. elegans*, dsRNA-mediated RNAi effects are systemic and heritable [1], although gene knockdown is less efficient in the nervous system than in other tissues. *In vivo* genome-wide RNAi screens have been performed in studies of aging [20-21, 45-46], metabolism [47] and microRNA pathways [48] in *C. elegans*. It is relatively easy to deliver dsRNA into *C. elegans*. Typically, dsRNAs are introduced into worms by soaking the animals in dsRNA solution [49], by injecting dsRNA [1], or by feeding the animals dsRNA-expressing bacteria [44]. The last method is commonly used to generate genome-wide RNAi libraries. These *C. elegans* RNAi libraries are now available from the Ahringer lab RNAi collection and Open Biosystems (Table 2).

In contrast, dsRNA feeding does not appear to work for gene silencing in *Drosophila*, and RNAi via injection of dsRNA is effective only at certain embryonic stages. Therefore, a transgenic RNAi approach has been developed in which a double-stranded 'hairpin' RNA is expressed from a transgene. In *Drosophila*, RNAi is cell-autonomous, so gene silencing can easily be performed in a tissue-specific and spatially restricted manner using the binary GAL4/UAS expression system. Currently, three groups have generated genome-wide transgenic RNAi *Drosophila* strains [50] (Table 2). These transgenic RNAi strains all express inverted-repeat hairpin RNAs once crossed to appropriate GAL4 drivers. Recently, it has been shown that small hairpin RNAs (~19 nt) can trigger stronger gene inactivation than long hairpin RNAs. New constructs expressing small hairpin RNAs were therefore generated to produce a second generation of transgenic RNAi *Drosophila* strains at Harvard Medical School (Table 2). In the past several years, a number of genome-wide RNAi screens in *Drosophila* have been conducted to study the major signaling pathways [51], as well as many important disease models [52-55].

For example, a genome-wide obesity gene screen revealed hedgehog signaling as one of the major regulators of adipose tissue [18], while a genome-wide Parkinson's disease modifier screen identified novel *Park*- and/or *Pink1*-interacting genes [56]. In the following section, I will discuss some of these RNAi screens in more detail.

#### **2.3. Advantages and limitations of cell-based and** *in vivo* **RNAi screens**

Unlike forward genetic screens, in which mutations are randomly generated, RNAi screens provide a fast way to link a phenotype of interest to a specific gene. In addition, RNAi screens are generally performed on a genome-wide scale, which gives a comprehensive view of gene function. Both cell-based and *in vivo* RNAi screens are highly effective and less labor-intensive for discovering gene functions than traditional mutagenesis screens. Furthermore, cell-based and *in vivo* RNAi can be applied to study gene function in species lacking classical genetic tools.

In the past decade, genome-scale *in vitro* RNAi screens have been successfully applied to gene discovery and to understanding fundamental biological processes and cellular signaling pathways. A variety of RNAi libraries for cell-based screens have been developed for both invertebrate and vertebrate systems. Cell-based RNAi screens are now relatively inexpensive and have become a fast and user-friendly platform for functional genomics studies. On the other hand, complex phenotypes that cannot be analyzed in cell-based RNAi screens are normally studied directly *in vivo*. In forward genetic screens, mutations are present in every cell and many mutations lead to developmental defects or lethality; in contrast, *in vivo* RNAi screens can be performed at various developmental stages and in different tissues. This is especially useful for studying adult-specific functions of genes that are also essential for development. Currently, most *in vivo* RNAi screens are conducted in *C. elegans* and *Drosophila* because of the tremendous amount of resources and advanced genetic tools available. In contrast, *in vivo* RNAi screening in mice is still at an early stage.

#### **3. Application of RNAi screen**

Genome-wide RNAi screens have greatly advanced our understanding of many fundamental biological problems, from signal transduction pathways to complex phenotypes. Furthermore, the results of RNAi screens can be used to design future therapeutic drugs and crop protection reagents. In the following section, I will discuss several applications of RNAi screen approaches.

#### **3.1. Deciphering cellular signaling pathways and complex traits**

RNAi is one of the most powerful tools in functional genomics studies. Genome-wide RNAi screens have accelerated our understanding of basic biological functions and cellular signaling networks, as well as of novel modulators of diseases. Follow-up experiments are usually performed to confirm the screen results and to further study the molecular mechanisms underlying the identified genes or pathways. *In vivo* RNAi screens have also been conducted to study complex traits, such as aging. *C. elegans* is the primary model organism used in longevity gene screens, not only because high-throughput RNAi experiments are relatively easy to do in *C. elegans*, but also because of its short lifespan [57].

*3.1.1. Understanding signaling pathways*

Our understanding of canonical signaling pathways is rapidly evolving, and many new components or modulators are being identified with the help of improved technologies, including genome-wide RNAi screens. In the past decade, RNAi screens have been applied to decipher many classical signaling pathways, such as Notch, Wnt, and ERK signaling. Receptor tyrosine kinases (RTKs) are probably one of the most critical protein families that regulate development, cell proliferation and growth. Insulin signaling, which acts through one of the RTK families, plays important roles in controlling metabolism and growth. Disrupted insulin signaling leads to many human diseases, such as diabetes. To elucidate the underlying mechanism of diabetes and identify novel components and modulators of the insulin signaling pathway, an RNAi screen was conducted using 3T3-L1 adipocytes [37]. About 313 obesity- and diabetes-related genes were selected for the screen. The release of free fatty acids (FFA) was used as a read-out, since insulin-dependent FFA release is an indicator of insulin resistance. The screen showed that RNAi against 126 candidate genes resulted in significant changes in FFA release. After further filtering, the *S1pr2* gene was identified as one of the key regulators of insulin signaling. Increased plasma insulin levels were detected in male *S1pr2* -/- knockout mice, suggesting a potential link between *S1pr2* and insulin resistance [37].

One of the RTK downstream effectors is the ERK signaling pathway. Misregulated RTK/ERK signaling leads to developmental disorders and many human diseases (e.g. cancer). A recent RNAi screen using *Drosophila* cell lines identified 331 regulators of the ERK pathway, suggesting that a number of integrated signaling pathways contribute to the fine-tuning of ERK signaling [16]. In this study, fluorescently conjugated phospho-ERK antibodies were used to monitor changes in phosphorylated ERK upon insulin stimulation, which was the first time that an RNAi screen used post-translational modification as a read-out assay.

100 Functional Genomics

**Ahringer lab RNAi** 

**Transgenic RNAi project at Harvard Medical School**

**Vienna Drosophila** 

some of these RNAi screens in more detail.

**Table 2.** List of RNAi libraries used for *in vivo* systems

species lacking classical genetics tools.

For example, a genome-wide obesity gene screen revealed hedgehog signaling as one of major adipose tissue regulators [18], while genome-wide Parkinson's disease modifier screen identified novel *Park* and/or *Pink1*-interacting genes [56]. In the following section, I will discuss

**Name Species Type Link** 

**Library at Geneservice** Nematode Bacterial clone www.lifesciences.sourcebiosc

**Open Biosystems** Nematode Bacterial clone www.openbiosystems.com/R

**NIG-FLY** Fruit fly Long dsRNA http://www.shigen.nig.ac.jp/f

Unlike forward genetic screens where mutations are randomly generated, RNAi screens provide a fast way to link phenotypes of interest to a precise gene. Beside, RNAi screens are generally performed in a genome-wide scale which brings us a comprehensive view of gene functions. Both cell-based and *in vivo* RNAi screens are highly effective and less laborintensive on the discovery of gene functions when compared to traditional mutagenesis screens. Furthermore, cell-based and *in vivo* RNAi can be applied to study gene functions in

In the past decade, genome-scale *in vitro* RNAi screens have been successfully applied for gene discovery and understanding fundamental biological processes and cellular signal pathways. A variety of RNAi libraries for cell-based RNAi screens have been developed for both invertebrate and vertebrate systems. Nowadays cell-based RNAi screens are relatively less expensive, and have become a fast and user-friendly platform for functional genomics studies. On the other hand, complex phenotypes that cannot be analyzed in cell-based RNAi screens, are normally directly studied *in vivo*. When compared to forward genetic screens where mutations are occurred in every cell and many mutations lead to developmental defect or lethality, *in vivo* RNAi screens can be performed in various developmental stages and different tissues. This is especially useful when adult-specific functions of target genes are studied and these genes are essential for the development. Currently, most of *in vivo* RNAi screens are conducted in *C. elegans* and *Drosophila* due to the availability of a tremendous amount of resources

and advanced genetic tools. In contrast, *in vivo* RNAi screen in mice is still in early stage.

Short shRNA

Fruit fly Long dsRNA,

**RNAi Center** Fruit fly Long dsRNA stockcenter.vdrc.at

**2.3. Advantage and limitation of cell-based and** *In vivo* **RNAi screens** 

ience.com/

HOME.html

ly/nigfly/

www.flyrnai.org/TRiP-

NAi/

Genome-wide RNAi screens have greatly advanced our understanding of many fundamental biological problems, from signal transduction pathways to complex phenotypes. Furthermore, the results of RNAi screens can be used to design future therapeutic drugs and crop protection reagents. In the following section, I will discuss several applications of RNAi screen approaches.

#### **3.1. Deciphering cellular signaling pathways and complex traits**

RNAi is one of the most powerful tools in functional genomics studies. Genome-wide RNAi screens have accelerated our understanding of basic biological functions and cellular signaling networks, and have uncovered novel modulators of diseases. Follow-up experiments are usually performed to confirm the screen results and to further study the molecular mechanisms underlying the identified genes or pathways. *In vivo* RNAi screens have also been conducted to study complex traits, such as aging. *C. elegans* is the primary model organism used in longevity gene screens, not only because high-throughput RNAi experiments are relatively easy to perform in *C. elegans*, but also because of its short lifespan [57].

#### *3.1.1. Understanding signaling pathways*

Our understanding of canonical signaling pathways is rapidly evolving, and many new components or modulators are being identified with the help of improved technologies, including genome-wide RNAi screens. In the past decade, RNAi screens have been applied to decipher many classical signaling pathways, such as Notch, Wnt, and ERK signaling. Receptor tyrosine kinases (RTKs) are probably one of the most critical protein families regulating development, cell proliferation and growth. Insulin signaling, one of the RTK pathways, plays important roles in controlling metabolism and growth. Disrupted insulin signaling leads to many human diseases, such as diabetes. To elucidate the underlying mechanism of diabetes and to identify novel components and modulators of the insulin signaling pathway, an RNAi screen was conducted using 3T3-L1 adipocytes [37]. About 313 obesity- and diabetes-related genes were selected for the RNAi screen. The release of free fatty acid (FFA) was used as a read-out, since insulin-dependent FFA release is an indicator of insulin resistance. The screen showed that RNAi against 126 candidate genes resulted in significant changes in FFA release. After further filtering, the *S1pr2* gene was identified as one of the key regulators of insulin signaling. Increased plasma insulin levels were detected in male *S1pr2*-/- knockout mice, suggesting a potential link between *S1pr2* and insulin resistance [37].

One of the downstream effectors of RTKs is the ERK signaling pathway. Misregulated RTK/ERK signaling leads to developmental disorders and many human diseases (e.g. cancer). A recent RNAi screen using *Drosophila* cell lines identified 331 regulators of the ERK pathway, suggesting that a number of integrated signaling pathways contribute to the fine-tuning of ERK signaling [16]. In this study, fluorescently conjugated phospho-ERK antibodies were used to monitor changes in phosphorylated ERK upon insulin stimulation; this was the first time that an RNAi screen used a post-translational modification as a read-out.

#### *3.1.2. Identification of longevity genes*

Aging is a complex trait controlled by a large number of genes or by interactions between multiple genes and pathways. The first longevity pathway, the insulin/IGF-1 pathway, was identified in *C. elegans* in the early 1990s [58]. Subsequent studies have shown that TOR (Target of Rapamycin) [59] and AMP kinase signaling [60] are also involved in longevity regulation. To explore other potential longevity assurance genes and pathways, two systematic RNAi screens for longevity genes were conducted independently, and at almost the same time, in *C. elegans* [45-46]. Both groups used the Ahringer bacterial RNAi library, although they chose different worm strains. Because a large-scale lifespan screen is extremely labor-intensive, only the maximum lifespan of each RNAi clone was monitored initially. In one screen, 89 genes were found to be involved in lifespan regulation. These candidate genes have diverse biological functions, including metabolism, mitochondrial function, signal transduction, and protein turnover. In contrast, 29 genes were identified in the other screen. Although both groups were able to rediscover genes in insulin/IGF-1 signaling, only three newly identified genes were shared by the two screens. This may be due to the different worm strains used, together with a high rate of false-positive and false-negative hits and off-target effects. Knockdown of these three genes led to robust lifespan extension [57].
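To put the small overlap between the two screens in perspective, the following Python sketch asks how many shared hits would be expected by chance if the 29 hits of one screen were drawn at random from the library. It is only an illustration under stated assumptions: the library size of 16,000 clones is a placeholder (the number of clones screened is not given in this chapter), the function name is hypothetical, and the calculation treats all hits alike rather than separating the newly identified genes from the rediscovered insulin/IGF-1 genes.

```python
from math import comb


def overlap_tail_probability(population, hits_a, hits_b, shared):
    """Hypergeometric tail: P(overlap >= shared) when hits_b genes are drawn
    at random from a population containing hits_a marked genes."""
    total = comb(population, hits_b)
    upper = min(hits_a, hits_b)
    tail = sum(
        comb(hits_a, k) * comb(population - hits_a, hits_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


LIBRARY_SIZE = 16000  # ASSUMPTION: placeholder, not a figure from the chapter
HITS_SCREEN_1 = 89    # from the text
HITS_SCREEN_2 = 29    # from the text
SHARED = 3            # from the text

# Mean overlap expected if the two hit lists were unrelated.
expected_overlap = HITS_SCREEN_1 * HITS_SCREEN_2 / LIBRARY_SIZE
p_value = overlap_tail_probability(
    LIBRARY_SIZE, HITS_SCREEN_1, HITS_SCREEN_2, SHARED
)

print(f"expected overlap by chance: {expected_overlap:.2f} genes")
print(f"P(overlap >= {SHARED}) under random draws: {p_value:.2e}")
```

Under these assumptions, fewer than one shared gene would be expected by chance, so an overlap of three is small in absolute terms but still more than random agreement would predict. The conclusion depends on the assumed library size and should be recomputed with the actual number of clones screened.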


Although genome-wide RNAi screens for longevity genes have not been reported in other species, large-scale genetic screens have been performed recently. A screen of 564 single-gene deletion strains was conducted to identify longevity genes in budding yeast [59]. Deletion of 10 genes extended replicative lifespan. Many of these genes are involved in the TOR pathway, suggesting a link between nutrient sensing and longevity. More recently, a large-scale misexpression screen for *Drosophila* longevity genes was reported. This screen identified a total of 15 longevity genes, including genes involved in autophagy, mRNA synthesis, intracellular vesicle trafficking and neuroendocrine regulation [61]. As longevity gene screens are carried out in more species, a cross-species comparison of these large-scale screens may yield a list of conserved genes and pathways that regulate longevity.

#### **3.2. Identification of therapeutic drug targets**

A number of genome-wide RNAi screens have led to the identification of novel modulators of human diseases. RNAi screens have become an effective tool to identify and validate drug targets and to advance drug discovery. In addition, RNAi-based therapies using specific shRNAs have been developed to target viral infection, cancer, cardiovascular disease and neurodegenerative diseases. We should keep in mind, however, that this technology has drawbacks and raises concerns, such as off-target effects, activation of endogenous RNAi pathways, and individual genetic variation.

Traditional cancer drug discovery still focuses on a handful of known oncogenes, and identifying new targets remains a key challenge. The application of genome-wide RNAi screens to target discovery has greatly enhanced cancer drug discovery. In the past few years, several RNAi screens have been performed in cancer cell lines to find the genes required for the survival and proliferation of cancer cells. One of these studies identified more than 250 genes that are essential for the proliferation of cancer cells [17]. In the same study, four genes were implicated in the response of cancer cells to tumoricidal agents (e.g. imatinib). Several similar screens on drug sensitivity have led to the identification of cancer-associated genes, e.g. *ACRBP*, *TUBGCP2*, and *MAD2* in breast cancer [62]. By combining an RNAi screen with other genomic tools (e.g. SNP array, aCGH, and SAGE data analysis), a recent study identified the kinase IKBKE as a breast cancer oncogene [39]. This discovery could lead to the development of pharmaceutical inhibitors that block IKBKE activity in breast cancer.

Metabolic disorders such as obesity can increase the risk of developing cardiovascular disease and diabetes, but the underlying molecular mechanisms are far from clear. To understand how adiposity is regulated in *Drosophila*, an *in vivo* genome-wide RNAi screen was recently reported. In this study, transgenic RNAi lines corresponding to 10,489 distinct open reading frames were screened. The 500 candidates identified in the first screen were then tested by tissue-specific gene inactivation. The study revealed hedgehog signaling as one of the major adipocyte regulators [18]. To test whether hedgehog signaling plays a role in mammalian adipose tissue, mutant mice were generated in which hedgehog signaling is activated in adipocytes. Activation of hedgehog signaling in mouse adipose tissue resulted in a dramatic loss of white, but not brown, fat compartments by directly blocking the differentiation of white adipocytes. These results support the idea that white and brown adipocytes are derived from distinct precursor cells. Interestingly, glucose tolerance and insulin sensitivity remained normal in the mutant mice. This suggests that modulating hedgehog signaling can reduce lipid accumulation in white adipose tissue while maintaining fully functional brown adipose tissue. Since functional brown adipose tissue has been suggested as a potent therapeutic target for obesity control, novel adiposity regulators (e.g. hedgehog signaling) may be developed as obesity drug targets in the near future.

#### **3.3. Agricultural applications**


It was initially believed that systemic RNAi was unique to worms, until the idea was tested in insect species other than *Drosophila*. The first systemic RNAi in insects was reported in the red flour beetle, *Tribolium castaneum* [63]. Injection of dsRNA targeting the bristle-forming gene *Tc-achaete-scute* (*Tc-ASH*) resulted in a bristle-loss phenotype. Following this study, systemic RNAi via dsRNA injection was found to work in many insect species, including mosquitoes [64], honey bees [65], aphids [66], and termites [67]. RNAi has therefore become a useful tool for functional genomics studies in many non-model insect species, especially economically important ones.

In 2007, two breakthrough studies described pest control by feeding insects on transgenic plants expressing dsRNA [68-69]. These studies provided the first evidence that RNAi can be used as a pest control strategy for crop protection, and that feeding RNAi works in certain insect groups just as it does in worms. In these studies, a list of potential target genes was first chosen, and dsRNAs against the target genes were synthesized *in vitro* and mixed with artificial diet. RNAi against several target genes resulted in larval growth arrest and lethality. Next, transgenic plants were engineered to produce dsRNA against genes whose inactivation produced a strong RNAi response. Such genes include V-type ATPase A in the western corn rootworm and cytochrome P450 (CYP6AE14) in the cotton bollworm. These results provide strong evidence for the feasibility of using RNAi in pest control and crop protection. Recently, feeding RNAi was also demonstrated in termites [67]. Feeding on cellulose disks soaked with dsRNA against a digestive cellulase enzyme and a hexamerin storage protein reduced termite fitness and increased mortality. This study opened a new route to termite control by combining feeding RNAi with a bait system. Although RNAi-based pest control is still at an early stage and is not as effective as current crop protection technology (e.g. *Bacillus thuringiensis* (Bt) toxin), RNAi will provide an alternative strategy for future pest management.


*T. castaneum* GPCR prediction through BLAST search → dsRNA design and synthesis (300-500 bp) → injection of dsRNA into *T. castaneum* larvae → phenotypic data collection (developmental arrest and ecdysis failure) → selection of positive hits for a second round of RNAi screening

**Figure 1.** The outline of the GPCR RNAi screen in *T. castaneum*


#### **3.4. A case study: Large-scale GPCR RNAi screen for novel pesticide target discovery**

G protein-coupled receptors (GPCRs) form the largest superfamily of integral cell membrane proteins and play crucial roles in physiological processes including behavior, development and reproduction. About 1-2% of all genes in an insect genome code for GPCRs. Whole-genome sequencing identified about 200 GPCRs in *Drosophila* and 276 GPCRs in the African malaria mosquito, *Anopheles gambiae*. Currently, no commercial insecticide targets GPCRs. The red flour beetle, *T. castaneum*, is a worldwide pest of stored products. Its genome was sequenced in 2008 [70], which offers great opportunities for functional genomics studies and for the identification of pest control targets. In one recent study [71], 111 non-sensory GPCRs were annotated from the *T. castaneum* genome. To discover GPCRs that could serve as pesticide targets, a large-scale RNAi screen was performed by injecting dsRNA into developing larvae. The outline of this study is shown in Figure 1. In this study, eight GPCRs were found to be involved in larval growth, molting and metamorphosis. The identified GPCRs may serve as potential insecticide targets for controlling *T. castaneum* and other related pest species.

In this GPCR RNAi study [71], the 111 annotated *T. castaneum* GPCRs were classified into four families using a conserved-domain prediction program: Class A, Rhodopsin-like receptors; Class B, Secretin receptor-like; Class C, Metabotropic glutamate receptor-like; and Class D, Atypical GPCRs. In summary, there are 74 Rhodopsin-like GPCRs, 19 Secretin receptor-like GPCRs, 11 Metabotropic glutamate receptor-like GPCRs, and 7 Atypical GPCRs. The Rhodopsin-like GPCR family contains 20 biogenic amine receptors, 42 peptide receptors, four glycoprotein hormone receptors and one purine receptor.
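The family counts above can be checked with a few lines of Python. The sketch below simply tallies the numbers quoted in this paragraph; the dictionary and variable names are illustrative and do not come from the study [71].

```python
# Counts quoted in the text for the four GPCR classes in T. castaneum.
gpcr_classes = {
    "Class A (Rhodopsin-like)": 74,
    "Class B (Secretin receptor-like)": 19,
    "Class C (Metabotropic glutamate receptor-like)": 11,
    "Class D (Atypical)": 7,
}

# Class A subfamilies that the text itemizes.
class_a_subfamilies = {
    "biogenic amine receptors": 20,
    "peptide receptors": 42,
    "glycoprotein hormone receptors": 4,
    "purine receptors": 1,
}

total_gpcrs = sum(gpcr_classes.values())
itemized_class_a = sum(class_a_subfamilies.values())
unitemized_class_a = gpcr_classes["Class A (Rhodopsin-like)"] - itemized_class_a

print(f"total annotated GPCRs: {total_gpcrs}")
print(f"Class A receptors itemized by subfamily: {itemized_class_a}")
print(f"Class A receptors not itemized in the text: {unitemized_class_a}")
```

The four classes add up to the 111 annotated GPCRs. The four listed Rhodopsin-like subfamilies account for 67 of the 74 Class A receptors, so seven Class A receptors are not assigned to a subfamily in this summary.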

A large-scale GPCR RNAi screen was then conducted by injecting dsRNA for the 111 *T. castaneum* GPCRs into one-day-old final instar larvae. Mortality and developmental defects of the dsRNA-injected insects were recorded every 2-3 days until adult eclosion. This screen identified 12 GPCRs that affect growth and development. Among them are a biogenic amine receptor (TC007490/D2R), peptide receptors (TC013945/CcapR, TC012493/ETHR, TC004716 and TC006805), and protein hormone receptors (TC008163/bursicon receptor and TC009127/glycoprotein hormone-like receptor). Silencing of genes coding for three GPCRs in Class B (TC012521/stan, TC009370/mthl and TC001872/Cirl) and two GPCRs in Class D (TC014055/fz and TC005545/smo) also caused severe mortality (Table 3). DsRNA-mediated knockdown of eight GPCRs caused more than 90% mortality. Interestingly, RNAi for one of the GPCRs, the dopamine-2 like receptor (TC007490), resulted in high lethality during the early larval stage. In *Drosophila*, the dopamine-2 like receptor (D2R) is highly expressed in the head and brain (http://www.flyatlas.org/), and D2R RNAi flies with reduced D2R expression show significantly decreased locomotor activity (Draper et al. 2007). Since TC007490/D2R RNAi beetles died during the larval stage, TC007490/D2R might play a critical role in the growth and development of beetle larvae by modulating neuronal development and locomotor activity, as reported in *D. melanogaster*. Collectively, the RNAi screen in *T. castaneum* has provided useful information, and the beetle has proven to be a good model system for future pesticide screens.
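The hit-calling step of such a screen can be illustrated with a short Python sketch. It assumes that mortality is scored as the fraction of injected insects that died, and it uses the 90% figure mentioned above as a cutoff. The function name `call_hits`, the gene labels and the counts are hypothetical placeholders, not data or methods from the study [71].

```python
def call_hits(counts, threshold=0.90):
    """Return genes whose mortality fraction exceeds the threshold.

    counts maps a gene label to (dead, injected) insect counts.
    """
    hits = {}
    for gene, (dead, injected) in counts.items():
        if injected == 0:
            continue  # no insects scored, so mortality is undefined
        mortality = dead / injected
        if mortality > threshold:
            hits[gene] = mortality
    return hits


# PLACEHOLDER counts for illustration only (dead, injected).
example_counts = {
    "gene_A": (19, 20),
    "gene_B": (18, 20),
    "gene_C": (3, 20),
    "control_dsRNA": (1, 20),
}

hits = call_hits(example_counts)
for gene, mortality in hits.items():
    print(f"{gene}: {mortality:.0%} mortality")
```

In practice, mortality in the control injection would also need to be taken into account before a gene is accepted as a hit, and positive hits would be confirmed in a second round of RNAi, as outlined in Figure 1.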



Besides mortality, RNAi for eight GPCRs, including the recently characterized bursicon receptor [72], also resulted in severe developmental arrest and ecdysis failure. Interestingly, the majority of insects injected with TC007490/D2R dsRNA were not able to molt to the pupal stage and died during the larval stage. Only a few larvae injected with TC007490/D2R dsRNA were able to reach the quiescent stage (a non-feeding prepupal stage, about 96 hr after ecdysis into the final instar), suggesting that this gene may play an important role during larval growth and development rather than in molting and metamorphosis. In contrast, most of the insects injected with TC001872/Cirl dsRNA entered the quiescent stage and died during this stage. About 40% of the insects injected with TC001872/Cirl dsRNA were able to molt to the pupal stage and eventually died during the early pupal stage. The majority of insects injected with TC012521/stan dsRNA were not able to complete adult eclosion and died during the pharate adult stage. Interestingly, TC014055/fz and TC009370/mthl RNAi caused an arrest in both larval-pupal and pupal-adult ecdysis, suggesting that these genes may play important roles in the regulation of ecdysis behavior. In contrast, insects injected with TC009370/mthl dsRNA were arrested at the late phase of larval-pupal and pupal-adult ecdysis. The majority of insects injected with TC005545/smo dsRNA died during the early pupal stages without showing any ecdysis defects.

Genome-Wide RNAi Screen for the Discovery of

Gene Function, Novel Therapeutical Targets and Agricultural Applications 107

The GPCRs identified in this study [71] could serve as potential pesticide targets, either in small molecule screens or in the development of RNAi-based pesticides. Many of the identified GPCRs belong to classic GPCR families, e.g. biogenic amine receptors (TC007490/D2R and TC011960/5-HTR) and neuropeptide receptors (TC009127/glycoprotein hormone-like receptor). These GPCRs, which are activated by small molecules, can be used as potential targets for novel pesticide development. On the other hand, it may not be possible to apply small molecule ligands for pest management by targeting the identified atypical GPCRs (e.g. TC014055/fz and TC005545/smo), whose ligands tend to be larger proteins. However, it should be possible to develop an RNAi-based pest control strategy through ingestion of specific dsRNA targeting atypical GPCRs as well as classical GPCRs [73].

**4. Conclusion**

The genome-wide RNAi screen is a powerful technique for studying gene functions, deciphering complex phenotypes, and identifying novel drug targets. It opens up a whole new field that allows researchers to explore new modulators in classical signaling pathways, new mechanisms underlying basic biological functions, and new drug targets for human diseases. An increasing number of genome-wide RNAi screens have been conducted successfully and have led to a wide range of novel discoveries. Although off-target effects and other false discovery issues remain, the RNAi screening technique will improve greatly with the development of new RNAi libraries and image detection instruments. Most importantly, as our understanding of the RNAi pathway continues to grow, we will be able to design more specific and effective RNAi tools for genome-wide RNAi screens. There is no doubt that, through genome-wide RNAi screens, we will gain more insight into complex signaling networks and the molecular mechanisms of diseases in the near future, which will eventually lead to the discovery of novel therapeutic drugs and crop protection reagents.

**Author details**

Hua Bai

*Department of Ecology and Evolutionary Biology, Brown University, USA*

**Acknowledgement**

I would like to thank Subba R. Palli and Ping Kang for valuable comments on the manuscript, and the Ellison Medical Foundation/AFAR postdoctoral fellowship for financial support.

**5. References**

[1] Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature. 1998 Feb 19;391(6669):806-11.

[2] Mello CC, Conte D, Jr. Revealing the world of RNA interference. Nature. 2004 Sep 16;431(7006):338-42.

[3] Zamore PD. RNA interference: big applause for silencing in Stockholm. Cell. 2006 Dec 15;127(6):1083-6.

[4] Ratcliff F, Harrison BD, Baulcombe DC. A similarity between viral defense and gene silencing in plants. Science. 1997 Jun 6;276(5318):1558-60.

[5] Mourrain P, Beclin C, Elmayan T, Feuerbach F, Godon C, Morel JB, et al. Arabidopsis SGS2 and SGS3 genes are required for posttranscriptional gene silencing and natural virus resistance. Cell. 2000 May 26;101(5):533-42.

[6] Voinnet O, Lederer C, Baulcombe DC. A viral movement protein prevents spread of the gene silencing signal in Nicotiana benthamiana. Cell. 2000 Sep 29;103(1):157-67.

