Chromosome No:  Start of pattern:  End of pattern:  No. of Patterns:  Average No. per block
4               3000848            3170609          2                 6.58
13              67400889           67687804         2                 4.09
```
Fig. 5. Example output file showing genomic regions with low SNP allele pattern counts.

As an example of follow-up analysis, the genes within such a genomic region can be identified. The FunctSNP R package (Goodswen et al., 2010) provides a function called "getGenesByRange", which returns the Gene ID for all genes located between a user-specified start and end location.
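Before a range query of this kind can be run, the rows of the output file have to be read into chromosome, start, and end values. The following Python sketch shows one way to do so. It assumes the file is whitespace-separated with five fields per data row, as in Fig. 5; the names `Region` and `parse_regions` are hypothetical and are not part of *SNPpattern* or FunctSNP.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chromosome: int
    start: int
    end: int
    n_patterns: int
    avg_per_block: float


def parse_regions(text: str) -> List[Region]:
    """Parse whitespace-separated data rows; header and blank lines are skipped."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a data row has exactly five fields and starts with a chromosome number.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        chrom, start, end, n_patterns = (int(f) for f in fields[:4])
        regions.append(Region(chrom, start, end, n_patterns, float(fields[4])))
    return regions


# The two data rows shown in Fig. 5.
example = """\
4 3000848 3170609 2 6.58
13 67400889 67687804 2 4.09
"""

regions = parse_regions(example)
for r in regions:
    # Chromosome, start, end, and the physical length of the region in base pairs.
    print(r.chromosome, r.start, r.end, r.end - r.start)
```

Each `Region` then carries the start and end locations that a range-based gene lookup needs.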

#### **3.5 Implementation summary**

In summary, three sets of Perl scripts comprise *SNPpattern*: 1) grouping data scripts – to create separate data files for downstream comparison analysis; 2) SNP allele block scripts – to find, count, and compare the SNP allele block patterns between any groups of individuals; and 3) similarity scripts – to score the similarity between individuals based on each individual's entire SNP allele pattern. Table 8 summarises the primary function and rationale of each script.
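The find-and-count step of the SNP allele block scripts can be illustrated with a short Python sketch. This is not the Perl implementation; it only assumes that each phased chromosome copy is available as a string of alleles coded 1 or 2. The function name `block_pattern_counts` and the example haplotypes are hypothetical.

```python
from collections import Counter


def block_pattern_counts(haplotypes, block_size):
    """For each non-overlapping block, count the SNP allele patterns observed."""
    n_snps = len(haplotypes[0])
    counts = []
    for start in range(0, n_snps - block_size + 1, block_size):
        counts.append(Counter(h[start:start + block_size] for h in haplotypes))
    return counts


# Hypothetical phased haplotypes (illustrative only): one string per chromosome copy.
haplotypes = ["112212", "112211", "112212", "212212"]

counts = block_pattern_counts(haplotypes, 3)
for i, counter in enumerate(counts):
    pattern, n = counter.most_common(1)[0]
    share = 100 * n / len(haplotypes)
    print(f"block {i}: most frequent pattern {pattern} ({share:.0f}% of haplotypes)")
```

Reporting the most frequent pattern per block together with its share of the group mirrors the kind of summary that the block scripts write to their output files.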

SNPpattern: A Genetic Tool to Derive Haplotype Blocks

| Perl Script Name (.pl) | Primary function | Rationale |
|---|---|---|
| **Scripts for grouping and summarising genotype data** | | |
| Group | Genotype data is separated into files according to a grouping criterion. For example, the genotype of animals can be grouped according to their sire breed, or flock ID, or birth years. | Main purpose of dividing the data into groups is to account for population structure, facilitate the SNP-block pattern counting within a group and the comparison of the SNP-block pattern count between groups. |
| divide | Divide the bi-allelic SNPs in any group input file (e.g. flock, breed, and sire groups) into separate chromosome files. | Used as the main input for the SNP allele pattern analysis scripts and in particular for the multiple SNP allele block approach. |
| **Scripts for finding, counting, and comparing SNP allele block patterns** | | |
| derive\_pattern | Derive all SNP allele patterns of a specified block size (e.g. 3, 100, 1000, 2000 etc.) that exist in the maternal and/or paternal chromosomes for *any* group file (e.g. either flock, breed, or sire). | Compiles all the unique SNP allele patterns found in a group into 1 file. Used as input to subsequent scripts to find and count the frequency of these unique patterns. |
| match | Find and count the number of matching SNP allele patterns found within a specified block size along a paternal and/or maternal chromosome for every individual in a specified group. | An essential requirement for the multiple SNP allele block approach. |
| order\_match | Similar to "match.pl" except the output is in a different format. Also creates a group consensus file containing a concatenation of the most common SNP allele pattern found at each block. In effect it creates a paternal or maternal chromosome comprising the most common SNP allele patterns in a group. | Enables a researcher to view and compare, one block at a time, the SNP allele patterns found within each block. The group consensus chromosome can be compared to the chromosomes of individuals within the group and the difference can be used as a measure of dissimilarity between individuals. |
| score | Output the most frequent SNP allele block pattern found at each block location along the chromosome and provide additional information such as the percentage of animals with the pattern, and chromosomal start and end location of the block. | The most frequent SNP allele block pattern is deemed to be the most likely to be a haplotype. The statistics provided may enable the researcher to decide if the SNP allele pattern is a true haplotype or one occurring by chance. |
| **Scripts used to find similarity between animals based on SNP allele patterns** | | |
| Sim | For each animal in turn, list all other animals in the same group in order of SNP allele pattern similarity. The entire chromosome is compared and individuals are scored as to how many SNP markers are the same. | Similarity matrices for individuals within flocks, breeds, or sires can be computed. |
| Rank | Similar to "sim.pl" except rank the animals' similarity to all other animals in the group based on the summation of the scores from the similarity matrices. | Scores can be used as a measure of genetic similarity between individuals or groups. It is expected that similar individuals will have similar LD patterns. |
| **Miscellaneous scripts** | | |
| SNP\_map | Count the number of SNPs per chromosome and determine the distance between each genotyped SNP. | Knowledge of the distribution and distance between the genotyped SNPs is important for interpreting haplotype block boundaries. |
| pattern | Generate a file listing all the possible combinations of 1s and 2s given a pattern size. | Created as a general pattern generator tool. |

Table 8. The suite of Perl scripts collectively called *SNPpattern*.

#### **4. Discussion**

The *SNPpattern* program is a first version and is still in its development phase, and the program testing was a first attempt to analyse the haplotype structure within and across populations. Nonetheless, *SNPpattern* in its current form generates, with little effort required from the user, output files that provide a researcher with information about LD and IBD which can be used in population diversity and association studies. *SNPpattern* still has some shortcomings that need to be addressed in future releases. Accounting for the population structure of a group is currently at the discretion of the user, who must group genotypes appropriately. During the grouping of genotypes, *SNPpattern* allows specified animal IDs to be excluded from the group; e.g. if the number of progeny from each sire in a particular breed group is disproportionate, animal IDs can be excluded to balance the proportions. It is anticipated that knowing which animals to exclude may be difficult and that the exclusions may inadvertently introduce biases. A SNP allele pattern count weighted in accordance with animal proportions may therefore be a possible solution. Pritchard et al. (2000) propose a model-based clustering method that uses genotype data to infer population structure. With this method it might be possible to assign individuals to appropriate groups automatically. Another important omission that needs to be addressed is that the varying physical distance between the SNPs within the blocks is not taken into account when haplotype block locations are interpreted: SNPs are closer together in some regions and further apart in others. Also, a sliding block window would improve accuracy and needs to be implemented. For example, for a 3-SNP allele block the program currently uses windows of SNPs from 1 to 3, 4 to 6, 7 to 9 etc. A sliding window would encompass SNPs from 1 to 3, 2 to 4, 3 to 5 etc.
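The difference between the current non-overlapping blocks and the proposed sliding window can be made concrete with a small Python sketch. The function `block_windows` and its `step` parameter are hypothetical names chosen for illustration; they do not correspond to options in the Perl scripts.

```python
def block_windows(n_snps, block_size, step=None):
    """Return (first, last) 1-based SNP indices for each block.

    step == block_size gives non-overlapping blocks (the current behaviour
    described in the text); step == 1 gives a sliding window.
    """
    if step is None:
        step = block_size
    return [(s + 1, s + block_size)
            for s in range(0, n_snps - block_size + 1, step)]


# Nine SNPs, 3-SNP blocks.
non_overlapping = block_windows(9, 3)       # blocks 1 to 3, 4 to 6, 7 to 9
sliding = block_windows(9, 3, step=1)       # blocks 1 to 3, 2 to 4, 3 to 5, ...
print(non_overlapping)
print(sliding)
```

With nine SNPs the non-overlapping scheme examines three blocks, whereas the sliding window examines seven, so a pattern that straddles a fixed block boundary is no longer missed. The cost is a larger number of blocks to count, roughly one per SNP instead of one per block length.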

During the development of *SNPpattern* several statistical methods (in addition to Fisher's Exact Test for Count Data) were used in an attempt to distinguish SNP allele patterns that occur because the SNP alleles are correlated (possible members of a haplotype block) from SNP allele patterns that occur by chance. Despite taking allele frequencies into account, no statistical test was found to reliably prove that SNPs were inherited by descent. For example, let us suppose we have 3 SNP alleles in relatively close proximity to each other on a particular chromosome in a distant ancestor. Many generations of progeny later, we have exactly the same 3 SNP alleles (the same haplotype block) in some of the progeny. The challenge is to prove that these 3 SNP alleles were inherited from the distant ancestor. The expectation is that these 3 SNP alleles have remained together on the haplotype block because they reside in a genomic region which is involved in important biological processes. That is, positive selection has ensured the survival of the haplotype block. Consequently it is expected that in a population of descendants of the distant ancestor, the frequency of the haplotype block housing the 3 SNP alleles will be high. The increased frequency of the 3 SNP alleles might be explained by the process of selective sweeps (Montpetit & Chagnon, 2006; Chevin & Hospital, 2008). A strong selective sweep can result in only 1 or 2 haplotypes existing in the same region of the genome for a population (Chevin & Hospital, 2008). Therefore, although further evidence is required, it is argued that in some instances SNP allele patterns that are overrepresented in the population indicate non-random SNP inheritance and could be considered part of a haplotype block. For example, there are cases where in a particular genomic region only 1 of the 8 possible SNP allele patterns is present in the population (i.e. 100% of individuals have the same pattern). Many of the results from the Fisher's Exact Test, however, dispute this argument. For example, in regions of the genome where nearly all individuals have the same SNP allele pattern block and SNP allele frequencies on the block are high, Fisher's p-values indicate that the SNPs are independent.
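One likely reason for this apparent contradiction is that an exact test on a contingency table has almost nothing to work with when nearly every chromosome carries the same alleles: with the margins fixed, very few alternative tables are possible, so the p-value is large whatever the cause of the pattern. The Python sketch below illustrates this with a two-sided Fisher's exact test on a 2x2 table of allele co-occurrence at two SNPs. The table layout and the counts are assumptions made for illustration; the chapter does not state how the contingency tables were constructed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts on 100 phased chromosomes (illustrative, not data from the chapter):
# rows = allele 1 or 2 at the first SNP, columns = allele 1 or 2 at the second SNP.
p_near_fixed = fisher_exact_two_sided(98, 1, 1, 0)
print(f"p-value when 98% of chromosomes share one pattern: {p_near_fixed:.2f}")
```

In this hypothetical table 98 of 100 chromosomes carry the same two-SNP pattern, yet only two tables are compatible with the margins and the test returns a p-value of 1, i.e. no evidence against independence. A high pattern frequency and a non-significant exact test are therefore not in conflict; the test simply has no power in nearly monomorphic regions.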

Like all programs, the worth and accuracy of the output data from *SNPpattern* are totally dependent on the input data. For example, in the program testing on sheep breeds (Goodswen et al., 2009), the frequency of SNP allele block patterns was counted and the similarity between animals scored based on only 5,494 SNPs, which were genotyped for chromosome #1. In other words, the interpretation of the LD patterns for chromosome #1 was based on the state of 5,494 nucleotides. Chromosome #1 in fact comprises over 299,636,549 nucleotides and, in the case of sheep, contains an unknown number of SNPs. It is expected that the *SNPpattern* outputs will become more informative as the number of SNPs increases and the distance between the SNPs decreases. It is also important to know what selection criterion was used for selecting the SNPs to be genotyped before interpreting the results obtained from *SNPpattern*. For example, were the SNPs selected for even distribution across the genome, and/or were they selected as tags owing to prior knowledge of the LD structure? If the purpose of using *SNPpattern* is to define haplotype blocks, then the results may be distorted if the genotyped SNPs are tag SNPs.
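The scale of this limitation follows directly from the two figures quoted above. The short Python calculation below assumes, for simplicity, that the SNPs are spread evenly along the chromosome, which the text notes may not be the case.

```python
n_snps = 5_494
chromosome_length = 299_636_549  # nucleotides; the text gives this as a lower bound

# Average distance between neighbouring genotyped SNPs under even spacing (an assumption).
mean_spacing = chromosome_length / n_snps

# Share of chromosome positions whose state is actually observed.
fraction_genotyped = n_snps / chromosome_length

print(f"mean spacing: {mean_spacing:,.0f} bp")
print(f"fraction of positions genotyped: {fraction_genotyped:.4%}")
```

On these figures there is roughly one genotyped SNP every 54.5 kb, and fewer than 2 in every 100,000 positions are observed, which is why denser genotyping is expected to make the outputs more informative.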

This chapter focused solely on SNP haplotypes in the context of LD or selective sweeps due to directional selection (natural or artificial) acting on the genetic variants affecting complex traits measured or observed on the individuals. However, the consequences of such selection would have arisen at the underlying biological level, namely through SNP haplotype diversity affecting gene expression levels or protein abundance in cells and tissues of relevance to the complex trait. This emphasises that future genetic studies on global gene expression patterns (Kadarmideen et al., 2006; Kadarmideen, 2008) should be targeted at the effects of LD / expression-related SNP haplotype patterns. In fact, such studies could contribute to the prediction of transcription factor binding sites using combined SNP and gene expression datasets (Vonrohr et al., 2007). Further, the unique co-expression gene networks and functional gene modules identified as distinguishing different phenotypic extremes or cases/controls (e.g. Kadarmideen et al., 2011) could be speculated to result from the formation of distinct SNP haplotypes after selective sweeps.

It is expected that in the very near future SNPs will, for the most part, be superseded by entire DNA sequences owing to the advent of low-cost next-generation sequencing (Hayden, 2009). With little modification, *SNPpattern* will handle DNA sequences in much the same way as it currently handles SNP allele sequences (although the computing performance and capacity required are unknown). It is envisaged that varying block sizes of DNA sequences will be compared and counted between individuals to determine the structure and distribution of LD. Comparing entire DNA sequences between individuals is also ideal for determining genetic similarity.

Although the motivation for developing *SNPpattern* was to find patterns of LD, it is suggested that common SNP allele patterns could be used in association studies (Botstein & Risch, 2003). The use of common SNP allele patterns is only an interim suggestion, as it is expected that using common DNA sequences in association studies will prove to generate the most reliable results in the future.

#### **5. Conclusions**


We described the development of *SNPpattern*, the collective name for a suite of Perl scripts designed to group, count, and compare SNP allele patterns of various block sizes. Differences in SNP allele block frequency are used as a measure of haplotype diversity within and between groups. A SNP allele pattern represents SNPs inherited from one parent and is a product of phased genotype data. From a programming point of view, the SNP allele pattern is simply a line of characters drawn from a pair of symbols (0 or 1, 1 or 2, A or B) representing 2 different states. The main factor that drove the development of *SNPpattern* was the premise that studying SNP allele patterns can reveal useful information to help understand the genetics of individuals within groups and across groups. The use of *SNPpattern* has been illustrated on sheep breeds (Goodswen et al., 2009), but it is generic software intended for all species. *SNPpattern* allows researchers, given any phased genotype data in a PHASE or fastPHASE format, to analyse SNP allele patterns within any user-defined SNP allele block size. These SNP allele patterns can be compared between user-defined groups. The primary objective of the tool is to provide a researcher with useful information on SNP allele block patterns; as a major example of its usage, the information can be used to quantify haplotype diversity within and between groups. While there are bioinformatics tools with a primary focus on haplotype inference and/or analysis (such as Haploview, HapBlock, HaploBlock, and GERBIL), we have found no tool that provides a smooth interface between PHASE or fastPHASE output and haplotype diversity/analysis.
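Because a pattern is a string over two symbols, a block of k SNPs can take at most 2^k distinct patterns, which is where the figure of 8 possible patterns for a 3-SNP block comes from. A minimal Python sketch of a generic pattern generator follows; the function name `all_patterns` and the default coding of alleles as 1 and 2 are assumptions for illustration.

```python
from itertools import product


def all_patterns(block_size, alleles="12"):
    """List every possible SNP allele pattern of the given block size."""
    return ["".join(p) for p in product(alleles, repeat=block_size)]


patterns = all_patterns(3)
print(len(patterns))  # 2 ** 3 possible patterns
print(patterns)
```

The number of possible patterns doubles with every SNP added to the block, so for large block sizes it is only practical to enumerate the patterns actually observed in a group and not every possible pattern.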

Two main approaches for studying the SNP allele patterns have been implemented within *SNPpattern*: a multiple SNP allele block approach and a pattern similarity scoring approach. For both approaches, *SNPpattern* generates various descriptive statistics of the SNP allele patterns in plaintext output files. It is not the authors' intention to stipulate how a researcher should interpret or use the information. Nevertheless, suggestions were made in this chapter as to how *SNPpattern* might be used by a researcher. In particular, *SNPpattern* was proposed as a generic tool for finding the patterns of LD using a multiple SNP allele block model. We have demonstrated in another published paper how *SNPpattern* can be used to examine the patterns and extent of LD within and between 4 Australian sheep breeds (Goodswen et al., 2009). The results show that *SNPpattern* could be used to effectively evaluate overall haplotype diversity within and between groups of individuals.
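The pattern similarity scoring approach can be sketched in a few lines of Python. The sketch assumes that similarity is the number of SNP positions at which two chromosome-wide patterns carry the same allele; the function names and the example animals are hypothetical, and the Perl scripts may differ in detail.

```python
def similarity(a, b):
    """Number of SNP positions at which two equal-length patterns carry the same allele."""
    if len(a) != len(b):
        raise ValueError("patterns must cover the same SNPs")
    return sum(x == y for x, y in zip(a, b))


def rank_by_similarity(target_id, patterns_by_animal):
    """List all other animals in order of decreasing similarity to the target animal."""
    target = patterns_by_animal[target_id]
    scores = [(other, similarity(target, pattern))
              for other, pattern in patterns_by_animal.items()
              if other != target_id]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical paternal chromosome patterns for four animals in one group.
patterns_by_animal = {
    "A": "1122121",
    "B": "1122122",
    "C": "2211212",
    "D": "1122121",
}

ranking = rank_by_similarity("A", patterns_by_animal)
print(ranking)
```

Repeating the ranking for every animal in a group yields a full similarity matrix, whose row sums could serve as the overall score for each animal.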

In closing, *SNPpattern* is a simple pre-screening tool to rapidly screen the genome for haplotype structure and provide insights into highly conserved, biologically important haplotypes. *SNPpattern* is implemented in Perl and supported on Linux and MS Windows. We have tested *SNPpattern* on Ovine 60k SNPchip data (Goodswen et al., 2010). All scripts are freely available from: http://web4ftp.it.csiro.au/ftp4goo17a/SNPpattern/SNPpattern.zip.

*SNPpattern* will be made available to the public via http://systemsgenetics.dk/pages/resources.php in the future.

#### **6. Acknowledgements**

We would like to sincerely thank Julius van der Werf and Cedric Gondro for the inspiration behind this paper and help with providing sheep SNP data for program testing.

#### **7. References**

Ardlie, KG., Kruglyak, L., & Seielstad, M. (2002). Patterns of linkage disequilibrium in the human genome. *Nature Reviews Genetics*, 3(4):299-309.

Barrett, JC., Fry, B., Maller, J., & Daly, M.J. (2005). Haploview: analysis and visualization of LD and haplotype maps. *Bioinformatics*, 21(2):263-265.

Botstein, D. & Risch, N. (2003). Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. *Nat Genet*, 33:228-237.

Burton, PR., Clayton, D.G., Cardon, L.R., Craddock, N., Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W.H., Samani, N.J. et al. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. *Nature*, 447(7145):661-678.

Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. & Nickerson, D.A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. *Am J Hum Genet*, 74(1):106-120.

Chevin, L.M. & Hospital, F. (2008). Selective Sweep at a Quantitative Trait Locus in the Presence of Background Genetic Variation. *Genetics*, 180(3):1645-1660.

Collins, F.S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., Walters, L., Fearon, E., Hartwelt, L., Langley, C.H., Mathies, R.A. et al. (1998). New goals for the US Human Genome Project: 1998-2003. *Science*, 282(5389):682-689.

Dempster, AP., Laird, NM. & Rubin, DB. (1977). Maximum likelihood from incomplete data via EM algorithm. *Journal of the Royal Statistical Society Series B-Methodological*, 39(1):1-38.

Fu, YX. & Li, W.H. (1999). Coalescing into the 21st century: An overview and prospects of coalescent theory. *Theor Popul Biol*, 56(1):1-10.

Gabriel, SB., Schaffner, SF., Nguyen, H., Moore, JM., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. (2002). The structure of haplotype blocks in the human genome. *Science*, 296(5576):2225-2229.

Gibbs, RA., Belmont, JW., Hardenbol, P., Willis, TD., Yu, FL., Yang, HM., Chang, LY., Huang, W., Liu, B., Shen, Y. et al. (2003). The International HapMap Project. *Nature*, 426(6968):789-796.

Goodswen, SJ., Gondro, C., Watson-Haigh, NS. & Kadarmideen, HN. (2010). FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases. *BMC Bioinformatics*, 11.

Goodswen, SJ., Gondro, C., Kadarmideen, HN., & van der Werf, JHJ. (2010). Evaluating haplotype diversity within and between Australian sheep breeds. *Proceedings of the 9th World Congress on Genetics Applied to Livestock Production* (WCGALP), Leipzig, Germany.

Greenspan, G. & Geiger, D. (2004). High density linkage disequilibrium mapping using models of haplotype block variation. *Bioinformatics*, 20(suppl 1):i137-i144.

Hayden, EC. (2009). Genome sequencing: the third generation. *Nature*, 457(7231):768-769.

Hayes, BJ., Gjuvsland, A. & Omholt, S. (2006). Power of QTL mapping experiments in commercial Atlantic salmon populations, exploiting linkage and linkage disequilibrium and effect of limited recombination in males. *Heredity*, 97(1):19-26.

Hirschhorn, JN., Lohmueller, K., Byrne, E. & Hirschhorn, K. (2002). A comprehensive review of genetic association studies. *Genet Med*, 4(2):45-61.

Hudson, RR. & Kaplan, NL. (1985). Statistical properties of the number of recombination events in the history of a sample of DNA-sequences. *Genetics*, 111(1):147-164.

Jeffreys, AJ. & Neumann, R. (2002). Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. *Nat Genet*, 31(3):267-271.

Kadarmideen, HN., Von Rohr, P. & Janss, LLG. (2006). From Genetical-Genomics to Systems Genetics: Potential applications in quantitative genomics and Animal Breeding. *Mammalian Genome*, 17:548-564.

Kadarmideen, HN. & Janss, LLG. (2007). Population and Systems genetics of cortisol in pigs divergently selected for stress. *Physiological Genomics*, 29:57-65.

Kadarmideen, HN. & Reverter, A. (2007). Combined genetic, genomic and transcriptomic methods in the analysis of animal traits. *CAB Reviews: Perspectives in Agriculture, Veterinary Science, Nutrition and Natural Resources*, 2(042):16.

Kadarmideen, HN. (2008). Genetical systems biology in Livestock – Application to GnRH and Reproduction. *IET Systems Biology*, 2:423-441.

Kadarmideen, HN., Watson-Haigh, NS. & Andronicos, NM. (2011). Systems biology of ovine intestinal parasite resistance: disease gene modules and biomarkers. *Molecular BioSystems*, 7:235-246.

Kim, SH. (2001). An evaluation of a Markov chain monte carlo method for the Rasch model. *Applied Psychological Measurement*, 25(2):163-176.

Kimmel, G. & Shamir, R. (2005). GERBIL: Genotype resolution and block identification using likelihood. *Proc Natl Acad Sci USA*, 102(1):158-162.

Kruglyak, L. (2008). The road to genome-wide association studies. *Nat Rev Genet*, 9(4):314-318.

Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J.A., Armananzas, R., Santafe, G., Perez, A. et al. (2006). Machine learning in bioinformatics. *Briefings in Bioinformatics*, 7(1):86-112.

Li, M., Chen, X., Li, X., Ma, B. & Vitanyi, P. (2003). The similarity metric. In: *Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms*. Baltimore, Maryland: Society for Industrial and Applied Mathematics: 863-872.

Libiger, O., Nievergelt, CM. & Schork, NJ. (2009). Comparison of Genetic Distance Measures Using Human SNP Genotype Data. *Hum Biol*, 81(4):389-406.

Mackay, TFC., Stone, EA. & Ayroles, JF. (2009). The genetics of quantitative traits: challenges and prospects. *Nature Reviews Genetics*, 10(8):565-577.

McKay, SD., Schnabel, RD., Murdoch, BM., Matukumalli, LK., Aerts, J., Coppieters, W., Crews, D., Dias, E., Gill, CA., Gao, C. et al. (2007). Whole genome linkage disequilibrium maps in cattle. *BMC Genet*, 8.

Montpetit, A. & Chagnon, F. (2006). The Haplotype Map of the human genome: a revolution in the genetics of complex diseases. *M S-Medecine Sciences*, 22:1061-1067.


**21** 



### **Algorithms for CpG Islands Search: New Advantages and Old Problems**

Yulia A. Medvedeva

*Vavilov Institute of General Genetics, Russian Academy of Sciences, Research Institute for Genetics and Selection of Industrial Microorganisms, Russia* 

#### **1. Introduction**

448 Bioinformatics – Trends and Methodologies


CpG islands (CGIs) are regions with high GC and CpG content, whereas mammalian genomes are generally CpG-depleted. CGIs are often located in the promoter regions of genes, mostly housekeeping genes but also tissue-specific ones. It is widely believed that CpG dinucleotides within promoter CGIs are unmethylated and are targets for specific regulatory protein binding. As a result, CGIs contain special sequence motifs for high-affinity protein binding (transcription factor binding sites, TFBS). Methylation of cytosine in the CpG context within such motifs could decrease the affinity of TF binding, increase the attraction of methyl-binding proteins, affect histone modification and, therefore, lead to repression of gene transcription. The mechanism of local and global transcription repression via CpG methylation is used in many normal (development, differentiation, aging, X-chromosome inactivation, imprinting) and pathological (cancer and other diseases) processes. However, it has recently been reported that a class of normally methylated but active promoters does exist.

Evidence for the biological relevance of methylated CGIs and of CGIs located far from gene promoters has recently begun to appear. Such CGIs could act as regulators of pervasive transcription, which seems to be a genuine feature of the genome rather than a side effect of errors in high-throughput techniques. Replication origins are also reported to be associated with CGIs, regardless of their location.

As a consequence of their specific nucleotide content, CGIs could affect DNA or RNA secondary structure. For example, the G2-3C2-3 motif common within CGIs induces significant local curvature of DNA. Another motif, the G-rich sequence (GRS) in the 3' and 5' regions of RNA, is known to form specific structures, G-quadruplexes, at both ends of the RNA, which play an important role in its stability. This motif corresponds to a C-rich sequence in DNA and is likely to appear in CGIs.

Classical algorithms for CpG island search use a sliding window (SWM) or a running sum (RSM) together with several distinct but not independent criteria (GC content, Obs/ExpCpG and length). The thresholds for these criteria are rather arbitrary, do not account for differences between species, and lack biological interpretation. SWM algorithms are rather slow; RSM algorithms are faster but tend to split large CGIs into several smaller ones and to omit CGIs with a nonuniform distribution of CpG dinucleotides along the sequence. Recently, several algorithms based on clustering of CpG dinucleotides have been implemented. These algorithms have fewer parameters and a sound mathematical basis. Comparing the algorithms is difficult. Hypermutability of CpG dinucleotides leads to loss of CGI conservation between species, so comparative genomics cannot be applied to estimate the effectiveness of the algorithms.

To validate the results of CGI prediction, authors use different biological and mathematical properties. One of the most popular quality measures is the fraction of predicted CGIs that are located near promoters of protein-coding genes and do not overlap Alu-repeats. This measure is inappropriate for at least two reasons. First, as has recently become clear, promoters of protein-coding genes are likely to be only a small fraction of all promoters. Second, two classes of promoters (CGI-dependent and CGI-independent) exist, and their ratio is unclear. Avoiding repetitive sequences is more or less achievable for many algorithms, but authors now prefer to remove Alu-repeats and other repetitive DNA sequences in advance.

Prediction of the methylation profile in different tissues, in normal and cancer states, is another idea for validation. CGI search algorithms *per se* fail to predict correctly the distribution of methylated cytosine in the genome. To distinguish between methylated and unmethylated CGIs, machine-learning techniques (MLT) are used. These studies include additional sequence features (di- and trinucleotide distribution, CpG and TpG frequencies, TFBS, repetitive elements and others). Machine-learning techniques are also applicable to identifying promoter CGIs. The point that GC content and CpG frequency, or the density of CpG clusters, are not enough to describe special types of CGIs is highly relevant. The main problem of MLT approaches is that the resulting model usually has many parameters, sometimes without clear biological meaning. The consistency of models built by different authors under similar conditions is rather low, so those features can hardly be used to assess the quality of CGI prediction in the general case.

The verification problem, caused by the lack of universal biological properties of CGIs, results in the absence of a widely accepted definition. It should be mentioned that all algorithms that try to predict CGIs with one particular function (promoter or unmethylated CGIs) show a high false-positive rate, probably due to the complex network of CGI functions. It is becoming clear that many different functional elements exist within one CGI. Moreover, both methylated and unmethylated, and both promoter and non-promoter, CGIs seem to be functional. One can therefore conclude that contemporary algorithms for CGI search based only on GC and CpG content or on CpG clustering identify a chimeric class of objects.

#### **2. Algorithms for CpG islands search**

The most popular algorithms for CpG island search are still based on criteria established more than twenty years ago (Gardiner-Garden & Frommer, 1987). A DNA segment is considered to be a CpG island if it is at least 200 bp long, has a GC content of at least 0.5, and has an Obs/ExpCpG ratio (1) of at least 0.6.

$$\text{Obs}/\text{Exp}_{\text{CpG}} = \frac{N_{\text{CpG}} \cdot N}{N_{\text{C}} \cdot N_{\text{G}}} \tag{1}$$

where $N_{\text{C}}$, $N_{\text{G}}$ and $N_{\text{CpG}}$ are the numbers of C, G and CpG, respectively, in a region of length $N$. Implementations of this basic idea vary in their details, mostly in the methods used to search for segments with the properties mentioned above.
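The criteria above can be illustrated with a minimal Python sketch. The function names `cpg_stats` and `is_cgi` are hypothetical and chosen only for this example; the default thresholds are the ones quoted in the text (200 bp, 0.5 and 0.6).

```python
def cpg_stats(seq):
    """Return (length, GC content, Obs/Exp CpG) for a DNA string, following equation (1)."""
    seq = seq.upper()
    n = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so a plain count is enough
    gc = (n_c + n_g) / n if n else 0.0
    # Obs/Exp is undefined without both C and G; report 0.0 in that case (an assumption)
    obs_exp = n_cpg * n / (n_c * n_g) if n_c and n_g else 0.0
    return n, gc, obs_exp


def is_cgi(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check the length, GC content and Obs/Exp CpG thresholds for one segment."""
    n, gc, obs_exp = cpg_stats(seq)
    return n >= min_len and gc >= min_gc and obs_exp >= min_obs_exp
```

For instance, `is_cgi("CG" * 100)` holds for a 200 bp run of CpGs, while `is_cgi("AT" * 100)` does not.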

#### **2.1 Sliding window methods**

Several algorithms search for CGIs using sliding window methods (SWM): CpGplot (Rice et al., 2000), CpG Island Searcher (Takai & Jones, 2002), CpG Island Explorer (Wang & Leung, 2004) and CpGProD (Ponger & Mouchiroud, 2002).


**CpGplot** represents the simplest variant of SWM. GC content and the Obs/ExpCpG ratio are calculated over a 100 bp window that moves along the sequence in 10 bp steps.
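A sliding window scan of this kind might be sketched as follows. This is an illustration of the general idea only, not the CpGplot implementation: the function name `sliding_window_scan` is hypothetical, the window and step sizes are the ones quoted above, and the step of merging passing windows into islands is left out.

```python
def sliding_window_scan(seq, window=100, step=10, min_gc=0.5, min_obs_exp=0.6):
    """Return start positions of windows that pass the GC and Obs/Exp CpG thresholds."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        n_c, n_g, n_cpg = w.count("C"), w.count("G"), w.count("CG")
        gc = (n_c + n_g) / window
        # Obs/Exp CpG as in equation (1); treated as 0.0 when C or G is absent
        obs_exp = n_cpg * window / (n_c * n_g) if n_c and n_g else 0.0
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append(start)
    return hits
```

Because every window is evaluated from scratch, the cost grows with the sequence length times the window length divided by the step, which is one way to see why SWM are described as rather slow.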

**CpG Island Searcher** (usually referred to as the Takai-Jones algorithm) uses a 200 bp window that moves along the sequence in 200 bp steps. It has an additional threshold for the minimal number of CpG dinucleotides in a predicted CGI, equal to the mathematical expectation of the number of CpG dinucleotides in a Bernoulli sequence of the given length and nucleotide probabilities, multiplied by the Obs/ExpCpG threshold. This feature lets the authors exclude "mathematical CGIs", such as a 300 bp sequence with 150 cytosines and one guanine in the CpG context, which fits the standard CGI criteria. The algorithm also merges two or more CGIs if they are separated by less than 100 bp. Takai and Jones also suggest using stricter thresholds of 500 bp for CGI length, 0.55 for GC content and 0.65 for Obs/ExpCpG to identify CGIs associated with promoters of known protein-coding genes and to avoid CGIs associated with Alu-repeats.
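As a quick check of the "mathematical CGI" mentioned above, the standard criteria can be evaluated by hand for a 300 bp sequence with 150 cytosines, one guanine and therefore a single CpG, using equation (1):

$$\text{GC content} = \frac{150 + 1}{300} \approx 0.503 \ge 0.5$$

$$\text{Obs}/\text{Exp}_{\text{CpG}} = \frac{1 \cdot 300}{150 \cdot 1} = 2.0 \ge 0.6$$

With a length of 300 bp (at least 200 bp), all three standard thresholds are met although the sequence contains only one CpG, which is why an additional minimum on the CpG count is useful.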

**CpG Island Explorer** (CpGIE) is a modification of CpG Island Searcher (CpGIS) from Takai and Jones. The sliding window of CpG Island Explorer moves more slowly, with a step of 10 bp. After closely located CGIs are merged, the resulting CGI is tested once again against the criteria and, if it does not fit, one bp is trimmed from each side until the final CGI fits the criteria. Takai and Jones believe that CGIs predicted by CpGIE are larger in length. Closely located CGIs are merged more reasonably by CpGIE than by CpGIS.

**CpGProD** is a program dedicated to the prediction of promoters associated with CpG islands in mammalian genomic sequences. For every sequence that is found by the sliding window and fits the CGI criteria, the probability of finding a promoter is estimated as

$$p = \frac{\exp(Z)}{1 + \exp(Z)} \tag{2}$$

where Z is a linear combination of CGI length, GC content and Obs/ExpCpG. The probability that a strand is the template for transcription is also estimated as in (2), where Z is a linear combination of the AT- and GC-skews, which are known properties of the nucleotide sequence around the transcription start site (TSS). The coefficients for Z are estimated from two generalized linear regressions, trained on two human datasets composed of CGIs containing and not containing a TSS for protein-coding genes, or on two datasets with different transcription templates.
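Equation (2) is the standard logistic function, and it can be evaluated as in the sketch below. The coefficient values and the presence of an intercept are placeholders invented for this illustration; they are not the coefficients fitted by the CpGProD authors.

```python
import math


def logistic(z):
    """Equation (2): exp(Z) / (1 + exp(Z)), written in the equivalent form 1 / (1 + exp(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))


def promoter_probability(length, gc, obs_exp, coef=(-5.0, 0.002, 3.0, 2.0)):
    """Probability that a CGI overlaps a promoter, given hypothetical coefficients.

    coef = (intercept, weight for length, weight for GC content, weight for Obs/Exp CpG);
    all four values are illustrative assumptions only.
    """
    b0, b_len, b_gc, b_oe = coef
    z = b0 + b_len * length + b_gc * gc + b_oe * obs_exp
    return logistic(z)
```

With positive weights, as assumed here, a longer CGI with higher GC content and Obs/ExpCpG receives a higher probability; the strand probability would be computed the same way from the AT- and GC-skews.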

#### **2.2 Running sum methods**

Running sum methods (RSM) were developed as an alternative to SWM. RSM try to find segments of DNA in which CpG dinucleotides occur more frequently than in the neighboring genomic sequence. RSM are faster than SWM. Initially, RSM did not use the CGI criteria established in (Gardiner-Garden & Frommer, 1987). The best-known methods from this group are CpGreport (newCpGseek) (Rice et al., 2000) and the unpublished algorithm of Mikhlem and Hillier, which is used in the UCSC Genome Browser (http://genome.ucsc.edu) and has therefore become a *de facto* standard for CGI search.

**CpGreport** (or newCpGseek) scores each position in the sequence using a running sum calculated over all positions in the sequence, starting with the first and ending with the last. If there is no CpG dinucleotide at a position, the score is decremented; if there is one, the score is incremented by a constant value. If the score is higher than a threshold, a putative CGI is declared. Sequence regions scoring above the threshold are then searched recursively. It should be noted that the final CGI predicted by this algorithm starts and ends with a CpG dinucleotide and does not necessarily meet the initial CGI criteria (Gardiner-Garden & Frommer, 1987). The authors found many rather short CGIs with high GC content and CpG frequency and considered such CGIs to be overpredictions (Rice et al., 2000).
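The running sum idea can be sketched in a simplified form as follows. This is not the CpGreport code: the function name `running_sum_islands`, the increment, decrement and threshold values, and the rule of closing a candidate when the score returns to zero are all assumptions made for illustration, and the recursive search within high-scoring regions is omitted.

```python
def running_sum_islands(seq, gain=17, loss=1, threshold=17):
    """Return (start, end) intervals whose running CpG score rises above the threshold.

    The score goes up by `gain` at each CpG and down by `loss` elsewhere; a candidate
    is closed when the score decays to zero. Reported intervals start and end with CpG.
    """
    seq = seq.upper()
    islands = []
    score = best = 0
    start = last_cpg = None
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            if score == 0:
                start, best = i, 0  # open a new candidate at this CpG
            score += gain
            best = max(best, score)
            last_cpg = i
        elif score > 0:
            score -= loss
            if score <= 0:
                if best > threshold:
                    islands.append((start, last_cpg + 2))
                score, start = 0, None
    # Close a candidate that is still open at the end of the sequence
    if start is not None and best > threshold:
        islands.append((start, last_cpg + 2))
    return islands
```

A single pass over the sequence is enough, which reflects why RSM are faster than SWM; the sketch also shows why predicted CGIs begin and end with a CpG dinucleotide.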


Even with several methods available for CGI prediction, one is still unable to select the best one. The main reason is the lack of validation criteria. Su and colleagues (Su et al., 2009) propose the cumulative mutual information of CpG dinucleotides as a measure of CGI quality and show that it is a powerful criterion for avoiding CGIs associated with Alu-repeats. Despite the power of this mathematical criterion, most authors try to use biological features for CGI validation.

#### **3.1 Sources for biologically relevant validation: DNA methylation and protein binding**

The very first work to mention CG-rich islands (Bird, 1986) considers them to be DNA regions where cytosine is unmethylated. Cytosine methylation usually appears in the CpG context and increases the probability of cytosine deamination about 10-fold (Ehrlich & Wang, 1981), leading to enrichment of TpG and depletion of CpG dinucleotides in DNA. The absence (or decreased level) of cytosine methylation within CGIs is usually considered to be the origin of CGIs in mammalian genomes (Cross et al., 1994; Eckhardt et al., 2006). Modern research shows that methylated cytosines within CpNpG are also targets for spontaneous deamination (Cooper et al., 2010). There is no doubt that cytosine methylation plays an important role in CGI function. During early development, waves of methylation and demethylation generate tissue-specific genomic methylation profiles. These profiles are stable across generations of somatic cells owing to the replication-dependent maintenance methylation system (Brero et al., 2006). About 70-80% of cytosines in the CpG context are methylated in differentiated cells (Baylin et al., 1998); a recent study shows that cytosine is also methylated within the CpHpN context (where H = C, A or T), especially in embryonic stem cells (Baylin et al., 1998). Cytosine methylation influences DNA structure by facilitating the Z-form conformation (Behe & Felsenfeld, 1981); it also affects protein binding to DNA, so most transcription factors (TF) usually bind unmethylated DNA.

There is a class of proteins (e.g. MeCP1/2, MBD1-6, SRA, and Kaiso) that bind exclusively methylated DNA (Saito & Ishikawa, 2002). The MeCP1 protein complex binds methylated cytosine using the MBD2 protein (Berger & Bird, 2005) and also includes the chromatin remodeling complex NuRD/Mi2. MeCP2 is the key and well-studied member of the methyl-binding domain (MBD) protein group (Fatemi & Wade, 2006). Besides the methyl-binding domain, it contains a transcription repression domain (TRD) (Dhasarathy & Wade, 2008) and is involved in the establishment of DNA methylation together with DNMT1 (Kimura & Shiota, 2003). There is evidence that both MeCP2 and MBD1/2 bind not just 5mCpG but more complicated DNA motifs: MeCP2 binds 5mCpG with an adjacent (A/T)4+ run, which is not true for MBD1/2 proteins (Klose et al., 2005). MeCP2 binds DNA with higher affinity than the MeCP1 complex, leading to more stable repression of transcription. For MeCP2 binding, a single 5mCpG dinucleotide is enough, whereas the MeCP1 complex needs dense clusters of 5mCpGs (Ng et al., 1999).


**UCSC CGI** (algorithm of Mikhlem and Hillier) is based on RSM but includes an additional check that a CGI fits the traditional criteria (Gardiner-Garden & Frommer, 1987). The total number of CGIs obtained by UCSC is smaller than that obtained by CpGplot, because not every frame is tested against the criteria, only those scoring above a threshold in the first step. CGIs predicted by the algorithm of Mikhlem and Hillier are often shorter at both ends than those predicted by CpGplot, and they also start and end with CpG dinucleotides.

#### **2.3 CpG clustering methods**

The next logical step in the development of CGI search tools is to implement actual CpG clustering methods (CGCM). Several such algorithms are available: CpGcluster (Hackenberg et al., 2006), CG clusters (Glass et al., 2007), and CGI\_HW, an algorithm developed by H. Wu (Irizarry et al., 2009; H. Wu et al., 2010). These algorithms are based on segmentation of the genome into regions with different frequencies of CpG dinucleotides (CGI\_HW also uses segmentation based on GC content). Unlike the methods described above, this approach to CGI prediction is data-driven and allows CGIs to be found in species with different average GC content and CpG frequency.

**CpGcluster** has two separate steps: a search for CpG clusters and an estimation of the probability of finding such a cluster by chance. The distance between neighboring CpG dinucleotides in a random sequence is modeled by a geometric distribution with the CpG frequency as its parameter. Hackenberg and colleagues (Hackenberg et al., 2006) assume that within a functional CpG cluster the distance between neighboring CpGs is smaller than expected in a random sequence. The authors show that distances smaller than the median of the theoretical distribution are overrepresented in the human genome. The median distance between neighboring CpGs (23-53 bp, depending on the chromosome) is used as a threshold, so each cluster consists of CpGs located no farther apart than the threshold. All resulting CGIs start and end with a CpG dinucleotide. Each cluster has a p-value calculated from the negative binomial distribution. Only clusters with a p-value of less than 1.0e-5 (1.0e-20 in (Hackenberg et al., 2010a)) are considered to be CpG islands. The authors find about 200000 CpG islands in the human genome (25000 CpG islands using the p-value threshold of 1.0e-20). Many of these CpG islands are shorter than 200 bp. Nevertheless, the authors show that some short CGIs are functional and call them CpG islets (Hackenberg et al., 2010a).
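The first step, the distance-based grouping of CpGs, might look like the sketch below. It is a simplified illustration rather than the CpGcluster program: the function names are hypothetical, distance is measured here between the start coordinates of neighboring CpGs (an assumption), and the p-value step based on the negative binomial distribution is not included.

```python
import statistics


def cpg_positions(seq):
    """0-based start positions of all CpG dinucleotides in a DNA string."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def median_distance(positions):
    """Median distance between neighboring CpGs, usable as the clustering threshold."""
    return statistics.median(b - a for a, b in zip(positions, positions[1:]))


def cluster_cpgs(positions, max_dist):
    """Group CpGs whose neighbor distance is at most max_dist.

    Returns (start, end, number of CpGs) for every cluster with at least two CpGs;
    each interval starts and ends with a CpG dinucleotide.
    """
    clusters = []
    current = positions[:1]
    for prev, pos in zip(positions, positions[1:]):
        if pos - prev <= max_dist:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    if current:
        clusters.append(current)
    return [(c[0], c[-1] + 2, len(c)) for c in clusters if len(c) >= 2]
```

In this sketch the threshold is derived from the observed distances in the input sequence, which is what makes the approach data-driven; a significance filter would then be applied to each returned cluster.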

**CG clusters** annotation also has two steps. First, the location of every CpG dinucleotide is extracted from the genomic DNA sequence. Using these positions, every overlapping sequence fragment that contains a fixed number of CpGs, and therefore has a variable length, is identified. For each number of CpGs, the frequency of each fragment length is recorded. The threshold for the maximum fragment length is defined as a local minimum in the fragment length histogram, estimated by identifying zero values of the first derivative of a cubic spline fit. Mapping the CpG-dense fragments back to the genomic sequence produces an annotation track in which each annotated locus is a conglomeration of one or more overlapping fragments of variable length. The number of overlapping fragments at a locus, normalized by the maximum fragment length, is used as the basis for choosing the optimal track. The track with the maximal fragment overlap per locus is selected based on genomic averages of this metric for different numbers of CpGs per fragment. This approach allows the authors to choose the species-specific optimal number of CpGs per fragment for the final annotation.

**CGI\_HW** (algorithm of H. Wu) assumes that each chromosome is divided into segments belonging to one of three states: Alu repetitive elements, baseline, and CGI. Alu repetitive elements are removed in advance. Hence, the authors characterize the problem as that of a semi-HMM with a known state for Alu repetitive elements, and they consider the 2-state chain conditional on being in a non-Alu state. The authors use the numbers of C, G, and CpG in a segment of length L as parameters for the model. The hidden state Y(s) of a segment is 1 for CGI and 0 for baseline. The authors assume that Y(s) is a stationary first-order Markov chain. The choice of the state is based on two HMMs. One models GC content as high or low, assuming a binomial distribution approximated by the normal density for the baseline. The second models the number of CpGs, assuming a Poisson distribution for the baseline. The segment length L=16 was chosen based on the association of CGIs with epigenetic marks. The approach summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates CGI search in different species.

#### **3. Validation problem**

452 Bioinformatics – Trends and Methodologies

**UCSC CGI (**Algorithm of Mikhlem and Hillier) is based on the RSM but include additional check for CGI to fit the traditional criteria (Gardiner-Garden & Frommer, 1987). Total number of CGIs obtained by UCSC is less than those obtained by CpGplot, as not every frame is tested for fitting the criteria, but only those having score higher than a threshold on the first step. CGIs predicted by the algorithm of Mikhlem and Hillier are often shorter from both ends comparing to those predicted by CpGplot and also starts and ends with CpG dinucleotides.

Next logical step of CpG searchers development is to implement actual CGI clustering methods (CGCM). There are several such algorithms available: CpGcluster (Hackenberg et al., 2006), CpG clusters (Glass et al., 2007), and CGI HW, an algorithm, developed by H. Wu (Irizarry et al., 2009; H. Wu et al., 2010). These algorithms are based on segmentation of the genome into regions with different frequency of CpG dinucleotides (CGI HW also uses segmentation based on GC content). Unlike methods described above this approach to CGI prediction is data-driven and allows finding CGIs in spices with different average GC-content

**CpGcluster** has two separate steps: a CpG cluster search and an estimation of the probability to find such a cluster by chance. Distance between neighboring CpG dinucleotides in random sequence is simulated by geometric law with CpG frequency as a parameter. Hackenberg and colleagues (Hackenberg et al., 2006) assume that within functional CpG cluster the distance between neighboring CpGs is smaller than expected in random sequence. Authors show that distances smaller than a median of the theoretical distribution is overrepresented in human genome. The median distance between neighboring CpG (23-53 bp depending of the chromosome) is used as a threshold, so each cluster consists of CpGs located no farther than the threshold. All resulting CGIs start and end with a CpG dinucleotide. Each cluster has a pvalue calculated based on negative binomial distribution. Only clusters with p-value less than 1.0e-5 (1.0e-20 in (Hackenberg et al., 2010a)) are considered as CpG islands. Authors find about 200000 CpG islands in human genome (25000 CpG islands using the p-value threshold equal to 1.0e-20). A lot of such CpG islands are shorter than 200 bp. Yet, authors show functionality

**CG clusters** annotation also has two steps. The location of every CpG dinucleotide is extracted from genomic DNA sequences. Using these positions, every overlapping sequence fragment containing a fixed number of CpGs and having variable length is identified. For each number of CpGs, the frequency of each fragment length is recorded. The threshold for each maximum fragment length is defined as a local minimum in the fragment length histogram, estimated by identifying zero values of the first derivative of a cubic spline fit. Mapping the CpG-dense fragments back to the genomic sequence produces an annotation track there each annotated locus is a conglomeration of one or more overlapping fragments of variable length. As the basis for choosing the optimal track the number of overlapping fragments at a locus normalized by the maximum fragment length is used. A track with maximal fragments overlap per locus is selected based on genomic averages of this metric for different numbers of CpGs per fragment. This approach allows authors to choose the

**CGI\_HW** (Algorithm of H. Wu) assumes that each chromosome is divided into 3 states: Alu repetitive elements, baseline, and CGI. Alu-repetitive elements are removed in advance. Hence, authors characterize the problem as that of a semi-HMM, with a known state for Alu repetitive elements, so they consider the 2-state chain conditional on being in a non-Alu

of some short CGIs and call them CpG islets (Hackenberg et al., 2010a).

species-specific optimal number of CpGs per fragment for the final annotation.

**2.3 CpG clustering methods** 

and CpG frequency.

With several methods available for CGI prediction, one is still unable to select the best one. The main reason is the lack of validation criteria. Su and colleagues (Su et al., 2009) propose the cumulative mutual information of CpG dinucleotides as a measure of CGI quality and show that it is a powerful criterion for avoiding CGIs associated with Alu repeats. Despite the power of this mathematical criterion, most authors try to use biological features for CGI validation.

#### **3.1 Sources for biologically relevant validation: DNA methylation and protein binding**

The very first work to mention CG-rich islands (Bird, 1986) considers them to be DNA regions where cytosine is unmethylated. Cytosine methylation usually appears in the CpG context and increases the probability of cytosine deamination about 10-fold (Ehrlich & Wang, 1981), leading to enrichment of TpG and depletion of CpG dinucleotides in DNA. The absence (or decreased level) of cytosine methylation within CGIs is usually considered the origin of CGIs in mammalian genomes (Cross et al., 1994; Eckhardt et al., 2006). More recent research shows that methylated cytosines within CpNpG are also targets for spontaneous deamination (Cooper et al., 2010).
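The CpG depletion described above is commonly quantified by the observed-to-expected CpG ratio that appears in the Gardiner-Garden & Frommer criteria. The sketch below is only an illustration, assuming the usual formula Obs/Exp = (number of CpG × N) / (number of C × number of G) for a sequence of length N. The function name `cpg_obs_exp` and the toy sequences are hypothetical and are not taken from any of the tools discussed in this chapter.

```python
def cpg_obs_exp(seq: str) -> float:
    """Return the observed/expected CpG ratio of a DNA sequence.

    Assumed formula: (count of CpG * N) / (count of C * count of G).
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count sees every CpG.
    cpg = s.count("CG")
    return cpg * n / (c * g)


# Hypothetical toy sequences: one CpG-rich, one where CpG has decayed to TpG.
cpg_rich = "CGCGCGCGAT"
cpg_depleted = "TGCATGCATG"
print(cpg_obs_exp(cpg_rich))
print(cpg_obs_exp(cpg_depleted))
```

A region that has escaped methylation-driven deamination is expected to keep a ratio near or above the island thresholds, whereas bulk methylated DNA drifts toward low values.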

There is no doubt that cytosine methylation plays an important role in CGI function. During early development, waves of methylation and demethylation generate tissue-specific genomic methylation profiles. These profiles are stable across somatic cell generations owing to the replication-dependent maintenance methylation system (Brero et al., 2006). About 70-80% of cytosines in the CpG context are methylated in differentiated cells (Baylin et al., 1998). A recent study shows that cytosine is also methylated within the CpHpN context (where H = C, A or T), especially in embryonic stem cells (Baylin et al., 1998). Cytosine methylation influences DNA structure by facilitating the Z-form conformation (Behe & Felsenfeld, 1981). It also affects protein binding to DNA, so most transcription factors (TFs) bind unmethylated DNA.

There is a class of proteins (e.g. MeCP1/2, MBD1-6, SRA, and Kaiso) that bind exclusively methylated DNA (Saito & Ishikawa, 2002). The MeCP1 protein complex binds methylated cytosine through the MBD2 protein (Berger & Bird, 2005) and also includes the chromatin remodeling complex NuRD/Mi2. MeCP2 is the key and best-studied member of the methyl-binding domain (MBD) protein group (Fatemi & Wade, 2006). Besides the methyl-binding domain it contains a transcription repression domain (TRD) (Dhasarathy & Wade, 2008) and is involved in the establishment of DNA methylation together with DNMT1 (Kimura & Shiota, 2003). There is evidence that both MeCP2 and MBD1/2 bind not just 5mCpG but more complex DNA motifs: MeCP2 binds 5mCpG with an adjacent run of four or more A/T bases, (A/T)4+, which is not true for MBD1/2 proteins (Klose et al., 2005). MeCP2 binds DNA with higher affinity than the MeCP1 complex, leading to more stable repression of transcription. A single 5mCpG dinucleotide is enough for MeCP2 binding, whereas the MeCP1 complex needs dense clusters of 5mCpGs (Ng et al., 1999).

Another well-known group of methyl-binding proteins consists of Kaiso and ZBTB4/33. They contain a zinc-finger domain and bind DNA in a sequence-specific manner. Data on the Kaiso binding site are controversial. Van Roy and McCrea (van Roy & McCrea, 2005) believe that Kaiso binds 5mCG5mCG. Sasai and colleagues (Sasai et al., 2010) assume that the 5mCG5mCG motif is a place where two Kaiso molecules bind, one on each strand. The motif also has to lie in a specific sequence environment. It is also known that Kaiso binds the TNGCAGGA motif containing non-methylated cytosine, but with 1000-fold lower affinity (Daniel et al., 2002). There is some evidence that Kaiso is a global repressor of methylated genes and is essential for early embryonic development. The ZBTB4 protein binds the CYGCCATC motif as well as M5mCGCYAT (Sasai et al., 2010). It has also been shown that proteins of this group demonstrate affinity for hemimethylated DNA (Sasai et al., 2010).
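Candidate sites for the sequence-specific motifs quoted above can be located with a simple degenerate-motif scan. The following is a minimal sketch, assuming standard IUPAC nucleotide codes and searching both strands; the helper names `reverse_complement` and `find_motif` are hypothetical. Sequence alone carries no methylation information, so the scan can only flag where TNGCAGGA or CYGCCATC could occur, not whether a protein binds there.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence (other symbols are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (0-based start, strand) for every match of an IUPAC motif.

    Overlapping matches are reported. Minus-strand hits are converted to
    plus-strand coordinates of the leftmost matched base.
    """
    body = "".join(IUPAC[b] for b in motif.upper())
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + body + "))")
    s = seq.upper()
    hits = [(m.start(), "+") for m in pattern.finditer(s)]
    for m in pattern.finditer(reverse_complement(s)):
        hits.append((len(s) - (m.start() + len(motif)), "-"))
    return sorted(hits)


# Hypothetical toy sequence containing one site for each motif.
toy = "AATTGCAGGACCCCGCCATCTT"
print(find_motif(toy, "TNGCAGGA"))
print(find_motif(toy, "CYGCCATC"))
```

Combining such hits with experimental methylation data would be needed to distinguish the methylated and non-methylated binding modes described in the text.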

Some other proteins also bind methylated DNA. CpG methylation of the CRE motif (TGACGTCA) enhances the DNA binding of C/EBPα (Rishi et al., 2010). UHRF1 and UHRF2 (SET- and RING finger-associated proteins, SRA) bind hemimethylated CpG and the tail of histone H3 in a highly methylation-sensitive manner and help assemble histones and DNA into a nucleosome after replication (Hashimoto et al., 2009).

#### **3.2 Sources for biologically relevant validation: DNA methylation and gene expression**

There are currently two main hypotheses explaining the evolutionary origin of DNA methylation. Some authors believe that the methylation system arose to inactivate viruses and transposons (Walsh et al., 1998). Despite some evidence in favor of this hypothesis, most authors now suppose that the main function of DNA methylation is the control of gene expression during development and cell differentiation, most likely by influencing the binding affinity of different proteins.

Promoter regions of many genes are unmethylated and demonstrate resistance to increasing concentrations of methylating agents (Bestor et al., 1992). Yet if a promoter region becomes methylated, this usually leads to irreversible gene suppression that is stable across cell generations (Razin & Riggs, 1980; Schubeler et al., 2001). However, some genes demonstrate rather high expression independently of the methylation level of their promoters (Shen et al., 2007), and some promoters need methylated cytosine to be activated (Rishi et al., 2010).

Cytosine methylation affects transcription both directly, by changing the affinity of TF binding to DNA, and indirectly, by forming inactive chromatin domains. Both 5mC and T change DNA conformation in core positions of TF binding sites (TFBS). In some cases one methylated cytosine is enough for transcription repression; in other cases the level of expression is negatively correlated with the methylation level but is independent of the exact position of the methylated cytosine. Inhibition of transcription caused by partial DNA methylation can be overcome by enhancers (Hug et al., 1996); however, fully methylated promoters cannot be reactivated that way (Schubeler et al., 2001).

The possibility of active demethylation is still under discussion (S. C. Wu & Zhang, 2010). The cytidine deaminase AID could play a role in this process in mammals (Fritz & Papavasiliou, 2010). Recently it has been shown that the elongation complex can also participate in demethylation (Okada et al., 2010). Even the DNA methyltransferases DNMT3a/b could force cytosine deamination, after which the T-G mismatch is repaired into a correct C-G pair by a GC-biased repair system (S. C. Wu & Zhang, 2010). Overexpression of MBD3 could also play a role in demethylation (S. E. Brown et al., 2008). Yet active demethylation after implantation of the embryo is a very rare event (S. C. Wu & Zhang, 2010).

Different tissues and cell types demonstrate specific cytosine methylation patterns (Ushijima et al., 2003); these patterns in the same tissue of different individuals are similar (Lister et al., 2009) but not identical (Bock et al., 2008). Many regions with tissue-specific methylation profiles (tDMRs) are now known (Rakyan et al., 2008; Brunner et al., 2009; Straussman et al., 2009; Xin et al., 2010). DMRs are likely to be involved in gene imprinting (Lopes et al., 2003). Differential activity of the imprinted alleles of a gene depends on methylation of the promoters, enhancers or silencers of that gene (Li et al., 1993).

Females have one of the X chromosomes inactivated in somatic cells (Gartler & Riggs, 1983). The process of inactivation starts at an early embryonic stage with Xist activation (S. D. Brown, 1991), which leads to chromatin modification and methylation of the promoters of most (Deobagkar & Chandra, 2003) but not all (Zeschnigk et al., 2009) genes. The methylation and gene repression profile of the inactivated X chromosome is stable across cell generations.

Disruption of the normal methylation profile is a distinctive feature of various pathological conditions (Rett syndrome, psychopathologies (Egger et al., 2004), autoimmune diseases (Richardson, 2007), hypertension (Frey, 2005)). Despite much evidence of epigenetic changes in these pathologies, cancer is the best-known disease with epigenetic abnormalities, especially in DNA methylation (Jones & Baylin, 2002; Laird, 2003; Herrera et al., 2008). Tumor cells demonstrate many modifications of epigenetic status: general demethylation of the genome that influences chromatin structure, increased DNA methyltransferase activity, and hypermethylation of the promoter regions of many genes, resulting in their repression. The high probability that 5mC mutates into T gives rise to many cancer-specific mutations. It is important to note that pathological methylation profiles often depend on environmental conditions and are inherited (Liu et al., 2008).

#### **3.3 Sources for biologically relevant validation: CpG islands as promoter regions**

The RNA polymerase II core promoter contains DNA motifs that direct the transcriptional machinery to the transcription start site (TSS). Four DNA motifs are currently known to be part of the core promoter: the TATA box, the TFIIB recognition element (BRE), the initiator (Inr), and the downstream promoter element (DPE) (Kutach & Kadonaga, 2000). The TATA box is an A/T-rich sequence, located about 20-30 nucleotides upstream of the TSS, that binds the TFIID complex (Burley & Roeder, 1996). The BRE, with the consensus SSRCGCC, is located immediately upstream of the TATA element in some promoters and increases the affinity of TFIIB binding (Lagrange et al., 1998). The Inr was originally defined as a motif encompassing the TSS that is sufficient to direct accurate initiation in the absence of a TATA element (Smale, 1997). Inr elements are, however, present in both TATA-containing and TATA-less promoters and play a role in TFIID binding (Chalkley & Verrijzer, 1999). In mammalian promoters, the Inr consensus sequence is RRA+1NWRR, where A+1 is the TSS (Bucher, 1990). The DPE acts cooperatively with the Inr, aiding TFIID binding and accurate transcription initiation in TATA-less promoters (Burley & Roeder, 1996). The DPE is located about 30 nucleotides downstream of the TSS and contains a common GWCG sequence motif.
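The positional constraints in this description (an Inr with its A on the TSS, a DPE roughly 30 nucleotides downstream) can be checked directly on an annotated sequence. The sketch below is a rough illustration under stated assumptions: the consensus strings are used exactly as quoted in the paragraph above, the DPE tolerance of 25 to 35 nucleotides is an arbitrary choice standing in for "about 30", and the names `matches_at` and `core_elements` are hypothetical.

```python
import re

# Only the IUPAC codes needed for the two consensus strings used below.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "W": "[AT]", "N": "[ACGT]"}


def matches_at(seq: str, motif: str, start: int) -> bool:
    """True if the IUPAC motif matches seq exactly at 0-based position start."""
    if start < 0 or start + len(motif) > len(seq):
        return False
    regex = "".join(CODES[b] for b in motif)
    window = seq[start:start + len(motif)].upper()
    return re.fullmatch(regex, window) is not None


def core_elements(seq: str, tss: int, dpe_window=(25, 35)) -> dict:
    """Report Inr and DPE presence around a TSS (0-based index of the +1 base).

    Inr: RRANWRR with the A on the TSS, so the motif starts at tss - 2.
    DPE: GWCG starting anywhere within dpe_window nucleotides downstream
    of the TSS (assumed tolerance around "about 30").
    """
    inr = matches_at(seq, "RRANWRR", tss - 2)
    lo, hi = dpe_window
    dpe = any(matches_at(seq, "GWCG", tss + off) for off in range(lo, hi + 1))
    return {"Inr": inr, "DPE": dpe}


# Hypothetical promoter: Inr "GGACAGG" with its A at index 12, DPE "GACG" at +30.
promoter = "C" * 10 + "GGACAGG" + "C" * 25 + "GACG" + "C" * 4
print(core_elements(promoter, tss=12))
```

Passing the motifs and window as parameters keeps the check usable if a different consensus or spacing is preferred.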

Saxonov and colleagues (Saxonov et al., 2006) demonstrate that human genes have two different promoter types: AT-rich and GC-rich (associated with CGIs). They are easily distinguishable not only by AT or GC content but also by the different motifs overrepresented in each promoter type. One can see that most core promoter elements are GC-rich and could be part of a CGI-associated promoter. CGIs are often located in the 5' regions of genes, mostly overlapping the TSS (Gardiner-Garden & Frommer, 1987; Davuluri et al., 2001; Ponger et al., 2001), and participate in the regulation of transcription initiation (Rozenberg et al., 2008). Housekeeping genes have CGI promoters more frequently than tissue-specific genes (Zhu et al., 2008). However, promoters of tissue-specific genes related to development and embryogenesis are usually located in proximity to CGIs (Robinson et al., 2004).

Many authors believe that CGIs exist because the CpG dinucleotides inside them are protected from methylation. The mechanism of such protection is assumed to be protein binding at CGI boundaries, as has been shown for Sp1 in the promoter of the mouse *aprt* gene (Macleod et al., 1994). Later, the role of Sp1 in the formation of CGI boundaries was shown for other genes (Tomatsu et al., 2002). Sp1 is often associated with CGIs as one of their key features (Macleod et al., 1994; Rozenberg et al., 2008). One of the first works on CGIs (Gardiner-Garden & Frommer, 1987) showed that CGIs contain many G/C boxes (GGGCGG), which act as the core of the Sp1 TFBS (Briggs et al., 1986). Sp1 binds both methylated and unmethylated DNA (Holler et al., 1988). Fan and colleagues (Fan et al., 2007) assume that all proteins with a zinc-finger domain can play a role in the formation of CGI boundaries. Some other proteins, such as VEZF1 (Dickson et al., 2010) and CTCF (Filippova et al., 2005; Recillas-Targa et al., 2006), also participate in this process. Naumann and colleagues (Naumann et al., 2009) show that loss of such a boundary (in fragile X syndrome) leads to the spread of methylation and gene inactivation. Moreover, CGIs containing CTCF binding sites can themselves act as insulators, forming boundaries of chromatin domains (Filippova et al., 2005).

Other DNA-binding proteins with GC-rich binding sites can also decrease the level of DNA methylation (Lin et al., 2000; Recillas-Targa et al., 2006). Most likely, unmethylated CpG islands form open chromatin structures that facilitate transcription (Choi, 2010). Binding sites for Cfp1 (Thomson et al., 2011), E2F (Weinmann et al., 2002), ETS, NRF-1, BoxA, CRE, E-Box (Rozenberg et al., 2008), and p53 (Zemojtel et al., 2009) have been found within CGIs.

Besides TFBS, other DNA motifs are associated with CGI promoters. GC skew, a feature of all unidirectional promoters, is stronger for genes starting within CGIs than for genes lacking this property (Polak et al., 2010). Tandem or simple repeats are also found within CGIs (Hutter et al., 2006). Sequence motifs G2-3C2-3, typical of CGIs, induce local DNA curvature and form G-quadruplexes at the 5' and 3' ends of RNA molecules. G-quadruplexes in DNA restrict methylation of CpG dinucleotides genome-wide (Halder et al., 2010).

#### **3.4 Sources for biologically relevant validation: CpG islands located far from promoter regions**

At least 25% of CpG islands are located far from gene promoters (Ponger et al., 2001). Although many such CGIs overlap with repeats (Graff et al., 1997; Ponger et al., 2001), others do not (Ponger et al., 2001; Hackenberg et al., 2006). They are often located near the 3' region of a gene (Gardiner-Garden & Frommer, 1987) or within the gene (Hackenberg et al., 2006). Such 3' and intragenic CGIs are subject to natural selection not only at the protein level but also at the level of nucleic acids, which confirms their functional significance (Medvedeva et al., 2010).

Many CGIs located far from the promoters of protein-coding genes perform important biological functions. For instance, a CGI within intron 10 of *KCNQ1*, acting as the promoter of an antisense RNA transcript, is involved in the imprinting regulation of the locus (Smilinich et al., 1999). Imprinting of the *MAP3K12* gene is caused by differential methylation of a CGI located in its last exon (Takada et al., 2000). Many CGIs around the 3' ends of genes affect their expression in normal tissues (Appanah, Dickerson et al. 2007) and in cancer (Shiraishi et al., 2002).

Intragenic methylation plays an important role in the regulation of alternative promoters (Maunakea, Nagarajan et al. 2010), modifies chromatin structure (Lorincz, Dickerson et al. 2004), and influences elongation efficiency (Jacquier, 2009).

Recently, several works have shown that CGIs located far from known genes in intergenic regions correspond to previously undetected promoters (Carninci et al., 2005; Medvedeva et al., 2010) that play a role during development (Illingworth et al., 2011).

The CTCF insulator protein, which forms boundaries of active chromatin regions (Bell & Felsenfeld, 2000), often binds the CCCTC core motif, which is common within CGIs.

**CpG islands and mobile elements.** The human genome contains many repetitive sequences with high GC content, so many algorithms find CGIs overlapping with repeats (Alu repeats in human (Graff et al., 1997) and B1 repeats in mouse (Yates et al., 1999)). Cytosines within CGIs associated with Alu repeats are methylated in normal cells, which in turn represses the expansion of the repeat (Xing et al., 2004). Loss of methylation in Alu repeats is typical of tumor cells (Xie et al., 2010). Recently, absence of methylation in Alu repeats was shown for the germ line (Brohede & Rand, 2006). Ullu and Tschudi (Ullu & Tschudi, 1984) believe that Alu repeats are processed pseudogenes of 7SL RNA, and several Alu families still contain an internal RNA polymerase III promoter (Britten et al., 1988). One can expect that CGIs in Alu repeats should have different DNA motifs compared with CGIs in promoters of protein-coding genes transcribed by Pol II. Nevertheless, recent studies show that pervasive Pol II transcription is also a common feature of pseudogenes and transposons (Frith et al., 2006).

Alu repeats are a source of spreading DNA methylation, so unmethylated CGIs contain TFBS for Sp1 and other proteins that protect them from methylation (Caiafa & Zampieri, 2005). Recent studies show that Alu repeats proximal to CpG islands could themselves form a boundary protecting CpG islands from methylation (Feltus et al., 2003).

Taking into consideration all the facts mentioned above, it is clearly too early to disregard Alu and similar repeats when discussing CGI functionality. Most authors (Takai & Jones, 2002; H. Wu et al., 2010) try to build CGI search algorithms that avoid CGIs around Alu repeats. There are some differences in GC content and Obs/ExpCpG (Takai & Jones, 2002), or in the cumulative mutual information of CpG dinucleotides (Su et al., 2009), between CGIs found near Alu repeats and those around promoters of protein-coding genes. Yet most algorithms exclude all repetitive sequences *ab initio*, and therefore all CGIs located within them, removing more than half of the CGIs in doing so. The question remains why the same sequences are of no use in repetitive elements while being essential in unique segments.

**CpG islands and replication origins.** The sequence properties of replication origins in mammals are not well studied. There is some evidence that CpG islands near the 3' region of a gene (Phi-van & Stratling, 1999) or in other genome regions can act as replication origins (Rein et al., 1997; Rein et al., 1999). It is important to note that some CpGs should be methylated in those regions for replication to succeed (Rein et al., 1999).

#### **3.5 Approaches for validation**

Taking into consideration the biological properties mentioned above, DNA methylation is a logically relevant feature for validating CGI predictions. The complicated system of interactions involving CGIs makes it obvious that considering a CGI as merely an unmethylated region is an oversimplification. Because DNA methylation plays an important role in cell differentiation, the same DNA region can be unmethylated at an early stage of development and methylated at later stages (reprogrammed DMR, rDMR), or unmethylated in one tissue and methylated in


Intergenic methylation plays an important role in the regulation of alternative promoters (Maunakea et al., 2010), modifies chromatin structure (Lorincz et al., 2004) and influences elongation efficiency (Jacquier, 2009).

Recently, several works have shown that CGIs located far from known genes in intragenic regions correspond to previously undetected promoters (Carninci et al., 2005; Medvedeva et al., 2010) that play a role during development (Illingworth et al., 2011).

The CTCF insulator protein, which forms boundaries of active chromatin regions (Bell & Felsenfeld, 2000), often binds the CCCTC core motif common within CGIs.

**CpG islands and mobile elements**. The human genome contains many repetitive sequences with high GC content, so many algorithms find CGIs overlapping with repeats (Alu-repeats in human (Graff et al., 1997) and B1-repeats in mouse (Yates et al., 1999)). In normal cells, cytosines within CGIs associated with Alu-repeats are methylated, which in turn represses expansion of the repeat (Xing et al., 2004). Loss of methylation in Alu-repeats is typical for tumor cells (Xie et al., 2010). Recently, absence of methylation in Alu-repeats was shown for the germ line (Brohede & Rand, 2006). Ullu and Tschudi (1984) believe that Alu-repeats are processed pseudogenes of 7SL RNA, and several Alu families still contain an internal promoter for RNA polymerase III (Britten et al., 1988). One can expect that CGIs in Alu-repeats should have different DNA motifs compared to CGIs in promoters of protein-coding genes transcribed by PolII. Nevertheless, recent studies show that pervasive PolII transcription is also a common feature of pseudogenes and transposons (Frith et al., 2006).

Alu-repeats are a source of spreading DNA methylation, so unmethylated CGIs contain TFBS for Sp1 and other proteins that protect them from methylation (Caiafa & Zampieri, 2005). Recent studies show that Alu-repeats proximal to CpG islands could themselves form a boundary protecting CpG islands from methylation (Feltus et al., 2003).

Taking all the facts mentioned above into consideration, it is clearly too early to disregard Alu- and similar repeats when discussing CGI functionality. Most authors (Takai & Jones, 2002; H. Wu et al., 2010) try to build CGI search algorithms that avoid CGIs around Alu-repeats. There are some differences in GC content, Obs/ExpCpG (Takai & Jones, 2002) or in cumulative mutual information of CpG dinucleotides (Su et al., 2009) between CGIs found near Alu-repeats and those around promoters of protein-coding genes. Yet most algorithms exclude all repetitive sequences *ab initio*, and therefore all CGIs located within them, removing more than half of CGIs in doing so. The question remains why the same sequences are considered of no use in repetitive elements yet essential in unique segments.

**CpG islands and replication origins.** Sequence properties of replication origins in mammals are not well studied. There is some evidence that CpG islands near the 3' region of a gene (Phi-van & Stratling, 1999) or in other genome regions can act as replication origins (Rein et al., 1997; Rein et al., 1999). Importantly, some CpGs in those regions must be methylated for replication to succeed (Rein et al., 1999).

#### **3.5 Approaches for validation**

Taking into consideration the biological properties mentioned above, DNA methylation is a logically relevant feature for validating CGI predictions. The complicated system of interactions involving CGIs makes it obvious that considering a CGI as merely an unmethylated region is an oversimplification. Because DNA methylation plays an important role in cell differentiation, the same DNA region can be unmethylated at an early stage of development and methylated at later stages (reprogrammed DMR, rDMR), or unmethylated in one tissue and methylated in another (tissue-specific DMR, tDMR), or unmethylated in one allele and methylated in the other (allele-specific DMR, aDMR), as in the case of imprinting or dosage compensation, or demonstrate cross-individual differences in methylation (individual DMR, iDMR). A more appropriate approach is to associate CGIs with DMRs that demonstrate absence (or a decreased level) of cytosine methylation in only one or a few conditions.

Nevertheless, even methylated CGIs play a role in transcription regulation; some of them contain TSS of protein-coding (Shen et al., 2007) or non-coding genes (Medvedeva et al., 2010). Recently, a mechanism of transcription activation by binding of the C/EBPα transcription factor to the methylated CRE motif (TGACGTCA) was demonstrated (Rishi et al., 2010). Thus, the absence of methylation should not be the only criterion for CGI verification.

Recently, many works dedicated to the prediction of DNA methylation status in different normal tissues (Bock et al., 2008; Zhao & Han, 2009, and references therein) and in cancer (Feltus et al., 2006) have appeared. Various machine learning techniques (support vector machines (Bhasin et al., 2005; Das et al., 2006), alternative decision trees (Carson et al., 2008), discriminant analysis (Feltus et al., 2003)) were used to distinguish between methylated and unmethylated regions. The authors use GC content, different di- and trinucleotides (Das et al., 2006; Fang et al., 2006), Alu-repeat location (Das et al., 2006; Fang et al., 2006), TpG fraction, TFBS, repeats, predicted DNA structures (Bock et al., 2006) and other DNA patterns and properties (Bhasin et al., 2005; Bock et al., 2007; Oakes et al., 2007; Carson et al., 2008; Ehrich et al., 2008) as parameters in those studies. Results obtained by different authors cannot be compared, since in every case the model is built on a distinct set of tissues and usually not in a genome-wide manner. Features demonstrating high selectivity in one work do not do so in others. The consistency of features is low, so one can conclude that those models are overfitted.

Promoter proximity is another traditional key feature for CGI validation. The most popular criterion is the fraction of predicted CGIs located near promoter regions of protein-coding genes. Alu-repeats are usually used as a negative set. SWMs with higher thresholds for length, GC content and Obs/ExpCpG (Takai & Jones, 2003; Han & Zhao, 2009) and clustering algorithms (Glass et al., 2007; Hackenberg et al., 2010a; H. Wu et al., 2010) show the best results. The Takai-Jones algorithm predicts 40% of CGIs to be located near promoters of RefSeq genes, while CpGcluster can reach 50% of all CGIs near promoter regions (with p-value = 1.0e-20). Wu and colleagues (H. Wu et al., 2010) believe that CGHW predicts more CGIs located near promoters of RefSeq genes compared to UCSC CGI and CG clusters.
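The promoter-proximity criterion can be made concrete with a short calculation. The Python sketch below counts the fraction of predicted CGIs that have at least one TSS within a chosen distance. It is only an illustration: the function name, the half-open (start, end) coordinates on a single chromosome and the default distance of 1000 bp are assumptions, since the text does not specify how "near" is defined in the cited studies.

```python
from bisect import bisect_left


def fraction_near_tss(cgis, tss_positions, max_dist=1000):
    """Fraction of CGIs with a TSS in [start - max_dist, end + max_dist).

    cgis          -- list of (start, end) half-open intervals on one chromosome
    tss_positions -- list of TSS coordinates on the same chromosome
    max_dist      -- hypothetical proximity threshold in bp (an assumption)
    """
    if not cgis:
        return 0.0
    tss = sorted(tss_positions)
    near = 0
    for start, end in cgis:
        # Index of the first TSS at or to the right of the extended left edge.
        i = bisect_left(tss, start - max_dist)
        # That TSS is "near" if it also lies left of the extended right edge.
        if i < len(tss) and tss[i] < end + max_dist:
            near += 1
    return near / len(cgis)
```

Because the result depends directly on `max_dist` and on which TSS set is supplied, values reported by different authors are comparable only when both are the same.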

Although about half of CGIs are located near TSS of protein-coding genes, the rest are not. Lately, various lines of evidence for pervasive transcription have appeared (Carninci et al., 2005). New high-throughput techniques (CAGE, SAGE, etc.) identify at least ten times more transcriptionally active regions than there are protein-coding genes. Most of those regions contain TSS for ncRNAs of different types. CGIs located far from TSS of protein-coding genes can act as promoters of such ncRNAs. Nowadays, the discovery of a new protein-coding gene is a rare occasion. Nevertheless, our knowledge of ncRNA genes is extremely incomplete. On the other hand, one should not forget that mammalian genomes have not only CGI-dependent promoters but also TATA-dependent ones (Saxonov et al., 2006). The proportion of the two types is still unclear. Therefore, the fraction of CGIs associated with promoters of protein-coding genes is not an appropriate measure.

Other genomic features, like insulators, replication origins and recombination hot-spots, are also co-located with CGIs and make the whole picture more complicated. It is also becoming clear that a CGI is not functionally equipotential throughout its length. A CGI is not only a region with high GC content and CpG frequency. Even in very early works on CGIs, the (G/C)-box was mentioned as a structural element. Currently, it is obvious that not only Sp1 but also many different TFs bind DNA within CGIs, so a large fraction of CGIs contain TFBS and their clusters. Also, at least some CGIs have boundary regions containing binding sites for Sp1, CTCF, VEZF1 or other TFs. Recently it was shown that a G-quadruplex could also form a CGI boundary. It should be emphasized that the quality of prediction of a biologically relevant feature is higher if the method uses not only CGI prediction but also other sequence properties. Therefore, the concept of a complex CGI definition, based not only on GC or CpG content but also on other features like TFBS, repeats or DNA structure elements, looks promising.
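As a simple illustration of adding a motif-level feature to a CGI description, the Python sketch below locates a short core motif on both strands of a sequence. It assumes the GC-box core GGGCGG as the example motif. An exact string match is a deliberate simplification and the function names are hypothetical; real TFBS prediction normally relies on position weight matrices.

```python
import re

GC_BOX = "GGGCGG"  # GC-box core used here as an example motif (assumption)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_motif_both_strands(seq, motif=GC_BOX):
    """Return sorted (position, strand) hits of an exact motif on both strands.

    Positions are 0-based starts on the forward strand. A lookahead is used
    so that overlapping occurrences are all reported.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        for match in re.finditer("(?=" + re.escape(pattern) + ")", seq):
            hits.append((match.start(), strand))
    return sorted(hits)
```

The number of hits per CGI, or their position relative to the CGI boundaries, could then serve as one additional feature alongside GC content and CpG frequency.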

#### **4. Unsolved problems and perspectives**

Despite the huge amount of work in the area, a commonly accepted definition of CpG islands still does not exist. Most likely this situation results from difficulties with biological verification of predictions (Segal, 2006). Authors of SWMs and, to a lesser extent, of clustering algorithms choose the parameters arbitrarily, which complicates biological interpretation. Authors of machine-learning techniques usually find many distinguishing parameters to be important in their models that are not important when similar processes are modeled in other cases.

Specifically, it should be emphasized that all attempts to construct a CGI prediction algorithm based on simple DNA sequence properties (GC content, Obs/ExpCpG, distance between neighbouring CpG dinucleotides) with the aim of predicting a complex biological feature (promoter regions, unmethylated regions and so on) result in a high level of false positive predictions. For example, in the case of promoter CGI prediction, at least one third of CGIs are located far from promoters. There is no doubt that existing CGI searchers find a chimeric class of DNA segments that do not share a single common function. A collection of DNA motifs relevant to different biological functions could result in a more adequate CGI definition. For instance, GC-skew and known core promoter elements could help to find CGIs, or regions within them, related to TSS.
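The simple sequence properties named above are easy to compute, which is part of their appeal. The Python sketch below calculates them for a single sequence, using the usual observed/expected CpG ratio, (number of CpG × N) / (number of C × number of G). The function names are hypothetical. The default thresholds in `passes_thresholds` are the commonly quoted Gardiner-Garden & Frommer values (length of at least 200 bp, GC content above 0.5, Obs/Exp above 0.6) and are used here only as an example.

```python
def cpg_stats(seq):
    """Return length, GC content, Obs/Exp CpG and inter-CpG distances for seq."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # 0-based positions of the C in every CpG dinucleotide.
    cpg_positions = [i for i in range(n - 1) if seq[i:i + 2] == "CG"]
    gc_content = (c + g) / n if n else 0.0
    # Observed/expected CpG ratio: (#CpG * N) / (#C * #G).
    obs_exp = (len(cpg_positions) * n) / (c * g) if c and g else 0.0
    distances = [b - a for a, b in zip(cpg_positions, cpg_positions[1:])]
    return {
        "length": n,
        "gc_content": gc_content,
        "obs_exp_cpg": obs_exp,
        "cpg_distances": distances,
    }


def passes_thresholds(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Check a sequence against illustrative length, GC and Obs/Exp thresholds."""
    stats = cpg_stats(seq)
    return (
        stats["length"] >= min_length
        and stats["gc_content"] > min_gc
        and stats["obs_exp_cpg"] > min_obs_exp
    )
```

A segment that passes such thresholds is only a candidate: nothing in these three numbers distinguishes a promoter-associated CGI from one lying in a repeat.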

Regarding another feature of CGIs, namely the lack of DNA methylation, it should be mentioned that new high-throughput techniques show that not all CpGs within CGIs are unmethylated in normal cells, as previously believed. It has now become clear that not only CpGs but also CpNpGs are subject to methylation (Lister et al., 2009). Such a motif should also be included in CGI prediction models (Hackenberg et al., 2010b).

The ability of a CGI searcher to predict DMRs rather than unmethylated regions seems more appropriate for quality evaluation (Dai et al., 2008; Rakyan et al., 2008; Previti et al., 2009). Unfortunately, we still lack high-quality, high-resolution data on genome-wide DNA methylation in different tissues, developmental stages and conditions. High-throughput techniques like MeDIP, MeDIP-seq (Down et al., 2008), MethylCap-Seq (Brinkman et al., 2010) and bisulphite conversion based methods (RRBS (Eckhardt et al., 2006) and Methyl-seq (Lister et al., 2009)) let us hope for a complete map of DMRs in the near future, which will help with CGI validation.

There is a lot of evidences that methylated cytosine also could play important functional role as sites for methyl-binding proteins. We still haven't enougth relaibale data on motif preferences for all such proteins but we expect ChIP-seq (Mardis, 2007) technique to help

Algorithms for CpG Islands Search: New Advantages and Old Problems 461

In general one could see that CGI HW finds more "relaxed" CGIs comparing to UCSC CGI (with lower GC-content, Obs/ExpCpG value and CpG frequency), whereas CpGcluster finds

**TSS prediction.** It's widely accepted that a large fraction of CGIs is found around TSS of protein-coding genes. Recent studies show that total amount of TSS is about 10-times higher than the amount of protein-coding genes, so it seems more appropriate to test the CGI searchers for their ability to find TSS of any type. Several experimental techniques are able to detect any type of TSSs. Cap analysis gene expression (CAGE) is one of the most known techniques to produce a snapshot of the 5' ends of the total cellular RNA transcribed by PolII. A collection of CAGE-tags (encodeRikenCagePlus and encodeRikenCageMinus tables

**CGI fraction** 0.0136 0.0090 0.0130 0.0152 0.0164 **CAGE fraction** 0.7274 0.7909 0.4632 0.4331 0.3903 **Sn** 0.0136 0.0091 0.0128 0.0149 0.0158 **Sp** 0.9869 0.9773 0.9900 0.9927 0.9943

Table 2 shows that CGI HW has the lowest sensitivity, although they obtain the highest fraction of CAGE-tags clusters. CpGcluster20 demonstrates the highest selectivity and specificity but obtain only 39% of CAGE-tags clusters. UCSC CGI has the intermediate

**TFBS prediction.** Although TFBS prediction is a classical problem for computational molecular biology, prediction of one single but highly reliable TFBS still remains tricky. I used TFBS conserved in the human/mouse/rat alignment based on Transfac Matrix Database (tfbsConsSites and tfbsConsFactors tables from UCSC). Keeping in mind that using of conserved TFBS leads to omission of all types of species-specific regulation regions,

Table 3 demonstrates that CpGcluster predicts CGI with fewer different TFs and lower sensitivity comparing to USCS CGI and CGI HW. The highest fraction of total TFBS length is covered by CGI HW, the very same algorithm shows the highest sensitivity and the lowest specificity. It's not obvious what fraction of the CGIs one should expect to be covered by

 **UCSC CGI CGI HW CpGcluster10 CpGcluster15 CpGcluster20 #TF** 167 167 161 154 153 **CGI fraction** 0.1834 0.1347 0.1896 0.1915 0.1917 **TFBS fraction** 0.0860 0.1098 0.0688 0.0509 0.0393 **Sn** 0.0676 0.0696 0.0567 0.0443 0.0355 **Sp** 0.9889 0.9796 0.9916 0.9938 0.9952

conserved TFBS are more likely to be functional comparing to other predicted TFBS.

TFBS but CpGcluster20 demonstrates the largest coverage (about 19 %).

**UCSC CGI CGI HW CpGcluster10 CpGcluster15 CpGcluster20** 

more "strict" CGIs comparing to UCSC CGI.

from UCSC) was used as a representative set of PolII TSS.

Table 2. CAGE-tags clusters within different CGIs.

Table 3. Conserved TFBS within different CGIs.

**5.2 Regulatory potential** 

values of Sn and Sp.

with the issue. There are proofs showing that it's premature to exclude Alu- and other repetitive mostly methylated sequences out of considereation speaking on CGI functions. To resolve mentioned problems it is necessary to figure out as many biological functions associating with CGIs as possible and to find out structure elements within CGI relating to those functions or to separate CGI on several different functional groups. Such approach should result in more precise and biologically adequate CGIs definition and, therefore construction of relevant algorithm with low false positive and negative rates which in turn will improve our knowledge in genetic and epigenetic regulation of genome functioning.

#### **5. Comparison of different algorithms**

A lot of comparisons between algorithms for CGI search have been performed. This work is focused on study of various genome features potentially relates to CGIs. Three algorithms for CpG islands search participate in the comparison: UCSC CGI, CpGcluster (with p-value threshold of clusters equal to 1.0e-10, 1.0e-15, and 1.0e-20) and CGHW (the algorithm implemented by Wu and colleagues). I prefer to focus on the algorithms of a "new wave" and UCSC CGI as a reference because the last one is the most widespread now.

ENCODE regions of human genome (version hg18) were used for the study. All annotations were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/.

Standard sensitivity (3) and specificity (4) measures for prediction quality were used.

$$\mathbf{Sn} = \mathbf{L}\_{\rm TP} \;/\; \text{( $\mathbf{L}\_{\rm FP} \star \mathbf{L}\_{\rm FN}$ )}\tag{3}$$

$$\mathbf{Sp} = \mathbf{L}\_{\text{TN}} / \text{ (}\mathbf{L}\_{\text{FP}} + \mathbf{L}\_{\text{TN}}\text{)}\tag{4}$$

where LTP – total length (bp) of overlap of CGIs with tested annotation, LFP – total length (bp) of CGIs not overlapping with tested annotation, LFN - total length (bp) of tested annotation not overlapping with CGIs, LTN - total length (bp) of ENCODE regions not overlapping neither with tested annotation no with CGIs.

#### **5.1 Basic statistics**

As a first step I collected the summary of statictical properties of CGIs predicted by different algorithms. CGI HW covers more then 2.2 % of total length of all ENCODE regions. CpGcluster (p-value 1.0 e-20 as recommended in (Hackenberg et al., 2010a)) demonstrate the smallest genome coverage of 0.6%. CpGcluster predicts shorter CGIs with higher average GC-content and Obs/ExpCpG value comparing to other algorithms. UCSC CGI obtains the largest average number of CpGs per one CGI.


Table 1. Basic statistics for different CGIs.

In general one could see that CGI HW finds more "relaxed" CGIs comparing to UCSC CGI (with lower GC-content, Obs/ExpCpG value and CpG frequency), whereas CpGcluster finds more "strict" CGIs comparing to UCSC CGI.

#### **5.2 Regulatory potential**

460 Bioinformatics – Trends and Methodologies

with the issue. There are proofs showing that it's premature to exclude Alu- and other repetitive mostly methylated sequences out of considereation speaking on CGI functions. To resolve mentioned problems it is necessary to figure out as many biological functions associating with CGIs as possible and to find out structure elements within CGI relating to those functions or to separate CGI on several different functional groups. Such approach should result in more precise and biologically adequate CGIs definition and, therefore construction of relevant algorithm with low false positive and negative rates which in turn will improve our knowledge in genetic and epigenetic regulation of genome functioning.

A lot of comparisons between algorithms for CGI search have been performed. This work is focused on study of various genome features potentially relates to CGIs. Three algorithms for CpG islands search participate in the comparison: UCSC CGI, CpGcluster (with p-value threshold of clusters equal to 1.0e-10, 1.0e-15, and 1.0e-20) and CGHW (the algorithm implemented by Wu and colleagues). I prefer to focus on the algorithms of a "new wave"

ENCODE regions of human genome (version hg18) were used for the study. All annotations were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/. Standard sensitivity (3) and specificity (4) measures for prediction quality were used.

where LTP – total length (bp) of overlap of CGIs with tested annotation, LFP – total length (bp) of CGIs not overlapping with tested annotation, LFN - total length (bp) of tested annotation not overlapping with CGIs, LTN - total length (bp) of ENCODE regions not

As a first step I collected the summary of statictical properties of CGIs predicted by different algorithms. CGI HW covers more then 2.2 % of total length of all ENCODE regions. CpGcluster (p-value 1.0 e-20 as recommended in (Hackenberg et al., 2010a)) demonstrate the smallest genome coverage of 0.6%. CpGcluster predicts shorter CGIs with higher average GC-content and Obs/ExpCpG value comparing to other algorithms. UCSC CGI obtains the

**#CGI** 507 1124 1093 633 418 **CGI total length** 396722 685514 303160 222603 172676 **avarage length** 782 610 277 352 413 **avarage GC content** 0.66 0.64 0.7 0.71 0.72 **avarage #CpG per CGI** 71 48 29 38 46 **avarage Obs/ExpCpG** 0.86 0.74 0.91 0.92 0.92 **ENCODE fraction** 0.0132 0.0229 0.0101 0.0074 0.0058

Sn = LTP / (LFP+LFN), (3)

Sp = LTN / (LFP+LTN), (4)

**UCSC CGI HW CpGcluster10 CpGcluster15 CpGcluster20** 

and UCSC CGI as a reference because the last one is the most widespread now.

**5. Comparison of different algorithms** 

overlapping neither with tested annotation no with CGIs.

largest average number of CpGs per one CGI.

Table 1. Basic statistics for different CGIs.

**5.1 Basic statistics** 

**TSS prediction.** It's widely accepted that a large fraction of CGIs is found around TSS of protein-coding genes. Recent studies show that total amount of TSS is about 10-times higher than the amount of protein-coding genes, so it seems more appropriate to test the CGI searchers for their ability to find TSS of any type. Several experimental techniques are able to detect any type of TSSs. Cap analysis gene expression (CAGE) is one of the most known techniques to produce a snapshot of the 5' ends of the total cellular RNA transcribed by PolII. A collection of CAGE-tags (encodeRikenCagePlus and encodeRikenCageMinus tables from UCSC) was used as a representative set of PolII TSS.


Table 2. CAGE-tags clusters within different CGIs.

Table 2 shows that CGI HW has the lowest sensitivity, although its CGIs capture the highest fraction of CAGE-tags clusters. CpGcluster20 demonstrates the highest selectivity and specificity but captures only 39% of CAGE-tags clusters. UCSC CGI has intermediate values of Sn and Sp.
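The length-based measures Sn and Sp reported in these comparisons follow Equations (3) and (4). As a rough sketch of how they could be evaluated for one region, the code below merges two interval lists, measures their overlap, and applies the two formulas exactly as they are printed in this chapter (note that the denominator of Sn is the sum of false-positive and false-negative lengths). The half-open coordinate convention and all function names are assumptions made for the illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one region


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping or touching ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Total number of bases covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap_length(a: List[Interval], b: List[Interval]) -> int:
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def sn_sp(cgis: List[Interval], annotation: List[Interval],
          region_length: int) -> Tuple[float, float]:
    """Sn and Sp as written in Equations (3) and (4)."""
    l_tp = overlap_length(cgis, annotation)
    l_fp = total_length(cgis) - l_tp
    l_fn = total_length(annotation) - l_tp
    l_tn = region_length - l_tp - l_fp - l_fn
    sn = l_tp / (l_fp + l_fn) if (l_fp + l_fn) else float("nan")
    sp = l_tn / (l_fp + l_tn) if (l_fp + l_tn) else float("nan")
    return sn, sp
```

For example, `sn_sp([(0, 100)], [(50, 150)], 1000)` corresponds to 50 bp of overlap, 50 bp of CGI outside the annotation, 50 bp of annotation outside CGIs and 850 bp covered by neither.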

**TFBS prediction.** Although TFBS prediction is a classical problem in computational molecular biology, predicting even a single highly reliable TFBS remains difficult. I used TFBS conserved in the human/mouse/rat alignment based on the Transfac Matrix Database (tfbsConsSites and tfbsConsFactors tables from UCSC). Although restricting the analysis to conserved TFBS omits all types of species-specific regulatory regions, conserved TFBS are more likely to be functional than other predicted TFBS.

Table 3 demonstrates that CpGcluster predicts CGIs with fewer different TFs and lower sensitivity than UCSC CGI and CGI HW. The highest fraction of total TFBS length is covered by CGI HW; the same algorithm shows the highest sensitivity and the lowest specificity. It is not obvious what fraction of the CGIs one should expect to be covered by TFBS, but CpGcluster20 demonstrates the largest coverage (about 19%).


Table 3. Conserved TFBS within different CGIs.


As it is difficult to estimate the expected coverage of TFBS, I compared the coverage of CGIs with the coverage of their adjacent regions of 100 bp. The results in Table 4 show that all regions adjacent to CGIs contain conserved TFBS.

| | UCSC CGI | CGI HW | CpGcluster10 | CpGcluster15 | CpGcluster20 |
|---|---|---|---|---|---|
| #TF | 157 | 167 | 166 | 162 | 151 |
| CGI fraction | 0.0564 | 0.0648 | 0.0871 | 0.0859 | 0.0820 |
| TFBS fraction | 0.0069 | 0.0177 | 0.0231 | 0.0132 | 0.0083 |
| Sn | 0.0063 | 0.0143 | 0.0189 | 0.0117 | 0.0077 |
| Sp | 0.9967 | 0.9928 | 0.9931 | 0.9960 | 0.9974 |
| TFBS ratio | 12.38 | 6.21 | 2.98 | 3.86 | 4.72 |

Table 4. Conserved TFBS within +/- 100 bp around different CGIs.

The last row of Table 4 demonstrates the reduction of coverage in the regions adjacent to CGIs compared with the CGI bodies. The adjacent regions of UCSC CGI and CGI HW contain more than 12 and 6 times less TFBS than the CGI body, respectively. One should expect some TFBS around a CGI, which can function as the CGI's boundaries. On the other hand, if we believe that the CGI itself is the regulatory region, the expected amount of TFBS in the adjacent regions should be dramatically lower than in the CGI body, which is not the case for CpGcluster.

**Insulators.** CTCF is well known as a DNA-binding protein acting both as a transcription factor and as an insulator protein. To test which CGI prediction algorithm finds more CTCF binding sites, I used data on CTCF binding (oreganno and oregannoAttr tables from UCSC). One can see in Table 5 that CGI HW shows the highest sensitivity in CTCF binding prediction. It is also important to mention that CGIs from CGI HW contain more than 25% of all CTCF sites. CpGcluster10 shows the second best result, and the quality of prediction decreases for CpGcluster15 and CpGcluster20.

| | UCSC CGI | CGI HW | CpGcluster10 | CpGcluster15 | CpGcluster20 |
|---|---|---|---|---|---|
| CGI fraction | 0.0809 | 0.0658 | 0.0503 | 0.0569 | 0.0478 |
| CTCF fraction | 0.1395 | 0.2517 | 0.1871 | 0.1224 | 0.0680 |
| Sn | 0.0267 | 0.0434 | 0.0305 | 0.0241 | 0.0157 |
| Sp | 0.9872 | 0.9806 | 0.9916 | 0.9939 | 0.9953 |

Table 5. CTCF binding sites within different CGIs.

**DNase sensitivity regions.** DNase sensitivity regions are often considered to be regions of open chromatin, which correspond to regulatory regions of all types. To test which algorithm predicts CGIs that are more often associated with DNase sensitivity regions, I used the combined data for several tissues available in UCSC (table wgEncodeRegDnaseClustered). All CGIs demonstrate rather good association with DNase sensitivity regions: about one third or more of their length (32-60%, Table 6) is located in sensitive areas. UCSC CGI shows the highest sensitivity and rather good specificity. A large fraction of the length of CpGcluster CGIs is also associated with DNase sensitivity regions, although the sensitivity of the algorithm is not very good.

| | UCSC CGI | CGI HW | CpGcluster10 | CpGcluster15 | CpGcluster20 |
|---|---|---|---|---|---|
| CGI fraction | 0.6047 | 0.3221 | 0.4312 | 0.5872 | 0.6040 |
| DNase fraction | 0.0768 | 0.0707 | 0.0418 | 0.0418 | 0.0334 |
| Sn | 0.0789 | 0.0655 | 0.0413 | 0.0424 | 0.0338 |
| Sp | 0.9942 | 0.9827 | 0.9936 | 0.9966 | 0.9975 |

Table 6. DNase sensitivity regions within CGIs predicted by different algorithms and quality of prediction.

**Differentially methylated regions.** Data on regions differentially methylated during development were downloaded from UCSC (table rdmr). Table 7 shows that CGI HW predicts CGIs located near over 43% of all rDMRs. This algorithm also demonstrates the best sensitivity in this case. It should be mentioned that CpGcluster20 has the lowest sensitivity, and its CGIs are located near only about 8% of rDMRs.

| | UCSC CGI | CGI HW | CpGcluster10 | CpGcluster15 | CpGcluster20 |
|---|---|---|---|---|---|
| #rDMR fraction | 0.2500 | 0.4310 | 0.2241 | 0.1293 | 0.0776 |
| CGI fraction | 0.0170 | 0.0262 | 0.0179 | 0.0161 | 0.0137 |
| rDMR fraction | 0.0534 | 0.1424 | 0.0432 | 0.0284 | 0.0187 |
| Sn | 0.0132 | 0.0231 | 0.0130 | 0.0105 | 0.0080 |
| Sp | 0.9869 | 0.9776 | 0.9900 | 0.9927 | 0.9943 |

Table 7. rDMRs within different CGIs.

**Replication origins.** To determine whether any of the CGI searchers preferentially finds replication origins, data from the encodeUvaDnaRepOriginsNSGM table were used. Only CGI HW and CpGcluster10 find replication origins within or around (+/- 100 bp) their CGIs (5 and 2 origins, respectively). The other algorithms (including CpGcluster with stricter parameters) are unable to find any replication origins.

**Polymorphic loci.** Data from SNP130 were used to study polymorphic loci within different CGIs. CGIs from CGI HW contain the highest fraction of SNPs and demonstrate the highest sensitivity, so one should expect more interindividual variation within those CGIs.

| | UCSC CGI | CGI HW | CpGcluster10 | CpGcluster15 | CpGcluster20 |
|---|---|---|---|---|---|
| CGI fraction | 0.0072 | 0.0082 | 0.0080 | 0.0073 | 0.0066 |
| SNP fraction | 0.0140 | 0.0276 | 0.0120 | 0.0080 | 0.0056 |
| Sn | 0.0048 | 0.0064 | 0.0049 | 0.0038 | 0.0031 |
| Sp | 0.9868 | 0.9771 | 0.9899 | 0.9926 | 0.9942 |

Table 8. SNPs within different CGIs.
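Two of the comparisons above rely on a +/- 100 bp window around each CGI: the adjacent-region TFBS coverage (Table 4) and the test of whether a replication origin lies within or around a CGI. A minimal sketch of both operations is given below. It assumes half-open coordinates on a single chromosome, and the helper names are hypothetical rather than taken from any tool used in this study.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end), single chromosome


def flanks(cgi: Interval, width: int = 100,
           region_start: int = 0,
           region_end: Optional[int] = None) -> List[Interval]:
    """Left and right adjacent regions of a CGI, clipped to the region."""
    start, end = cgi
    left = (max(region_start, start - width), start)
    right_stop = end + width if region_end is None else min(region_end, end + width)
    right = (end, right_stop)
    # Drop flanks that become empty after clipping.
    return [iv for iv in (left, right) if iv[0] < iv[1]]


def is_near(feature: Interval, cgi: Interval, width: int = 100) -> bool:
    """True if the feature overlaps the CGI extended by `width` on both sides."""
    return feature[0] < cgi[1] + width and feature[1] > cgi[0] - width


def count_near(features: List[Interval], cgis: List[Interval],
               width: int = 100) -> int:
    """Number of features located within or around at least one CGI."""
    return sum(any(is_near(f, c, width) for c in cgis) for f in features)
```

The flank intervals returned by `flanks` can be passed to any overlap routine to measure annotation coverage in the adjacent regions, while `count_near` gives a count of the kind reported for replication origins.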

#### **6. Conclusions**

In summary, no single algorithm for CGI search predicts all biologically relevant features with appropriate accuracy. In all cases, many false positives and false negatives appear.

All the algorithms compared have their strengths. CpGcluster (p-value = 1.0e-15 and p-value = 1.0e-20) demonstrates the highest specificity in TSS prediction. Although such CGIs capture the smallest fraction of CAGE-tags, this may not be a disadvantage, as we do not know for sure the proportion of GC- and AT-rich promoters. The largest fraction of CGI length is covered by TFBS in the case of CGIs predicted by CpGcluster; on the other hand, the largest part of their adjacent regions is also covered by TFBS. This brought me to the conclusion that CpGcluster finds "cropped" promoter CGIs, especially in the case of p-value = 1.0e-20.

In contrast, CGI HW demonstrates the best sensitivity in the prediction of CTCF binding sites and rDMRs. CGIs from CGI HW are associated with at least some origins of replication, whereas those of the other algorithms (with recommended parameters) are not. They are also more prone to capture variation between humans, and they find the highest fraction of TSSs. Thus, CGI HW finds regions with broad regulatory potential. However, all those features are related to DNA methylation, which allows me to assume that CGI HW finds DMR-associated CGIs.

UCSC CGI demonstrates moderate behavior. This algorithm has intermediate sensitivity in both TSS and rDMR prediction. Its CGIs show the greatest decrease of TFBS in the regions adjacent to CGIs and the highest sensitivity to DNase. It looks like UCSC finds CGIs around promoters and also includes regulatory regions, so those are promoter-region CGIs.

It is quite clear that a CGI is a complex object that does not correspond to any single biological feature. It seems more appropriate to single out a class of interconnected biological features: differential DNA methylation, active transcription in at least one cell type or developmental stage, and replication. The CGI HW algorithm made the first step in this direction, whereas CpGcluster (with a strict p-value threshold) moves in the opposite direction and finds a specific narrow class of promoters. The traditional UCSC approach still holds its ground, demonstrating comparable or in some respects even higher quality. Hence the CpG island problem is still far from a final solution.

#### **7. Acknowledgments**

The author is very grateful to N. Oparina, V. Makeev, I. Artamonova and A. Favorov for fruitful discussions on the topic of this article. This study was partially supported by RFBR grant 11-04-02016-a and by state contract P1376 of the Federal Special Program "Scientific and educational human resources of innovative Russia" for 2009 – 2013.

#### **8. References**

Baylin, S. B., J. G. Herman, J. R. Graff, P. M. Vertino and J. P. Issa (1998). Alterations in DNA methylation: a fundamental aspect of neoplasia. *Adv Cancer Res*, Vol.72, (1998), pp. 141-96

Behe, M. and G. Felsenfeld (1981). Effects of methylation on a synthetic polynucleotide: the B--Z transition in poly(dG-m5dC).poly(dG-m5dC). *Proc Natl Acad Sci U S A*, Vol.78, No.3, (Mar, 1981), pp. 1619-23

Bell, A. C. and G. Felsenfeld (2000). Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. *Nature*, Vol.405, No.6785, (May 25, 2000), pp. 482-5

Berger, J. and A. Bird (2005). Role of MBD2 in gene regulation and tumorigenesis. *Biochem Soc Trans*, Vol.33, No.Pt 6, (Dec, 2005), pp. 1537-40

Bestor, T. H., G. Gundersen, A. B. Kolsto and H. Prydz (1992). CpG islands in mammalian gene promoters are inherently resistant to de novo methylation. *Genet Anal Tech Appl*, Vol.9, No.2, (Apr, 1992), pp. 48-53

Bhasin, M., H. Zhang, E. L. Reinherz and P. A. Reche (2005). Prediction of methylated CpGs in DNA sequences using a support vector machine. *FEBS Lett*, Vol.579, No.20, (Aug 15, 2005), pp. 4302-8

Bird, A. P. (1986). CpG-rich islands and the function of DNA methylation. *Nature*, Vol.321, No.6067, (May 15-21, 1986), pp. 209-13

Bock, C., M. Paulsen, S. Tierling, T. Mikeska, T. Lengauer and J. Walter (2006). CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. *PLoS Genet*, Vol.2, No.3, (Mar, 2006), pp. e26

Bock, C., J. Walter, M. Paulsen and T. Lengauer (2007). CpG island mapping by epigenome prediction. *PLoS Comput Biol*, Vol.3, No.6, (Jun, 2007), pp. e110

Bock, C., J. Walter, M. Paulsen and T. Lengauer (2008). Inter-individual variation of DNA methylation and its implications for large-scale epigenome mapping. *Nucleic Acids Res*, Vol.36, No.10, (Jun, 2008), pp. e55

Brero, A., H. Leonhardt and M. C. Cardoso (2006). Replication and translation of epigenetic information. *Curr Top Microbiol Immunol*, Vol.301, (2006), pp. 21-44

Briggs, M. R., J. T. Kadonaga, S. P. Bell and R. Tjian (1986). Purification and biochemical characterization of the promoter-specific transcription factor, Sp1. *Science*, Vol.234, No.4772, (Oct 3, 1986), pp. 47-52

Brinkman, A. B., F. Simmer, K. Ma, A. Kaan, J. Zhu and H. G. Stunnenberg (2010). Whole-genome DNA methylation profiling using MethylCap-seq. *Methods*, Vol.52, No.3, (Nov, 2010), pp. 232-6

Britten, R. J., W. F. Baron, D. B. Stout and E. H. Davidson (1988). Sources and evolution of human Alu repeated sequences. *Proc Natl Acad Sci U S A*, Vol.85, No.13, (Jul, 1988), pp. 4770-4

Brohede, J. and K. N. Rand (2006). Evolutionary evidence suggests that CpG island-associated Alus are frequently unmethylated in human germline. *Hum Genet*, Vol.119, No.4, (May, 2006), pp. 457-8

Brown, S. D. (1991). XIST and the mapping of the X chromosome inactivation centre. *Bioessays*, Vol.13, No.11, (Nov, 1991), pp. 607-12

Brown, S. E., M. J. Suderman, M. Hallett and M. Szyf (2008). DNA demethylation induced by the methyl-CpG-binding domain protein MBD3. *Gene*, Vol.420, No.2, (Sep 1, 2008), pp. 99-106

Brunner, A. L., D. S. Johnson, S. W. Kim, A. Valouev, T. E. Reddy, N. F. Neff, E. Anton, C. Medina, L. Nguyen, E. Chiao, et al. (2009). Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. *Genome Res*, Vol.19, No.6, (Jun, 2009), pp. 1044-56

Bucher, P. (1990). Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. *J Mol Biol*, Vol.212, No.4, (Apr 20, 1990), pp. 563-78

Burley, S. K. and R. G. Roeder (1996). Biochemistry and structural biology of transcription factor IID (TFIID). *Annu Rev Biochem*, Vol.65, (1996), pp. 769-99

Caiafa, P. and M. Zampieri (2005). DNA methylation and chromatin structure: the puzzling CpG islands. *J Cell Biochem*, Vol.94, No.2, (Feb 1, 2005), pp. 257-65

Carninci, P., T. Kasukawa, S. Katayama, J. Gough, M. C. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, et al. (2005). The transcriptional landscape of the mammalian genome. *Science*, Vol.309, No.5740, (Sep 2, 2005), pp. 1559-63


Carson, M. B., R. Langlois and H. Lu (2008). Mining knowledge for the methylation status of CpG islands using alternating decision trees. *Conf Proc IEEE Eng Med Biol Soc*, Vol.2008, (2008), pp. 3787-90

Chalkley, G. E. and C. P. Verrijzer (1999). DNA binding site selection by RNA polymerase II TAFs: a TAF(II)250-TAF(II)150 complex recognizes the initiator. *EMBO J*, Vol.18, No.17, (Sep 1, 1999), pp. 4835-45

Choi, J. K. (2010). Contrasting chromatin organization of CpG islands and exons in the human genome. *Genome Biol*, Vol.11, No.7, (2010), pp. R70

Cooper, D. N., M. Mort, P. D. Stenson, E. V. Ball and N. A. Chuzhanova (2010). Methylation-mediated deamination of 5-methylcytosine appears to give rise to mutations causing human inherited disease in CpNpG trinucleotides, as well as in CpG dinucleotides. *Hum Genomics*, Vol.4, No.6, (Aug 1, 2010), pp. 406-10

Cross, S. H., J. A. Charlton, X. Nan and A. P. Bird (1994). Purification of CpG islands using a methylated DNA binding column. *Nat Genet*, Vol.6, No.3, (Mar, 1994), pp. 236-44

Dai, W., J. M. Teodoridis, J. Graham, C. Zeller, T. H. Huang, P. Yan, J. K. Vass, R. Brown and J. Paul (2008). Methylation Linear Discriminant Analysis (MLDA) for identifying differentially methylated CpG islands. *BMC Bioinformatics*, Vol.9, pp. 337

Daniel, J. M., C. M. Spring, H. C. Crawford, A. B. Reynolds and A. Baig (2002). The p120(ctn)-binding partner Kaiso is a bi-modal DNA-binding protein that recognizes both a sequence-specific consensus and methylated CpG dinucleotides. *Nucleic Acids Res*, Vol.30, No.13, (Jul 1, 2002), pp. 2911-9

Das, R., N. Dimitrova, Z. Xuan, R. A. Rollins, F. Haghighi, J. R. Edwards, J. Ju, T. H. Bestor and M. Q. Zhang (2006). Computational prediction of methylation status in human genomic sequences. *Proc Natl Acad Sci U S A*, Vol.103, No.28, (Jul 11, 2006), pp. 10713-6

Davuluri, R. V., I. Grosse and M. Q. Zhang (2001). Computational identification of promoters and first exons in the human genome. *Nat Genet*, Vol.29, No.4, (Dec, 2001), pp. 412-7

Deobagkar, D. D. and H. S. Chandra (2003). The inactive X chromosome in the human female is enriched in 5-methylcytosine to an unusual degree and appears to contain more of this modified nucleotide than the remainder of the genome. *J Genet*, Vol.82, No.1-2, (Apr-Aug, 2003), pp. 13-6

Dhasarathy, A. and P. A. Wade (2008). The MBD protein family-reading an epigenetic mark? *Mutat Res*, Vol.647, No.1-2, (Dec 1, 2008), pp. 39-43

Dickson, J., H. Gowher, R. Strogantsev, M. Gaszner, A. Hair, G. Felsenfeld and A. G. West (2010). VEZF1 elements mediate protection from DNA methylation. *PLoS Genet*, Vol.6, No.1, (Jan, 2010), pp. e1000804

Down, T. A., V. K. Rakyan, D. J. Turner, P. Flicek, H. Li, E. Kulesha, S. Graf, N. Johnson, J. Herrero, E. M. Tomazou, et al. (2008). A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. *Nat Biotechnol*, Vol.26, No.7, (Jul, 2008), pp. 779-85

Eckhardt, F., J. Lewin, R. Cortese, V. K. Rakyan, J. Attwood, M. Burger, J. Burton, T. V. Cox, R. Davies, T. A. Down, et al. (2006). DNA methylation profiling of human chromosomes 6, 20 and 22. *Nat Genet*, Vol.38, No.12, (Dec, 2006), pp. 1378-85

Egger, G., G. Liang, A. Aparicio and P. A. Jones (2004). Epigenetics in human disease and prospects for epigenetic therapy. *Nature*, Vol.429, No.6990, (May 27, 2004), pp. 457-63

Ehrich, M., J. Turner, P. Gibbs, L. Lipton, M. Giovanneti, C. Cantor and D. van den Boom (2008). Cytosine methylation profiling of cancer cell lines. *Proc Natl Acad Sci U S A*, Vol.105, No.12, (Mar 25, 2008), pp. 4844-9

Ehrlich, M. and R. Y. Wang (1981). 5-Methylcytosine in eukaryotic DNA. *Science*, Vol.212, No.4501, (Jun 19, 1981), pp. 1350-7

Fan, S., F. Fang, X. Zhang and M. Q. Zhang (2007). Putative zinc finger protein binding sites are over-represented in the boundaries of methylation-resistant CpG islands in the human genome. *PLoS One*, Vol.2, No.11, pp. e1184

Fang, F., S. Fan, X. Zhang and M. Q. Zhang (2006). Predicting methylation status of CpG islands in the human brain. *Bioinformatics*, Vol.22, No.18, (Sep 15, 2006), pp. 2204-9

Fatemi, M. and P. A. Wade (2006). MBD family proteins: reading the epigenetic code. *J Cell Sci*, Vol.119, No.Pt 15, (Aug 1, 2006), pp. 3033-7

Feltus, F. A., E. K. Lee, J. F. Costello, C. Plass and P. M. Vertino (2003). Predicting aberrant CpG island methylation. *Proc Natl Acad Sci U S A*, Vol.100, No.21, (Oct 14, 2003), pp. 12253-8

Feltus, F. A., E. K. Lee, J. F. Costello, C. Plass and P. M. Vertino (2006). DNA motifs associated with aberrant CpG island methylation. *Genomics*, Vol.87, No.5, (May, 2006), pp. 572-9

Filippova, G. N., M. K. Cheng, J. M. Moore, J. P. Truong, Y. J. Hu, D. K. Nguyen, K. D. Tsuchiya and C. M. Disteche (2005). Boundaries between chromosomal domains of X inactivation and escape bind CTCF and lack CpG methylation during early development. *Dev Cell*, Vol.8, No.1, (Jan, 2005), pp. 31-42

Frey, F. J. (2005). Methylation of CpG islands: potential relevance for hypertension and kidney diseases. *Nephrol Dial Transplant*, Vol.20, No.5, (May, 2005), pp. 868-9

Frith, M. C., L. G. Wilming, A. Forrest, H. Kawaji, S. L. Tan, C. Wahlestedt, V. B. Bajic, C. Kai, J. Kawai, P. Carninci, et al. (2006). Pseudo-messenger RNA: phantoms of the transcriptome. *PLoS Genet*, Vol.2, No.4, (Apr, 2006), pp. e23

Fritz, E. L. and F. N. Papavasiliou (2010). Cytidine deaminases: AIDing DNA demethylation? *Genes Dev*, Vol.24, No.19, (Oct 1, 2010), pp. 2107-14

Gardiner-Garden, M. and M. Frommer (1987). CpG islands in vertebrate genomes. *J Mol Biol*, Vol.196, No.2, (Jul 20, 1987), pp. 261-82

Gartler, S. M. and A. D. Riggs (1983). Mammalian X-chromosome inactivation. *Annu Rev Genet*, Vol.17, pp. 155-90

Glass, J. L., R. F. Thompson, B. Khulan, M. E. Figueroa, E. N. Olivier, E. J. Oakley, G. Van Zant, E. E. Bouhassira, A. Melnick, A. Golden, et al. (2007). CG dinucleotide clustering is a species-specific property of the genome. *Nucleic Acids Res*, Vol.35, No.20, pp. 6798-807

Graff, J. R., J. G. Herman, S. Myohanen, S. B. Baylin and P. M. Vertino (1997). Mapping patterns of CpG island methylation in normal and neoplastic cells implicates both upstream and downstream regions in de novo methylation. *J Biol Chem*, Vol.272, No.35, (Aug 29, 1997), pp. 22322-9

Hackenberg, M., G. Barturen, P. Carpena, P. L. Luque-Escamilla, C. Previti and J. L. Oliver (2010a). Prediction of CpG-island function: CpG clustering vs. sliding-window methods. *BMC Genomics*, Vol.11, pp. 327

Hackenberg, M., P. Carpena, P. Bernaola-Galvan, G. Barturen, A. M. Alganza and J. L. Oliver (2010b). WordCluster: detecting clusters of DNA words and genomic elements. *Algorithms Mol Biol*, Vol.6, pp. 2


Hackenberg, M., C. Previti, P. L. Luque-Escamilla, P. Carpena, J. Martinez-Aroza and J. L. Oliver (2006). CpGcluster: a distance-based algorithm for CpG-island detection. *BMC Bioinformatics*, Vol.7, pp. 446

Halder, R., K. Halder, P. Sharma, G. Garg, S. Sengupta and S. Chowdhury (2010). Guanine quadruplex DNA structure restricts methylation of CpG dinucleotides genome-wide. *Mol Biosyst*, Vol.6, No.12, (Dec 8, 2010), pp. 2439-47

Han, L. and Z. Zhao (2009). CpG islands or CpG clusters: how to identify functional GC-rich regions in a genome? *BMC Bioinformatics*, Vol.10, pp. 65

Hashimoto, H., J. R. Horton, X. Zhang and X. Cheng (2009). UHRF1, a modular multi-domain protein, regulates replication-coupled crosstalk between DNA methylation and histone modifications. *Epigenetics*, Vol.4, No.1, (Jan, 2009), pp. 8-14

Herrera, L. A., D. Prada, M. A. Andonegui and A. Duenas-Gonzalez (2008). The epigenetic origin of aneuploidy. *Curr Genomics*, Vol.9, No.1, (Mar, 2008), pp. 43-50

Holler, M., G. Westin, J. Jiricny and W. Schaffner (1988). Sp1 transcription factor binds DNA and activates transcription even when the binding site is CpG methylated. *Genes Dev*, Vol.2, No.9, (Sep, 1988), pp. 1127-35

Hug, M., J. Silke, O. Georgiev, S. Rusconi, W. Schaffner and K. Matsuo (1996). Transcriptional repression by methylation: cooperativity between a CpG cluster in the promoter and remote CpG-rich regions. *FEBS Lett*, Vol.379, No.3, (Feb 5, 1996), pp. 251-4

Hutter, B., V. Helms and M. Paulsen (2006). Tandem repeats in the CpG islands of imprinted genes. *Genomics*, Vol.88, No.3, (Sep, 2006), pp. 323-32

Illingworth, R. S., U. Gruenewald-Schneider, S. Webb, A. R. Kerr, K. D. James, D. J. Turner, C. Smith, D. J. Harrison, R. Andrews and A. P. Bird (2010). Orphan CpG islands identify numerous conserved promoters in the mammalian genome. *PLoS Genet*, Vol.6, No.9, (Sep 23, 2010), e1001134

Irizarry, R. A., H. Wu and A. P. Feinberg (2009). A species-generalized probabilistic model-based definition of CpG islands. *Mamm Genome*, Vol.20, No.9-10, (Sep-Oct, 2009), pp. 674-80

Jacquier, A. (2009). The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs. *Nat Rev Genet*, Vol.10, No.12, (Dec, 2009), pp. 833-44

Jones, P. A. and S. B. Baylin (2002). The fundamental role of epigenetic events in cancer. *Nat Rev Genet*, Vol.3, No.6, (Jun, 2002), pp. 415-28

Kimura, H. and K. Shiota (2003). Methyl-CpG-binding protein, MeCP2, is a target molecule for maintenance DNA methyltransferase, Dnmt1. *J Biol Chem*, Vol.278, No.7, (Feb 14, 2003), pp. 4806-12

Klose, R. J., S. A. Sarraf, L. Schmiedeberg, S. M. McDermott, I. Stancheva and A. P. Bird (2005). DNA binding selectivity of MeCP2 due to a requirement for A/T sequences adjacent to methyl-CpG. *Mol Cell*, Vol.19, No.5, (Sep 2, 2005), pp. 667-78

Kutach, A. K. and J. T. Kadonaga (2000). The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. *Mol Cell Biol*, Vol.20, No.13, (Jul, 2000), pp. 4754-64

Lagrange, T., A. N. Kapanidis, H. Tang, D. Reinberg and R. H. Ebright (1998). New core promoter element in RNA polymerase II-dependent transcription: sequence-specific DNA binding by transcription factor IIB. *Genes Dev*, Vol.12, No.1, (Jan 1, 1998), pp. 34-44

Laird, P. W. (2003). The power and the promise of DNA methylation markers. *Nat Rev Cancer*, Vol.3, No.4, (Apr, 2003), pp. 253-66

Li, E., C. Beard and R. Jaenisch (1993). Role for DNA methylation in genomic imprinting. *Nature*, Vol.366, No.6453, (Nov 25, 1993), pp. 362-5

Lin, I. G., T. J. Tomzynski, Q. Ou and C. L. Hsieh (2000). Modulation of DNA binding protein affinity directly affects target site demethylation. *Mol Cell Biol*, Vol.20, No.7, (Apr, 2000), pp. 2343-9

Lister, R., M. Pelizzola, R. H. Dowen, R. D. Hawkins, G. Hon, J. Tonti-Filippini, J. R. Nery, L. Lee, Z. Ye, Q. M. Ngo, et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. *Nature*, Vol.462, No.7271, (Nov 19, 2009), pp. 315-22

Liu, L., Y. Li and T. O. Tollefsbol (2008). Gene-environment interactions and epigenetic basis of human diseases. *Curr Issues Mol Biol*, Vol.10, No.1-2, pp. 25-36

Lopes, S., A. Lewis, P. Hajkova, W. Dean, J. Oswald, T. Forne, A. Murrell, M. Constancia, M. Bartolomei, J. Walter, et al. (2003). Epigenetic modifications in an imprinting cluster are controlled by a hierarchy of DMRs suggesting long-range chromatin interactions. *Hum Mol Genet*, Vol.12, No.3, (Feb 1, 2003), pp. 295-305

Macleod, D., J. Charlton, J. Mullins and A. P. Bird (1994). Sp1 sites in the mouse aprt gene promoter are required to prevent methylation of the CpG island. *Genes Dev*, Vol.8, No.19, (Oct 1, 1994), pp. 2282-92

Mardis, E. R. (2007). ChIP-seq: welcome to the new frontier. *Nat Methods*, Vol.4, No.8, (Aug, 2007), pp. 613-4

Medvedeva, Y. A., M. V. Fridman, N. J. Oparina, D. B. Malko, E. O. Ermakova, I. V. Kulakovskiy, A. Heinzel and V. J. Makeev (2010). Intergenic, gene terminal, and intragenic CpG islands in the human genome. *BMC Genomics*, Vol.11, No.1, (Jan 19, 2010), pp. 48

Naumann, A., N. Hochstein, S. Weber, E. Fanning and W. Doerfler (2009). A distinct DNA-methylation boundary in the 5'- upstream sequence of the FMR1 promoter binds nuclear proteins and is lost in fragile X syndrome. *Am J Hum Genet*, Vol.85, No.5, (Nov, 2009), pp. 606-16

Ng, H. H., Y. Zhang, B. Hendrich, C. A. Johnson, B. M. Turner, H. Erdjument-Bromage, P. Tempst, D. Reinberg and A. Bird (1999). MBD2 is a transcriptional repressor belonging to the MeCP1 histone deacetylase complex. *Nat Genet*, Vol.23, No.1, (Sep, 1999), pp. 58-61

Oakes, C. C., S. La Salle, D. J. Smiraglia, B. Robaire and J. M. Trasler (2007). A unique configuration of genome-wide DNA methylation patterns in the testis. *Proc Natl Acad Sci U S A*, Vol.104, No.1, (Jan 2, 2007), pp. 228-33

Okada, Y., K. Yamagata, K. Hong, T. Wakayama and Y. Zhang (2010). A role for the elongator complex in zygotic paternal genome demethylation. *Nature*, Vol.463, No.7280, (Jan 28, 2010), pp. 554-8

Phi-van, L. and W. H. Stratling (1999). An origin of bidirectional DNA replication is located within a CpG island at the 3' end of the chicken lysozyme gene. *Nucleic Acids Res*, Vol.27, No.15, (Aug 1, 1999), pp. 3009-17

Polak, P., R. Querfurth and P. F. Arndt (2010). The evolution of transcription-associated biases of mutations across vertebrates. *BMC Evol Biol*, Vol.10, pp. 187

Ponger, L., L. Duret and D. Mouchiroud (2001). Determinants of CpG islands: expression in early embryo and isochore structure. *Genome Res*, Vol.11, No.11, (Nov, 2001), pp. 1854-60

Ponger, L. and D. Mouchiroud (2002). CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. *Bioinformatics*, Vol.18, No.4, (Apr, 2002), pp. 631-3


Algorithms for CpG Islands Search: New Advantages and Old Problems 471

Shen, L., Y. Kondo, Y. Guo, J. Zhang, L. Zhang, S. Ahmed, J. Shu, X. Chen, R. A. Waterland

Shiraishi, M., A. Sekiguchi, M. J. Terry, A. J. Oates, Y. Miyamoto, Y. H. Chuu, M. Munakata

Smale, S. T. (1997). Transcription initiation from TATA-less promoters within eukaryotic

Smilinich, N. J., C. D. Day, G. V. Fitzpatrick, G. M. Caldwell, A. C. Lossie, P. R. Cooper, A. C.

Straussman, R., D. Nejman, D. Roberts, I. Steinfeld, B. Blum, N. Benvenisty, I. Simon, Z.

Su, J., Y. Zhang, J. Lv, H. Liu, X. Tang, F. Wang, Y. Qi, Y. Feng and X. Li (2009). CpG\_MI: a

Takada, S., M. Tevendale, J. Baker, P. Georgiades, E. Campbell, T. Freeman, M. H. Johnson,

Takai, D. and P. A. Jones (2002). Comprehensive analysis of CpG islands in human

Takai, D. and P. A. Jones (2003). The CpG island searcher: a new WWW resource. *In Silico* 

Thomson, J. P., P. J. Skene, J. Selfridge, T. Clouaire, J. Guy, S. Webb, A. R. Kerr, A. Deaton, R.

CpG-binding protein Cfp1. *Nature*, Vol.464, No.7291, (Apr 15), pp. 1082-6 Tomatsu, S., K. O. Orii, M. R. Islam, G. N. Shah, J. H. Grubb, K. Sukegawa, Y. Suzuki, T.

Ullu, E. and C. Tschudi (1984). Alu sequences are processed 7SL RNA genes. *Nature*,

Ushijima, T., N. Watanabe, E. Okochi, A. Kaneda, T. Sugimura and K. Miyamoto (2003).

van Roy, F. M. and P. D. McCrea (2005). A role for Kaiso-p120ctn complexes in cancer? *Nat* 

*Oncogene*, Vol.21, No.23, (May 23, 2002), pp. 3804-13

*Acad Sci U S A*, Vol.96, No.14, (Jul 6, 1999), pp. 8064-9

*Nucleic Acids Res*, Vol.38, No.1, (Jan, 2009), pp. e6

12. *Curr Biol*, Vol.10, No.18, (Sep 21, 2000), pp. 1135-8

Vol.312, No.5990, (Nov 8-14, 1984), pp. 171-2

*Rev Cancer*, Vol.5, No.12, (Dec, 2005), pp. 956-64

Vol.13, No.5, (May, 2003), pp. 868-74

pp. 2023-36

2009), pp. 564-71

3740-5

pp. 363-75

*Biol*, Vol.3, No.3, pp. 235-40

88

and J. P. Issa (2007). Genome-wide profiling of DNA methylation reveals a class of normally methylated CpG island promoters. *PLoS Genet*, Vol.3, No.10, (Oct, 2007),

and T. Sekiya (2002). A comprehensive catalog of CpG islands methylated in human lung adenocarcinomas for the identification of tumor suppressor genes.

protein-coding genes. *Biochim Biophys Acta*, Vol.1351, No.1-2, (Mar 20, 1997), pp. 73-

Smallwood, J. A. Joyce, P. N. Schofield, W. Reik, et al. (1999). A maternally methylated CpG island in KvLQT1 is associated with an antisense paternal transcript and loss of imprinting in Beckwith-Wiedemann syndrome. *Proc Natl* 

Yakhini and H. Cedar (2009). Developmental programming of CpG island methylation profiles in the human genome. *Nat Struct Mol Biol*, Vol.16, No.5, (May,

novel approach for identifying functional CpG islands in mammalian genomes.

M. Paulsen and A. C. Ferguson-Smith (2000). Delta-like and gtl2 are reciprocally expressed, differentially methylated linked imprinted genes on mouse chromosome

chromosomes 21 and 22. *Proc Natl Acad Sci U S A*, Vol.99, No.6, (Mar 19, 2002), pp.

Andrews, K. D. James, et al. CpG islands influence chromatin structure via the

Orii, N. Kondo and W. S. Sly (2002). Methylation patterns of the human betaglucuronidase gene locus: boundaries of methylation and general implications for frequent point mutations at CpG dinucleotides. *Genomics*, Vol.79, No.3, (Mar, 2002),

Fidelity of the methylation pattern and its variation in the genome. *Genome Res*,


Previti, C., O. Harari, I. Zwir and C. del Val (2009). Profile analysis and prediction of tissuespecific CpG island methylation classes. *BMC Bioinformatics*, Vol.10, pp. 116 Rakyan, V. K., T. A. Down, N. P. Thorne, P. Flicek, E. Kulesha, S. Graf, E. M. Tomazou, L.

Recillas-Targa, F., I. A. De La Rosa-Velazquez, E. Soto-Reyes and L. Benitez-Bribiesca (2006).

Rein, T., T. Kobayashi, M. Malott, M. Leffak and M. L. DePamphilis (1999). DNA

Rein, T., H. Zorbas and M. L. DePamphilis (1997). Active mammalian replication origins are

Rice, P., I. Longden and A. Bleasby (2000). EMBOSS: the European Molecular Biology Open

Richardson, B. (2007). Primer: epigenetics of autoimmunity. *Nat Clin Pract Rheumatol*, Vol.3,

Rishi, V., P. Bhattacharya, R. Chatterjee, J. Rozenberg, J. Zhao, K. Glass, P. Fitzgerald and C.

Robinson, P. N., U. Bohme, R. Lopez, S. Mundlos and P. Nurnberg (2004). Gene-Ontology

Rozenberg, J. M., A. Shlyakhtenko, K. Glass, V. Rishi, M. V. Myakishev, P. C. FitzGerald and

Saito, M. and F. Ishikawa (2002). The mCpG-binding domain of human MBD3 does not bind

Sasai, N., M. Nakao and P. A. Defossez (2010). Sequence-specific recognition of methylated

Saxonov, S., P. Berg and D. L. Brutlag (2006). A genome-wide analysis of CpG dinucleotides

Schubeler, D., M. C. Lorincz and M. Groudine (2001). Targeting silence: the use of site-

Segal, M. R. (2006). Validation in genomics: CpG island methylation revisited. *Stat Appl* 

Software Suite. *Trends Genet*, Vol.16, No.6, (Jun, 2000), pp. 276-7

No.4470, (Nov 7, 1980), pp. 604-10

68

1969-78

No.1, pp. 67

pp. 5015-22

1999), pp. 25792-800

No.1, (Jan, 1997), pp. 416-26

No.9, (Sep, 2007), pp. 521-7

Vol.107, No.47, (Nov 23, 2010), pp. 20311-6

*Chem*, Vol.277, No.38, (Sep 20, 2002), pp. 35434-9

*Acad Sci U S A*, Vol.103, No.5, (Jan 31, 2006), pp. 1412-7

*STKE*, Vol.2001, No.83, (May 22, 2001), pp. pl1

*Genet Mol Biol*, Vol.5, Article29

Backdahl, N. Johnson, M. Herberth, et al. (2008). An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). *Genome Res*, Vol.18, No.9, (Sep, 2008), pp. 1518-29 Razin, A. and A. D. Riggs (1980). DNA methylation and gene function. *Science*, Vol.210,

Epigenetic boundaries of tumour suppressor gene promoters: the CTCF connection and its role in carcinogenesis. *J Cell Mol Med*, Vol.10, No.3, (Jul-Sep, 2006), pp. 554-

methylation at mammalian replication origins. *J Biol Chem*, Vol.274, No.36, (Sep 3,

associated with a high-density cluster of mCpG dinucleotides. *Mol Cell Biol*, Vol.17,

Vinson (2010). CpG methylation of half-CRE sequences creates C/EBPalpha binding sites that activate some tissue-specific genes. *Proc Natl Acad Sci U S A*,

analysis reveals association of tissue-specific 5' CpG-island genes with development and embryogenesis. *Hum Mol Genet*, Vol.13, No.17, (Sep 1, 2004), pp.

C. Vinson (2008). All and only CpG containing sequences are enriched in promoters abundantly bound by RNA polymerase II in multiple tissues. *BMC Genomics*, Vol.9,

to mCpG but interacts with NuRD/Mi2 components HDAC1 and MTA2. *J Biol* 

DNA by human zinc-finger proteins. *Nucleic Acids Res*, Vol.38, No.15, (Aug, 2010),

in the human genome distinguishes two distinct classes of promoters. *Proc Natl* 

specific recombination to introduce in vitro methylated DNA into the genome. *Sci* 


**22** 


### **Translational Oncogenomics and Human Cancer Interactomics: Advanced Techniques and Complex System Dynamic Approaches**

I. C. Baianu

*AFC-NMR & NIR Microspectroscopy Facility, College of ACES, FSHN & NPRE Departments, University of Illinois at Urbana, Urbana, IL USA* 

#### **1. Introduction**

An overview of translational human oncogenomics, transcriptomics and cancer interactomic networks is presented together with basic concepts and potential new applications to Oncology and Integrative Cancer Biology. Novel translational oncogenomics research is rapidly expanding through the application of advanced technology, research findings and computational tools/models to both pharmaceutical and clinical problems. A self-contained presentation is adopted that covers both fundamental concepts and the most recent biomedical, as well as clinical, applications. Sample analyses in recent clinical studies have shown that gene expression data can be employed to distinguish between tumor types as well as to predict outcomes. Potentially important applications of such results are *individualized* human cancer therapies or, in general, 'personalized medicine'. Several cancer detection techniques are currently under development both in the direction of improved detection sensitivity and increased time resolution of cellular events, with the limits of single molecule detection and picosecond time resolution already reached. The urgency for the complete mapping of a human cancer interactome with the help of such novel, high-efficiency, low-cost and ultra-sensitive techniques is also pointed out.

#### **1.1 Current status in translational genomics and interactome networks**

Upon completion of the maps for several genomes, including the human genome, several major post-genomic tasks lie ahead, such as the translation of the mapped genomes, the correct interpretation of the huge amounts of data that are being rapidly generated, and the important task of applying these fundamental results to derive major benefits in various medical and agricultural biotechnology areas. Translational genomics is at the center of these tasks, which run from *transcription* through *translation* to *proteomics* and *interactomics*. The *transcriptome* is defined as the set of all 'transcripts' or messenger RNA (mRNA) molecules produced through transcription from DNA sequences by a single cell or a cell population. This concept is also extended to a multi-cellular organism as the set of all its transcripts. The transcriptome thus reflects the active part of the genome at a given instant of time. *Transcriptomics* involves the determination of mRNA expression levels in a selected cell population. For example, an improved understanding of cell differentiation involves the determination of the stem cell transcriptome; understanding carcinogenesis requires the comparison between the transcriptomes of cancer cells and untransformed ('normal') cells. However, because the levels of mRNA are not directly proportional to the expression levels of the proteins they encode, the protein complement of a cell or a multi-cellular organism needs to be determined by other techniques, or a combination of techniques; the complete protein complement of a cell or organism is defined as the *proteome*. When the network (or networks) of complex protein-protein interactions (PPIs) in a cell or organism is reconstructed, the result is called an *interactome*. This complete network of PPIs is now thought to form the 'backbone' of the signaling pathways, metabolic pathways and cellular processes that are required for all key cell functions and, therefore, cell survival. Such complete knowledge of cellular pathways and processes is essential for understanding how many diseases, such as cancer (and also ageing), originate and progress through mutation or alteration of individual pathway components. Furthermore, determining the interactomes of human cancer cells from therapy-resistant tumors is expected to allow for rational clinical trials and to help save patients' lives through individualized cancer therapy. Since the global gene expression studies of DeRisi *et al* in 1997, translational genomics has been advancing very rapidly through the parallel detection of mRNA levels for large numbers of molecules, as well as through progress made with miniaturization and high-density synthesis of nucleic acids on microarray solid supports. Gene expression studies with microarrays permit an integrated approach to biology in terms of network biodynamics, signaling pathways, protein-protein interactions and, ultimately, the cell interactome.
An important emerging principle of gene expression is the *temporally coordinated regulation* of genes as an extremely efficient mechanism (Wen *et al* 1998) required for complex processes in which all the components of multi-subunit complexes must be available in defined ratios at the same time, whenever such complexes are needed by the cell. The gene expression profile can be thought of either as a 'signature' or 'fingerprint', or as *a molecular definition of the cell in a specified state* (Young, 2000). Cellular phenotypes can then be inferred from such gene expression profiles. Several projects have succeeded in profiling a large number of biological samples and then using pattern matching to predict the function of either new drug targets or previously uncharacterized genes; this '*compendium approach*' has been demonstrated in yeast (Gray *et al* 1998; Hughes *et al* 2000), and has also been applied in databases integrating gene expression data from pharmacologically characterized human cancer cell lines (NCI60, http://dtp.nci.nih.gov), and to classify cell lines in relation to their tissue of origin and predict their drug resistance or chemosensitivity (Weinstein *et al* 1997; Ross *et al* 2000; Staunton *et al* 2001). Furthermore, sample analyses in clinical studies have shown that gene expression data can be employed to distinguish between tumor types as well as to predict outcomes (Golub *et al* 1999; Bittner *et al* 2000; Shipp *et al* 2002; Furteal *et al* 2004). The latter approach seems to lead to important applications such as individualized cancer therapy and 'personalized medicine'.
On the other hand, such approaches are complemented by studies of protein-protein interactions in the area called *proteomics*, preferably under physiological conditions, or more generally still, in *cell interactomics*. Several technologies in this area are still developing, both in the direction of improved detection sensitivity and in that of time resolution of cellular events, with the limits of single molecule detection and picosecond time resolution already attained. In order to enable the development of new applications, such techniques will be briefly described in the next section, together with relevant examples of their recent applications.
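
The pattern matching behind the 'compendium approach' described above can be illustrated with a minimal sketch. It assumes that each reference profile is a list of log-ratio expression values over the same genes and that similarity is scored by Pearson correlation; the profile labels, the values and the function names `pearson` and `best_match` are hypothetical and are not taken from the cited studies, which used larger data sets and more elaborate statistics.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_match(query, compendium):
    """Return (label, correlation) of the reference profile most similar to the query."""
    scores = {label: pearson(query, profile) for label, profile in compendium.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

# Hypothetical log-ratio profiles measured over the same five genes.
compendium = {
    "treatment_A": [2.0, 1.5, -0.5, -1.0, 0.1],
    "treatment_B": [-1.2, -0.8, 1.9, 2.1, 0.0],
    "treatment_C": [0.1, -0.2, 0.0, 0.3, 2.5],
}
query = [1.8, 1.2, -0.7, -0.9, 0.3]  # profile of an uncharacterized sample

label, r = best_match(query, compendium)
print(label, round(r, 3))
```

In this toy example the uncharacterized sample is assigned to the reference profile it correlates with most strongly; a real analysis would also need to assess whether the best correlation is statistically meaningful.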

#### **1.2 Basic concepts in transcription, translation and interactome networks**

Protein synthesis, when analyzed in terms of bionetwork dynamics as a channel of information, operates through the formation of the amino acid sequences of polypeptides via *translation* of the corresponding polynucleotide sequences of (usually single-stranded, **messenger**) ribonucleic acid, that is:

**DNA** (**gene**) → *transcription* → **mRNA** → *translation* **into a** polypeptide's amino acid sequence → **protein** (quaternary) *assembly* from polypeptide subunits.

Although not shown in this scheme, several key enzymes make such processes both efficient and precise through highly selective catalysis; moreover, the protein assembly involves both specific enzymes and ribosome 'assembly lines'. Furthermore, such processes are compartmentalized in mammalian cells by selective intracellular membranes; this also seems to be important for cell cycling and the control of cell division. On the other hand, *reverse transcription*, **RNA → DNA**, does also occur (under certain conditions), catalyzed by reverse transcriptases; one example, telomerase, contains both polypeptide chains and an RNA (template) strand. If error free, the first of these two sequences of processes, which are of fundamental biological importance, generates true replicas of the information contained in the *sense codons* of the genes, which are transcribed into the corresponding mRNA *codons*. Recall also that DNA stores information in the nucleotide bases A (Adenine), C (Cytosine), G (Guanine) and T (Thymine), and that a triplet of such nucleotides in the DNA sequence is called a *codon*, which unambiguously encodes the information necessary to specify a single amino acid. Moreover, the genetic code is quasi-universal, and information can also flow by 'reverse transcription' from certain types of RNA back into DNA. Notably also, not all nucleotide or codon sequences present in the genome (DNA) are transcribed *in vivo*; typically only a small percentage is transcribed. The transcribed (mRNA) sequences form what is naturally called the *transcriptome*; the protein-encoded version of the transcriptome is called the *proteome*, and upon including all protein-protein interactions for various cellular states one obtains the (global) *interactome* network.
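
The flow of information from a sense DNA strand through mRNA to a polypeptide can be sketched in a few lines of code. The sketch below is only an illustration under simplifying assumptions: the codon table is a small subset of the standard genetic code, the input sequence is made up, and the function names `transcribe` and `translate` are hypothetical; enzymes, splicing and reading-frame selection in real cells are not modeled.

```python
# Partial codon table (subset of the standard genetic code), for illustration only.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "GCU": "A", "UGG": "W",
    "AAA": "K", "GAA": "E",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(sense_dna):
    """The mRNA carries the same sequence as the sense DNA strand, with U in place of T."""
    return sense_dna.upper().replace("T", "U")

def translate(mrna):
    """Read successive triplets from the first AUG until a stop codon is met."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X: codon not in the partial table
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

sense = "ccATGTTTGGCGCTTGGAAATAAgg"  # made-up sense-strand fragment
mrna = transcribe(sense)
print(mrna)             # CCAUGUUUGGCGCUUGGAAAUAAGG
print(translate(mrna))  # MFGAWK
```

Each triplet maps to exactly one amino acid, which is the sense in which a codon specifies its amino acid unambiguously.
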
More generally, biological interactive networks, as a class of complex bionetworks, consist of local cellular communities (or '*organismic sets*') that are organized and managed by their characteristic selection procedures. Thus, in any partitioning of the organismal, or cell, structure, it is often necessary to regulate the *local* properties of the organism rather than the *global* mechanism, which explains an organism's need for specialized, 'modular constructions'. Such a modular, complex systems biology approach to modeling signaling pathways and modifications of cell-cycling regulatory mechanisms in cancer cells was recently reported (Baianu, 2004); several consequences of this approach were also considered for the proteome and interactome networks in a 'prototype' cancer cell model (Prisecaru and Baianu, 2005). Note, on the other hand, that the living cell also seems to contain certain proteins and enzymes that are involved in *global* intra-cellular interactions, which are thought to be essential to cell survival and to the cell's flexible adaptation to stresses or challenges. Recent modeling techniques draw from a variety of mathematical sources, such as topology (including graph theory), biostatistics, stochastic differential equations, Boolean networks, and qualitative system dynamics (Baianu, 1971a; de Jong *et al* 2000, 2003, 2004). Non-Boolean network models of genetic networks and the interactome were also developed and compared with the results of Boolean ones (Baianu, 1977, 1984, 1987; Georgescu, 2006; Baianu, 2005; Baianu *et al* 2006). The traditional use of comparatively rigid Boolean networks (reviewed extensively, for example, in Baianu, 1987) can thus be extended through flexible, multi-valued (non-Boolean) logic algebra bionetworks with complex, *nonlinear dynamic* behaviors that mimic complex systems biology (Rosen, 2000).
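
The contrast between Boolean and multi-valued network models can be made concrete with a toy example. The sketch below is not the formalism of the cited works; it assumes a hypothetical three-gene circuit with synchronous updates, and it extends the Boolean rules to more than two expression levels by one common convention (AND as minimum, OR as maximum, NOT x as the top level minus x).

```python
def step(state, levels=2):
    """One synchronous update of a hypothetical 3-gene network.

    Each gene takes integer values 0..levels-1. With levels=2 this is an
    ordinary Boolean network; with levels>2 the same rules are read in a
    multi-valued logic (AND=min, OR=max, NOT x = (levels-1) - x).
    """
    top = levels - 1
    a, b, c = state
    new_a = top - c            # gene C represses gene A
    new_b = min(a, top - c)    # gene B needs A and the absence of C
    new_c = max(a, b)          # gene C is activated by A or B
    return (new_a, new_b, new_c)

def trajectory(state, levels=2, max_steps=50):
    """Iterate until a state repeats; return (transient states, attractor cycle)."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state, levels)
    if state in seen:
        k = seen.index(state)
        return seen[:k], seen[k:]
    return seen, []

# Boolean version: the circuit has no steady state and cycles through four states.
print(trajectory((1, 0, 0), levels=2))
# Three-level version: an intermediate expression level is a steady state.
print(trajectory((1, 1, 1), levels=3))
```

With two levels this particular circuit can only oscillate, whereas with three levels it also admits a steady state at intermediate expression, a simple instance of behavior that a multi-valued model can represent and a Boolean one cannot.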

The results obtained with such non-random genetic network models have several important consequences for understanding the operation of cellular networks and the formation, transformation and growth of neoplastic network structures. Non-Boolean models can also be extended to include the *epigenetic* controls discussed in **Section 6**, as well as to mimic the coupling of the genome to the rest of the cell through specific signaling pathways that are involved in the modulation of both translation and transcription control processes. The latter may also provide novel approaches to cancer studies and, indeed, to developing 'individualized' cancer therapy strategies and novel anti-cancer medicines targeted at the specific signaling pathways involved in the resistance of malignant tumors to other therapies.
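
To make the notion of a Boolean genetic network concrete, the following minimal Python sketch simulates a three-gene network with synchronous updates and finds the attractor reached from a given initial state. The wiring (A repressed by C, B activated by A, C requiring both A and B) is a hypothetical toy example and is not taken from any of the cited models.

```python
def step(state):
    """Synchronously update a hypothetical three-gene Boolean network."""
    a, b, c = state
    return (
        int(not c),    # gene A is ON only when its repressor C is OFF
        a,             # gene B copies the previous state of its activator A
        int(a and b),  # gene C is ON only when both A and B were ON
    )


def attractor(state):
    """Iterate from `state` until a state repeats; return the cycle that is reached."""
    seen = []
    # The state space is finite (8 states), so a repeat is guaranteed.
    while state not in seen:
        seen.append(state)
        state = step(state)
    # `state` is now the first repeated state; the cycle starts where it was first seen.
    return seen[seen.index(state):]


cycle = attractor((0, 0, 0))
```

In this toy network every initial state ends up on the same five-state cycle. A multi-valued (non-Boolean) variant would let each gene take more than two levels and replace the logical rules in `step` with rules over those levels.
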

#### **2. Techniques and application examples**

#### **2.1 DNA microarrays**

DNA microarray technology is widely employed to monitor, in a single experiment, the expression levels of all genes of a cell or an organism. This includes the identification of genes that are expressed in different cell types as well as the changes in gene expression levels caused, for example, by differentiation or disease. The terabytes of data thus obtained can provide valuable clues about the interactions among genes and also about the interaction networks of gene products. It has been reported that cDNA arrays were pioneered by the Brown Laboratory at Stanford University (Brown and Botstein, 1999; URL: http://cmgm.stanford.edu/pbrown/mguide/index.html). Several quantitative and high-density DNA array applications were then reported in rapid succession (Schena et al 1995; Chee et al 1996; Brown and Botstein, 1999). Such microarrays are generated by automatically printing double-stranded cDNA onto a solid support that may be glass, silicon or nylon. The essential technologies involved are robotics and the development/selection of sequence-verified and array-formatted cDNA clones. The latter ensures that both the location and the identity of each cDNA on the array are known. Sequence-verified and array-formatted cDNA clone sets are now available from companies such as Incyte Genomics (Palo Alto, CA; URL: http://www.synteni.com/) and Research Genetics (Huntsville, AL; URL: http://www.resgen.com/). In cDNA-based gene expression profiling experiments, the total RNA is extracted from the selected experimental samples and is fluorescently labeled with either Cy3- or Cy5-dUTP in a single round of reverse transcription. These dyes have several advantages: they are readily incorporated into cDNA by reverse transcription, they exhibit widely separated excitation and emission spectra, and they possess good photostability. Such fluorescently labeled cDNA probes are then hybridized to a single array through a competitive hybridization reaction.
Detection of hybridized probes is achieved by laser excitation of the individual fluorescent markers, followed by scanning with a confocal scanning laser microscope. The raw data obtained with laser scanning systems are represented as a normalized Cy3:Cy5 ratio and automatically color coded; thus, red is conventionally selected to represent those genes that are transcriptionally upregulated in the test versus the reference sample, whereas green represents genes that are downregulated; those genes that exhibit no difference between test and reference samples are shown in yellow. The analysis of the gene expression data obtained by such high-throughput microarray technology is quite complex and requires advanced computational/bioinformatics tools, as already discussed in **Section 1.2**. Other aspects related to interactomics are discussed in **Section 3**. An alternative technology to cDNA microarrays is discussed in the next section.
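
The normalization and color coding just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot intensities are invented, the normalization is a simple global median scaling, and the two-fold threshold for calling a gene up- or downregulated is an arbitrary choice, not a value given in the text.

```python
import math
import statistics

# Hypothetical background-corrected intensities: gene -> (test, reference).
spots = {
    "geneA": (8000.0, 1000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1500.0, 1500.0),
    "geneD": (1200.0, 1000.0),
    "geneE": (900.0, 1000.0),
}


def normalized_log_ratios(spots):
    """Return log2(test/reference) per gene after global median normalization."""
    raw = {gene: test / ref for gene, (test, ref) in spots.items()}
    # Assumption: most genes are unchanged, so the median ratio should be 1.
    scale = statistics.median(raw.values())
    return {gene: math.log2(ratio / scale) for gene, ratio in raw.items()}


def color_code(log_ratio, threshold=1.0):
    """Map a log2 ratio to the conventional display color (threshold = 2-fold)."""
    if log_ratio >= threshold:
        return "red"      # upregulated in test versus reference
    if log_ratio <= -threshold:
        return "green"    # downregulated in test versus reference
    return "yellow"       # no appreciable difference


log_ratios = normalized_log_ratios(spots)
colors = {gene: color_code(value) for gene, value in log_ratios.items()}
```

Working on the log scale makes a two-fold increase and a two-fold decrease symmetric about zero, which is why a single threshold suffices for both directions.
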

#### **2.2 Oligonucleotide arrays**

By combining oligonucleotide synthesis with photolithography it was possible to synthesize specific oligonucleotides with a selected orientation onto the solid surface of glass or silicon chips (Lockhart *et al* 1996; Wodicka *et al* 1997), thus forming oligonucleotide arrays. Expression monitoring was then carried out by hybridization to such high-density oligonucleotide arrays (Lockhart *et al* 1996; Wodicka *et al* 1997). Commercially available oligonucleotide array products cover human, mouse and several other organisms. Each gene included on the oligonucleotide array is represented by up to 20 different oligonucleotides that span the entire length of the coding region of that gene. To reduce the rate of false positives substantially, each of these oligonucleotides is paired with a second, mismatch oligonucleotide in which the central base in the sequence has been replaced by a different base. As in the cDNA approach, fluorescently labeled probes are generated from test and reference samples in order to carry out comparative gene expression profiling. After cDNA amplification, the differential fluorescent signal is detected with a laser scanning system and provides a map of the alterations in the transcriptional profile between the test and reference samples that are being compared. Dynamic analysis and further sophistication are added to such oligonucleotide array capabilities by the techniques briefly discussed in **Section 2.6.** The molecular classification of cancers is of immediate importance to both cancer diagnosis and therapy. Tumors with similar histologic appearance quite often have markedly different clinical responses to therapy. Such variability is a reflection of the underlying cell line and molecular heterogeneity of almost any tumor. Gene expression profiling has been successfully employed for the molecular classification of cancers.
It would seem from available data that each patient has her/his own molecular identity signature or fingerprint (Mohr et al 2002). Thus, Ross *et al*. (2000) reported the gene expression analysis of 60 cancer cell lines utilized in the Developmental Therapeutics Program by the National Cancer Institute (NCI) at NIH (Bethesda, MD, USA); the report also stated that cell lines could be grouped together according to organ type and that specific expression profiles corresponded to *clusters of genes*. Similar findings were reported for ovarian and breast cancers; in the latter case, Perou *et al*. (2000) reported that specific epithelial cell line genes clustered together and are relevant to the subdivision of breast cancer into the basal-like and luminal groups. On the other hand, the eventual use of microarray technologies for clinical applications will involve the utilization of proteome and tissue arrays in addition to gene expression profiling by cDNA microarrays and oligonucleotide arrays. Thus, tissue markers revealed unexpected relationships, as in the case of gene expression analysis of small-cell lung carcinoma, pulmonary carcinoid tissue and bronchial epithelial tissue culture (Anbazhagan *et al* 1999). Because a single biomarker has serious limitations for clinical applications, there is a need for a battery of disease biomarkers that would provide a much more accurate classification of cancers. High-density screening with microarray technologies is therefore valuable in pharmacogenomic (individualized therapy), toxicogenomic, as well as clinical diagnostic investigations.
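
The pairing of each oligonucleotide with a mismatch control, described at the start of this section, suggests a simple way to summarize a gene's signal: average the difference between perfect-match (PM) and mismatch (MM) intensities over all probe pairs. The Python sketch below shows this averaging only as an illustration; the intensity values and function name are hypothetical, and this is not claimed to be the algorithm used by any commercial array software.

```python
def average_difference(pm, mm):
    """Mean of (perfect-match - mismatch) intensity over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal, non-empty lists of PM and MM intensities")
    differences = [p - m for p, m in zip(pm, mm)]
    return sum(differences) / len(differences)


# Hypothetical intensities for four probe pairs of a single gene.
pm_intensities = [1200.0, 950.0, 1100.0, 400.0]
mm_intensities = [300.0, 350.0, 200.0, 380.0]

signal = average_difference(pm_intensities, mm_intensities)
```

Subtracting the mismatch signal removes much of the non-specific hybridization common to both probes of a pair, which is how the pairing helps to reduce false positives.
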

#### **2.3 Proteome arrays**

In a manner similar to the transcriptome, the proteome undergoes both qualitative and quantitative changes during pathogenesis, and this is also true in carcinogenesis. Proteome array-based methodologies involve either proteins or protein-binding particles (DNA, RNA, antibodies, or other ligands). Utilizing such proteome arrays one can study either differential protein expression profiling or protein-ligand interaction screening under specified, or selected, physiopathological conditions. According to Kodadek (2001), these two classes of practical applications of proteome arrays are termed protein function and protein-detecting arrays. A protein-detecting array may consist of an arrayed set of protein ligands that are employed to profile gene expression and therefore make visible the 'proteosignatures' characterizing a selected cellular state or phase. In view of the potential clinical importance of a proteomic survey of cancers, the 'hunt' is now on for such proteosignatures of cancer cells, but the amount of data reported to date is still quite limited. Already, the coupling of proteome arrays with high-resolution chromatography techniques followed by mass spectrometry has provided powerful analytical tools with which one can profile protein expression in cancer cells. For example, a ProteinChip™ (Ciphergen Inc, Fremont, CA, USA) was successfully utilized to investigate the proteome of prostate, ovarian, and head and neck cancer cells (von Eggeling et al 2000). Such methods identified protein fingerprints from which cancer biomarkers can also be obtained. A reverse proteome array was also reported in which many extracted proteins from a patient sample are 'printed' onto a flat, solid support (Paweletz et al 2001); this reverse system was then utilized to carry out a biochemical screening investigation of the signaling pathways in prostate cancer.
Through such investigations it was found that carcinoma progression was positively correlated with the phosphorylation state of Akt and negatively correlated with ERK pathways; furthermore, carcinoma progression was positively correlated with the suppression of the apoptotic pathways, a finding which is consistent with the more detailed, recent reports on cyclin CDK2 and the transcriptional factors affected by CDK2 that will be discussed in **Section 4.** Immunophenotyping of leukemias with antibody microarrays was also reported (Belov, de la Vega, dos Remedios, et al 2001), and does provide an increased antigen differentiation (CD) in leukemia processing.

#### **2.4 Tissue arrays**

The logical step after the identification of potential cancer markers through genomic and/or proteomic array analysis is the evaluation of such cancer markers by tissue arrays (tissue chips) for diagnostic, prognostic, toxicogenomic and therapeutic relevance. Such tissue microarrays (TMAs) are often designed to contain up to 1000 tissue sections, each 5 μm thick, usually chemically fixed and arrayed upon a glass slide. TMAs allow large-scale screening of tissue specimens and can be utilized, for example, for the pathological evaluation of irreversible molecular changes that are important for cancer research and treatment. Therefore, they can speed up the process of translating experimental, or fundamental, discoveries into clinical practice and improved cancer treatments.

TMAs have been utilized in cancer research in conjunction with **fluorescence** *in situ* **hybridization** (FISH) to analyze in parallel the gene amplification in multiple tissue sections, thus allowing researchers to map the distribution of gene amplification throughout an entire tumor. This also allowed the monitoring of changes in gene amplification during cancer progression (Bubendorf et al 1999). Furthermore, by immunohistochemical staining of tissue arrays it was possible to measure the protein levels in tumor specimens. Thus, topoisomerase II alpha was reported to be highly expressed in the oligodendroglioma patients with the poorest prognosis (Miettinen et al 2000). TMAs may become a clinical validation tool, as well as a 'global' tool; thus, recent studies reported this technique to be highly efficient for the identification of molecular (irreversible) alterations during cancer initiation and progression (Lassus et al 2001). A pathologist might, however, object that the tissue microarray provides only a partial analysis of the tumor. The array-based technologies briefly described here provide powerful means for functional analyses of cancer and other complex diseases. Undoubtedly, much more can, and will be, done with proteome or tissue arrays combined with other state-of-the-science spectroscopic techniques, as suggested in the following **Sections 2.5, 2.6, 4 and 6.2.**

**Sections 2.5, 2.6 and 6.2** will illustrate how advanced, ultra-fast and super-sensitive techniques can be used in conjunction with either nucleic acid or proteome arrays both to speed up microarray data collection a thousand-fold (for nucleic acids, proteins, ligand-binding, etc.) and to increase sensitivity to its possible limit, that of single-molecule detection.

#### **2.5 Fluorescence correlation spectroscopy and fluorescence cross-correlation spectroscopy: applications to DNA hybridization, PCR and DNA binding**

In the bioanalytical and biochemical sciences, Fluorescence Correlation Spectroscopy (FCS) techniques can be utilized to determine various thermodynamic and kinetic properties, such as the association and dissociation constants of intermolecular reactions in solution (Thompson, 1991; Schwille, Bieschke and Oehlenschläger, 1997). Examples of this are specific hybridization and renaturation processes between complementary DNA or RNA strands, as well as antigen-antibody or receptor-ligand recognition. Although of significant functional relevance in biochemical systems, the hybridization mechanism of short oligonucleotide DNA primers to a native RNA target sequence could not be investigated in detail prior to the application of FCS/FCCS to these problems. Most published models agree that the process can be divided into two steps: a reversible initiating step, in which a few base pairs are formed, and a second, irreversible phase described as a rapid zippering of the entire sequence. Because it competes with the internal binding mechanisms of the target molecule, such as secondary structure formation, the rate-determining initial step is of crucial relevance for the entire binding process. Increased accessibility of binding sites, attributable to single-stranded open regions of the RNA structure at loops and bulges, can be quantified using kinetic measurements (Schwille, Oehlenschläger and Walter, 1996).
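
The two-step picture described above (reversible initiation followed by irreversible zippering) can be written as probe + target ⇌ intermediate → duplex and integrated numerically. The Python sketch below uses a plain forward-Euler scheme; the rate constants, concentrations and time step are arbitrary illustrative values in arbitrary units, not parameters from the cited studies.

```python
def simulate_hybridization(k_on, k_off, k_zip, probe0, target0, dt=0.01, steps=10000):
    """Forward-Euler integration of: probe + target <-> intermediate -> duplex."""
    probe, target, intermediate, duplex = probe0, target0, 0.0, 0.0
    duplex_history = []
    for _ in range(steps):
        # Net rate of the reversible initiation (nucleation) step.
        nucleation = k_on * probe * target - k_off * intermediate
        # Rate of the irreversible zippering step.
        zippering = k_zip * intermediate
        probe -= nucleation * dt
        target -= nucleation * dt
        intermediate += (nucleation - zippering) * dt
        duplex += zippering * dt
        duplex_history.append(duplex)
    return probe, intermediate, duplex, duplex_history


# Hypothetical parameters (arbitrary units); zippering is faster than dissociation.
probe, intermediate, duplex, history = simulate_hybridization(
    k_on=1.0, k_off=0.5, k_zip=2.0, probe0=1.0, target0=1.0
)
```

When `k_zip` is large compared with `k_off`, nearly every initiation event proceeds to a full duplex, so the overall rate is set by `k_on`; this is the sense in which the initial step is rate-determining.
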

The measurement principle for nearly all FCS/FCCS applications so far is based upon the change in diffusion characteristics when a small labeled reaction partner (e.g., a short nucleic acid probe) associates with a larger, unlabeled one (target DNA/RNA). The average diffusion time of the labeled molecules through the illuminated focal volume element is inversely related to the diffusion coefficient, and increases during the association process. By calibrating the diffusion characteristics of the free and bound fluorescent partner, the binding fraction can be easily evaluated from the correlation curve at any time during the reaction. This principle has been employed to investigate and compare the hybridization efficiency of six labeled DNA oligonucleotides with different binding sites on an RNA target in its native secondary structure (Schwille, Oehlenschläger and Walter, 1996). Hybridization kinetics was examined by binding six fluorescently labeled oligonucleotide probes of different sequence, length and binding site to a 101-nucleotide-long native RNA target sequence with a known secondary structure. The hybridization kinetics was monitored and quantified by FCS in order to investigate the overall reaction mechanism. At the measurement temperature of 40°C the probes are mostly denatured, whereas the target retains its native structure. The binding process could be directly monitored through diffusional FCS analysis, via the change in translational diffusion time of the labeled 17-mer to 37-mer oligonucleotide probes HS1 to HS6 upon specific

provide accurate *in vivo* and *in vitro* measurements of diffusion rates, "mobility" parameters, molecular concentrations, chemical kinetics, aggregation processes, labeled nucleic acid hybridization kinetics and fluorescence photophysics/ photochemistry. Several photophysical properties of fluorophores that are required for quantitative analysis of FCS in tissues have already been widely reported. Molecular "mobilities" can be measured by

Novel, two-photon NIR excitation fluorescence correlation spectroscopy tests and preliminary results were obtained for concentrated suspensions of live cells and membranes (Baianu et al, 2007). Especially promising are further developments employing multi-photon NIR excitation that could lead, for example, to the reliable detection of cancers using NIR-excited fluorescence. Other related developments are the applications of Fluorescence Cross-Correlation Spectroscopy (FCCS) detection to monitoring *DNA- telomerase interactions*, DNA hybridization kinetics, ligand-receptor interactions and HIV-HBV testing. Very detailed, automated chemical analyses of biomolecules in cell cultures are now also becoming possible by FT-NIR spectroscopy of single cells, both *in vitro* and *in vivo*. Such rapid analyses have potentially important applications in cancer research, pharmacology and clinical diagnosis.

**2.6 Near infrared microspectroscopy, fluorescence microspectroscopy and infrared** 

*picogram* level when adequately calibrated by a suitable primary analytical method.

clinical diagnosis of viral diseases, cancers and also in cancer therapy.

**3. Mapping interactome networks** 

Novel methodologies are currently being evaluated for the chemical analysis of embryos and single cells by Fourier Transform Infrared (FT-IR), Fourier Transform Near Infrared (FT-NIR) Microspectroscopy, Fluorescence Microspectroscopy. The first FT-NIR chemical images of biological systems approaching 1micron (1μm) resolution were reported (Baianu, 2004; Baianu et al 2004), and FT-NIR spectra of oil and proteins were obtained under physiological conditions for volumes as small as 2μm3. Related, HR-NMR analyses of oil contents in somatic embryos were presented with nanoliter precision. Therefore, developmental changes may be monitored by FT-NIR with a precision approaching the

Indeed, detailed chemical analyses are now becoming possible by FT-NIR Chemical Imaging and Microspectroscopy of single cells. The cost, speed and analytical requirements are fully satisfied by FT-NIR spectroscopy and microspectroscopy for a wide range of biological specimens. These techniques were also suggested to be potentially important in functional genomics and proteomics research (Baianu et al 2004) through the rapid and accurate detection of high-content microarrays (HCMA). Multi-photon (MP), pulsed femtosecond laser NIR Fluorescence Excitation techniques were shown to be capable of *single molecule detection* (SMD). Thus, microspectroscopic techniques allow for most sensitive and reliable quantitative analyses to be carried out both *in vitro* and *in vivo.* In particular*,* MP NIR excitation in FCS allows not only *single molecule detection*, but also non-invasive monitoring of molecular dynamics and the acquisition of high-resolution, *submicron* imaging of *femtoliter* volumes inside functional cells and tissues. Such ultra-sensitive and rapid NIR-FCS analyses have therefore numerous potential applications in biomedical research areas,

Mapping protein-protein interaction networks, or charting the global interaction maps, that correspond through translation to entire genomes is undoubtedly useful for understanding

FCS over a wide range of characteristic time constants from ~10-3 to 103 ms.

**chemical imaging of single cells** 

hybridization with the larger RNA target. The characteristic diffusion time through the laser-illuminated focal spot of the 0.5 µm-diameter objective increased from 0.13 to 0.20 ms for the free probe, and from 0.37 to 0.50 ms for the bound probe within 60 min. The increase in diffusion time from measurement to measurement over the 60 min could be followed on a PC monitor and varied strongly from probe to probe. HS6 showed the fastest association, while the reaction of HS2 could not be detected at all for the first 60 min. Thus, FCS diffusional analysis provides an easy and comparatively fast determination of the hybridization time course of reactions between complementary DNA/RNA strands in the concentration range from 10⁻¹⁰ to 10⁻⁸ M. The FCS-based methodology also permits rapid screening for suitable anti-sense nucleic acids directed against important targets like HIV-1 RNA with low consumption of probes and target. Because of the high sensitivity of FCS detection, the same principle can be exploited to simplify the diagnostics for extremely low concentrations of infectious agents like bacterial or viral DNA/RNA. By combining confocal FCS with biochemical amplification reactions like PCR or 3SR, the detection threshold of infectious RNA in human sera could be lowered to concentrations of 10⁻¹⁸ M (Walter, Schwille and Eigen, 1996; Oehlenschläger, Schwille and Eigen, 1996). The method allows for simple quantification of initial infectious units in the observed samples. The isothermal Nucleic Acid Sequence-Based Amplification (NASBA) technique enables the detection of HIV-1 RNA in human blood plasma (Winkler, Bieschke and Schwille, 1997). The threshold of detection is presently down to 100 initial RNA molecules per milliliter by amplifying a short sequence of the RNA template (Schwille, Oehlenschläger and Walter, 1997).
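
The evaluation of the binding fraction from a correlation curve, once the diffusion times of the free and bound probe have been calibrated, can be sketched as follows. This is a minimal illustration and not the procedure of the cited studies: it assumes a standard two-species, three-dimensional diffusion model with equal molecular brightness for free and bound probe, an illustrative structure parameter `s`, and uses 0.13 ms and 0.37 ms purely as example calibration values. All function names are hypothetical.

```python
import math


def diffusion_term(tau, tau_d, s=5.0):
    """Normalized 3D-diffusion correlation term for a single species.

    tau and tau_d share one time unit; s is an assumed axial-to-lateral
    ratio of the focal volume (illustrative, not taken from the text).
    """
    return 1.0 / ((1.0 + tau / tau_d) * math.sqrt(1.0 + tau / (s * s * tau_d)))


def correlation(tau, n_molecules, bound_fraction, tau_free, tau_bound, s=5.0):
    """Two-species correlation amplitude G(tau), equal brightness assumed."""
    free = (1.0 - bound_fraction) * diffusion_term(tau, tau_free, s)
    bound = bound_fraction * diffusion_term(tau, tau_bound, s)
    return (free + bound) / n_molecules


def fit_bound_fraction(taus, g_values, tau_free, tau_bound, s=5.0):
    """Least-squares bound fraction with both diffusion times held fixed.

    With fixed diffusion times the model is linear in the two amplitudes,
    so the 2x2 normal equations can be solved directly.
    """
    f = [diffusion_term(t, tau_free, s) for t in taus]
    b = [diffusion_term(t, tau_bound, s) for t in taus]
    sff = sum(x * x for x in f)
    sbb = sum(x * x for x in b)
    sfb = sum(x * y for x, y in zip(f, b))
    sfg = sum(x * g for x, g in zip(f, g_values))
    sbg = sum(x * g for x, g in zip(b, g_values))
    det = sff * sbb - sfb * sfb
    amp_free = (sfg * sbb - sbg * sfb) / det
    amp_bound = (sbg * sff - sfg * sfb) / det
    return amp_bound / (amp_free + amp_bound)


# Synthetic, noise-free check: lag times from 0.001 to 100 ms, log-spaced.
taus = [10 ** (-3 + 5 * i / 99) for i in range(100)]
curve = [correlation(t, 2.0, 0.3, 0.13, 0.37) for t in taus]
estimate = fit_bound_fraction(taus, curve, 0.13, 0.37)
print(round(estimate, 3))
```

Repeating such a fit on successive correlation curves would give the hybridization time course; with real data, noise, triplet-state kinetics and brightness changes upon binding would have to be considered as well.
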
The NASBA method was combined with FCS, thus allowing the online detection of the HIV-1 RNA molecules amplified by NASBA (Oehlenschläger, Schwille and Eigen, 1996). The combination of FCS with the NASBA reaction was performed by introducing a fluorescently labeled DNA probe into the NASBA reaction mixture *at nanomolar concentrations*, hybridizing to a distinct sequence of the amplified RNA molecule. After having reached a critical concentration on the order of 0.1 to 1.0 nM (the threshold for single-photon excitation / FCS detection is ~0.1 nM), the number of amplified RNA molecules could be determined as the reaction continued its course. Evaluation of the hybridization/extension kinetics allowed an estimation of the initial HIV-1 RNA concentration present at the beginning of amplification. The value of the initial HIV-1 RNA number enables discrimination between positive and false-positive samples (caused, for instance, by carryover contamination). This possibility of sharp discrimination is essential for all diagnostic methods using amplification systems (PCR as well as NASBA). The quantification of HIV-1 RNA in plasma by combining NASBA with FCS may be useful in assessing the efficacy of anti-HIV agents, especially in the early infection stage when standard ELISA antibody tests often display negative results. Furthermore, the combination of NASBA with FCS is not restricted only to the detection of HIV-1 RNA in plasma.
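
To put the quoted thresholds on a common scale, the short sketch below converts copy numbers per milliliter into molar concentrations and estimates the mean number of molecules in an observation volume. The 1 fL focal volume is an assumption for illustration. The back-extrapolation function assumes a purely exponential amplification law, which is a simplification and not the kinetic evaluation used in the cited work; its rate constant and threshold values are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_ml_to_molar(copies_per_ml):
    """Convert a copy number per milliliter into a molar concentration."""
    return copies_per_ml * 1000.0 / AVOGADRO


def mean_molecules_in_volume(molar, volume_liters):
    """Average number of molecules present in a given observation volume."""
    return molar * volume_liters * AVOGADRO


def back_extrapolate_initial_copies(copies_at_threshold, rate_per_min, minutes):
    """Initial copy number under an ASSUMED law N(t) = N0 * exp(k * t)."""
    return copies_at_threshold * math.exp(-rate_per_min * minutes)


# 100 initial RNA copies per milliliter, expressed as a molar concentration.
detection_limit_molar = copies_per_ml_to_molar(100)
# Mean occupancy of an assumed 1 fL focal volume at 0.1 nM.
occupancy = mean_molecules_in_volume(0.1e-9, 1e-15)
print(f"{detection_limit_molar:.2e} M, {occupancy:.3f} molecules")
```

Under these assumptions, 100 copies per milliliter corresponds to a concentration well below 10⁻¹⁸ M, which illustrates why amplification is needed before the probe signal reaches the sub-nanomolar range accessible to FCS.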

The diagnosis of hepatitis (both B and C) remains much more challenging. Moreover, the number of HIV- or HBV-infected subjects worldwide is increasing at an alarming rate, with up to 20% of the population in parts of Africa and Asia being infected with HBV. In contrast to HIV, HBV infection is not particularly restricted to the high-risk groups.

Multi-photon (MPE) NIR excitation of fluorophores, whether attached as labels to biopolymers like proteins and nucleic acids or bound at specific biomembrane sites, is one of the most attractive options in biological applications of FCS. Many of the serious problems encountered in spectroscopic measurements of living tissue, such as photodamage, light scattering and auto-fluorescence, can be reduced or even eliminated. FCS can therefore
provide accurate *in vivo* and *in vitro* measurements of diffusion rates, "mobility" parameters, molecular concentrations, chemical kinetics, aggregation processes, labeled nucleic acid hybridization kinetics and fluorescence photophysics/photochemistry. Several photophysical properties of fluorophores that are required for quantitative analysis of FCS in tissues have already been widely reported. Molecular "mobilities" can be measured by FCS over a wide range of characteristic time constants from ~10⁻³ to 10³ ms.
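
As a rough guide to what this range of time constants means in terms of mobility, the standard relation between the lateral diffusion time $\tau_D$ and the diffusion coefficient $D$ for a Gaussian focal volume can be used. The lateral radius $w_{xy} = 0.25\ \mu\text{m}$ below is an assumption for illustration (half of a 0.5 µm focal spot diameter), not a value calibrated in the cited work.

$$
\tau_D = \frac{w_{xy}^{2}}{4D}
\quad\Longleftrightarrow\quad
D = \frac{w_{xy}^{2}}{4\,\tau_D}
$$

$$
\tau_D = 10^{-3}\ \text{ms} \;\Rightarrow\; D \approx 1.6\times10^{4}\ \mu\text{m}^2/\text{s},
\qquad
\tau_D = 10^{3}\ \text{ms} \;\Rightarrow\; D \approx 1.6\times10^{-2}\ \mu\text{m}^2/\text{s}
$$

With the same assumed radius, a diffusion time of 0.13 ms would correspond to $D \approx 1.2\times10^{2}\ \mu\text{m}^2/\text{s}$.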

Novel two-photon NIR excitation fluorescence correlation spectroscopy tests were carried out, and preliminary results were obtained, for concentrated suspensions of live cells and membranes (Baianu et al, 2007). Especially promising are further developments employing multi-photon NIR excitation that could lead, for example, to the reliable detection of cancers using NIR-excited fluorescence. Other related developments are the applications of Fluorescence Cross-Correlation Spectroscopy (FCCS) detection to monitoring *DNA-telomerase interactions*, DNA hybridization kinetics, ligand-receptor interactions and HIV-HBV testing. Very detailed, automated chemical analyses of biomolecules in cell cultures are now also becoming possible by FT-NIR spectroscopy of single cells, both *in vitro* and *in vivo*. Such rapid analyses have potentially important applications in cancer research, pharmacology and clinical diagnosis.

#### **2.6 Near infrared microspectroscopy, fluorescence microspectroscopy and infrared chemical imaging of single cells**

Novel methodologies are currently being evaluated for the chemical analysis of embryos and single cells by Fourier Transform Infrared (FT-IR) and Fourier Transform Near Infrared (FT-NIR) Microspectroscopy, and by Fluorescence Microspectroscopy. The first FT-NIR chemical images of biological systems approaching 1 micron (1 μm) resolution were reported (Baianu, 2004; Baianu et al 2004), and FT-NIR spectra of oil and proteins were obtained under physiological conditions for volumes as small as 2 μm³. Related HR-NMR analyses of oil contents in somatic embryos were presented with nanoliter precision. Therefore, developmental changes may be monitored by FT-NIR with a precision approaching the *picogram* level when adequately calibrated by a suitable primary analytical method.

Indeed, detailed chemical analyses are now becoming possible by FT-NIR Chemical Imaging and Microspectroscopy of single cells. The cost, speed and analytical requirements are fully satisfied by FT-NIR spectroscopy and microspectroscopy for a wide range of biological specimens. These techniques were also suggested to be potentially important in functional genomics and proteomics research (Baianu et al 2004) through the rapid and accurate detection of high-content microarrays (HCMA). Multi-photon (MP), pulsed femtosecond laser NIR Fluorescence Excitation techniques were shown to be capable of *single molecule detection* (SMD). Thus, microspectroscopic techniques allow for the most sensitive and reliable quantitative analyses to be carried out both *in vitro* and *in vivo*. In particular, MP NIR excitation in FCS allows not only *single molecule detection*, but also non-invasive monitoring of molecular dynamics and the acquisition of high-resolution, *submicron* imaging of *femtoliter* volumes inside functional cells and tissues. Such ultra-sensitive and rapid NIR-FCS analyses therefore have numerous potential applications in biomedical research areas, in the clinical diagnosis of viral diseases and cancers, and also in cancer therapy.

#### **3. Mapping interactome networks**

Mapping protein-protein interaction networks, or charting the global interaction maps that correspond through translation to entire genomes, is undoubtedly useful for understanding
cellular functions, especially when such databases can be integrated into a wide collection of biologically relevant data. A prerequisite for any '*ab initio'* determination of a selected protein interactome network is to clone the open reading frames (ORFs) that encode each protein present in the selected network. Note, however, that all current analyses involve the assumption of a *model* together with some 'hidden', or implicit, *assumptions* about sampling, 'noise' levels, or uniformity/accuracy in the database, and therefore the '*ab initio'* claim is subject to the restrictions imposed by such additional assumptions. More than 20,000 publicly accessible, full ORF clones have already been collected for human and mouse protein-coding genes in the Mammalian Genome Collection (MGC; http://mgc.nci.nih.gov). This community resource enables the next stages of human interactome analysis that will be directed at obtaining a reliable map of the entire human protein interactome. An additional 12,500 ORFs are now available from the Dana Farber Cancer Institute in Boston (USA) from high-throughput, yeast two-hybrid (Y2H) analyses. A disconcerting aspect of the latest human (partial) interactome studies by different methods is the small apparent overlap of the new human interaction datasets with each other and/or with previously reported data. This aspect will be further addressed later in this section; the principal cause of the lack of overlap is likely to be the low (<20%) overall coverage of the protein-protein interactions selected in such studies. A possible solution to this problem has been suggested (Warner et al 2006): several groups cooperating to produce 'networks of networks', constructed from separate but coordinated interaction mapping projects, 'each of which would target a specific functionality related subset of proteins and interactions'. 
A more effective solution would be, however, to increase the throughput, accuracy and reliability of PPI data through improved technologies (such as FCCS, or other techniques already proposed in Section 2.5), to reduce significantly the cost of such analyses, and to improve the models employed for data analysis. Examples of improved modeling tools for this purpose, such as logical, ontological genetics and categorical ones, that are also appropriate for assembling the '*networks of networks*…' as in the previous approach suggested by Warner et al. (2006), were presented above in Section 1, and are described in further detail in a recent report (Baianu et al 2006) and also in two forthcoming publications (Baianu and Poli, 2011; Baianu et al 2010). Interactome network studies are currently undertaken by a number of international research teams in the US, Europe and Japan (CSH/WT, 2006; Warner et al 2006). These studies are currently undertaken only for interactome subnetworks because of both technical and funding limitations. The organisms studied were: yeast (*Saccharomyces cerevisiae*), worm (*Caenorhabditis elegans*), fruitfly (*Drosophila melanogaster*) and humans. Proteome networks were investigated for several specific biological processes such as DNA degradation, ubiquitin conjugation, multivesicular formation, intracellular membrane trafficking, signal transduction/TNF-α tumor necrosis and NF-κB mediated pathways, and early stages of T-cell signaling (for a brief summary see the recent review by Warner *et al* 2006, and references cited therein). Such challenging studies face both methodological problems, such as limited sampling (Han *et al* 2006) and consideration of only pairwise ('binary') protein-protein interactions, and the more serious technical problem of false-positive interactions in the presence of significant 'noise' levels associated with the experimental technologies and design currently employed in such studies. 
Such limitations should be borne in mind (Han et al 2006) when global topology predictions are made for the whole interactome based on partial, incomplete data obtained for subnetworks that may contain less than 20% of the entire interactome network. On a more optimistic note are the recent attempts at comparing the cancer protein, human interactome (sub)networks with normal human interactome networks that involve multiple protein-protein interactions (Jonsson and Bates, 2006). The latter studies reduced the 'noise' level in the human protein interaction data by employing an orthology-based method described previously by Jonsson et al. (2006). This method claims to reduce the 'noise' level in protein-protein interaction (PPI) data by identifying putative interactions based on homology to experimentally determined interactions in a range of different species; both the DIP (Salwinsky et al 2004) and the MIPS Mammalian Protein-Protein Interaction (Pagel et al 2005) databases were utilized. Furthermore, the complete interactome data set that was employed is available as Supplementary Material from *loc. cit.* The conclusion was drawn that cancer proteins have an increased frequency of protein-protein interactions in comparison with the proteins that were studied in normal cells, and this was interpreted as evidence "*indicating an underlying evolutionary pressure to which cancer genes, as genes of central importance are subjected*." It remains to be seen, however, if human interactome studies, which occur with increasing frequency, have indeed overcome the sampling objections raised by Han et al. (2006). The more extensive interactome data and analyses that have been reported to date, though still quite limited, are readily available and include the following: Y2H (partial data-based) interactome maps for *C. elegans* (Li et al 2004) and *Drosophila melanogaster* (Giot et al 2003; Formstecher et al 2005), and also proteome maps obtained by co-affinity purification followed by mass spectrometry analysis in the yeast *Saccharomyces cerevisiae* (co-AP/MS: Gavin et al 2002; Ho et al 2002; Han et al 2004). The reports on the microbial transcriptional regulation network of *Escherichia coli* (Shen-Orr et al 2002) and on *Helicobacter pylori* protein complexes in the proteome map (Terradot et al 2004) are also worth mentioning in this context. A first draft of the human interactome has also been reported (Lehner and Fraser, 2004); although this human interactome map does not seem to have been included in the computational investigations of Han et al. (2006), it remains to be verified, or validated, by further extensive studies with improved technology and adequate models for a more comprehensive data analysis. The comprehensive two-hybrid analysis for exploring the protein interactome network was previously reported by Ito et al. (2001). Alternative interaction mapping strategies have also been developed over the last five years. An example is tandem affinity purification (TAP) in conjunction with liquid chromatography tandem mass spectrometry (LC-MS/MS; see, for example, Gavin et al 2006). Such methods have, however, both advantages and limitations. An interesting new approach to the determination of protein complexes has been developed that involves a combination of fluorescence spectroscopy with peptide microarrays (Stoevesandt, *cited in* Warner 2006); this methodology was then applied to investigate T-cell signaling.
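
The argument made earlier in this section, that low coverage alone can account for the small overlap between independent interaction datasets, can be illustrated with a simple calculation. The sketch below assumes that each screen samples the true interactions uniformly and independently and reports no false positives; these are idealizations, and the interactome size of 100,000 interactions is a hypothetical figure chosen only for the simulation.

```python
import random


def expected_shared_fraction(coverage_a, coverage_b):
    """Expected fraction of ALL true interactions detected by both screens."""
    return coverage_a * coverage_b


def simulate_overlap(n_true, coverage_a, coverage_b, seed=0):
    """Draw two independent screens and return (share of A also in B, share of total)."""
    rng = random.Random(seed)
    interactions = range(n_true)
    set_a = set(rng.sample(interactions, round(n_true * coverage_a)))
    set_b = set(rng.sample(interactions, round(n_true * coverage_b)))
    shared = len(set_a & set_b)
    return shared / len(set_a), shared / n_true


# Two screens, each covering 20% of a hypothetical interactome.
share_of_a, share_of_total = simulate_overlap(100_000, 0.2, 0.2)
print(f"{share_of_a:.3f} of screen A is confirmed by screen B")
print(f"{share_of_total:.3f} of all interactions are seen by both")
```

Under these assumptions only about one fifth of each dataset would be confirmed by the other, and only about 4% of all interactions would be seen by both screens, even if every reported interaction were genuine. False positives would reduce the apparent overlap further.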

#### **4. Cell cyclins expression and modular cancer interactome networks**

Carcinogenesis is a complex process that involves dynamically inter-connected biomolecules in the intercellular, membrane, cytosolic, nuclear and nucleolar compartments that form numerous inter-related pathways referred to as networks. One such family of pathways contains the cell cyclins. Cyclins are often overexpressed in cancerous cells (Dobashi et al 2004).

Our novel theoretical analysis based on recently published studies of cyclin signaling, with special emphasis placed on the roles of cyclins D1 and E, suggests novel clinical trials and rational therapies of cancer through re-establishment of cell cycling inhibition in metastatic cancer cells.


Fig. 1. Gene database of Cyclin-D1; *Source:* PBD website: http://www.dsi.univ-paris5.fr/genatlas/fiche.php?symbol=CCND1

#### **4.2 The p27 and p21 proteins**

The proteins p27 and p21 were reported to be implicated in cyclin regulation and cancer development (**Fig. 2**). Mouse embryonic fibroblasts that were deficient for p27 and p21 were found to contain less cyclin D1 (Hashemolhosseini et al 1998) and D2 (Cheng et al 1999), as well as less cyclin D3 (Bagui et al 2000), than controls. Similarly, mammary glands of p27-deficient mice were shown to possess decreased cyclin D1 levels (Muraoka et al 2001). It has been demonstrated *in vivo* that p27 is necessary for maintaining proper levels of cyclins D2 and D3, and that this dependency on p27 is common to a wide variety of cells and tissues. Regarding the molecular interaction between p27 and D-cyclin, CDK4 is a clear candidate as a mediating molecule (Bryja et al 2004). Cells employ CDK4/6-cyclin D complexes to flexibly titrate p27 from the complexes containing CDK2, and thereby they control their proliferation. However, the mutual dependency between cyclin D and p27 also serves some as yet unidentified function in differentiation-related processes. Thus, loss of p27 not only causes unrestricted growth due to inefficient inhibition of CDK2-cyclin E/A, but may also elicit a decrease in the levels of D-type cyclins, resulting in differentiation defects. Upon ablation of cyclin D, cells lose their ability to titrate p27 from CDK2-cyclin A/E complexes and proliferation is suppressed. However, defects in differentiation caused by the absence of D-cyclin are reminiscent of defects produced by the absence of p27 (Bryja et al 2004). When changes in the levels of p27 and/or D-type cyclins occur, the equilibrium between proliferation and differentiation processes could be altered, which may in the end result in tumorigenesis (Bryja et al 2004).

#### **4.3 D1 vs. E-cyclins**

The D-type and E-type cyclins control the G1 → S phase transition during normal cell cycling and are important components of steroid- and growth factor-induced mitogenesis in breast epithelial cells (Sutherland and Musgrove, 2004). Cyclin D1 null mice are resistant to breast cancer that is induced by the *neu* and *ras* oncogenes, which suggests a pivotal role for

#### **4.1 Cyclins**

Cyclins are proteins that link several critical pro-apoptotic and other cell cycling/division components, including the tumor suppressor gene TP53 and its product, the Thomsen-Friedenreich antigen (T antigen), Rb, mdm2, c-Myc, p21, p27 and Bax, all of which play major roles in the carcinogenesis of many cancers. Cyclin-dependent kinases (CDKs), their respective cyclins, and inhibitors of CDKs (CKIs) were identified as instrumental components of the cell cycle-regulating machinery. CDKs are enzymes that phosphorylate several cellular proteins, thus 'fueling' the sequential transitions through the cell division cycle. In mammalian cells the complexes of cyclins D1, D2, D3, A and E with CDKs are considered motors that drive cells to enter and pass through the **S**-phase. Cell cycle regulation is a critical mechanism governing cell division and proliferation, and is finely regulated by the interaction of cyclins with CDKs and CKIs, among other molecules (Morgan et al 1995).

It was also reported that CDKs have another key role: the coordination of cell cycle progression with responses to possible DNA damage that could, if unchecked or unrepaired, lead to a loss of genomic integrity marking the onset of cell disease, including cancers (Huang et al 2006 in *Science*). The **S**-phase is thought to be the most vulnerable interval of the cell cycle because during this interval all 3 billion DNA bases of the human genome must be replicated precisely, in the sense of 'carbon copies' being made of the existing DNA strands, without any breaks in the sequence or base substitutions in the replicated strands. Correct replication therefore determines the cell's survival, especially under genotoxic conditions such as those caused by mutagens or by X-ray and gamma radiation. Furthermore, Huang et al. (2006) reported that CDK mediates the phosphorylation of FOXO1, the transcriptional activator of the proapoptotic genes, during the **S**-phase; when DNA damage occurs either before or during the **S**-phase, a complex network is activated in the cell which 'silences' CDK, thereby either delaying or arresting cell cycle progression. This may allow the cell to repair the DNA damage by recombination involving BRCA2 and survive. However, if this is not possible because the DNA damage is too great or irreparable, then FOXO1 triggers apoptosis (cell death). It was proposed that during the unperturbed (normal) **S**-phase CDK2 phosphorylates FOXO1 at the Serine249 residue in the cell nucleus, which then results in the transfer and sequestering of FOXO1 in the cytoplasm, where it is well separated from the proapoptotic genes, the 'target' of FOXO1 action.

Moreover, the CDK-mediated phosphorylation of BRCA2 during the unperturbed **S**-phase renders DNA recombination inactive. On the other hand, when DNA becomes damaged, CDK2 is inhibited through the Cdc25A pathway, with the consequence that dephosphorylated FOXO1 remains in the cell nucleus and is able to activate the proapoptotic genes, unless BRCA2 is able to induce DNA recombination and repair in time to prevent apoptosis. The steps that follow are then as explained above: either DNA repair and continued cell cycling, or apoptosis induced by FOXO1. Several important questions regarding the entire process still need to be answered before the FOXO1 and CDK2 mechanisms of action can be translated into successful clinical trials.
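The decision logic described in the preceding paragraphs can be summarized as a small Boolean sketch. It is a deliberately simplified illustration rather than a quantitative model: the class `SPhaseState`, the function names, and the reduction of each molecular step to a true/false value are assumptions introduced here and do not come from Huang et al. (2006).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPhaseState:
    """Hypothetical two-variable summary of a cell in S-phase."""
    dna_damaged: bool      # damage present before or during S-phase
    repair_succeeds: bool  # BRCA2-dependent recombination repairs the damage in time


def cdk2_active(state: SPhaseState) -> bool:
    # In the text, DNA damage inhibits CDK2 through the Cdc25A pathway.
    return not state.dna_damaged


def foxo1_location(state: SPhaseState) -> str:
    # Active CDK2 phosphorylates FOXO1 (Ser249), which is then sequestered
    # in the cytoplasm; without CDK2 activity FOXO1 stays in the nucleus.
    return "cytoplasm" if cdk2_active(state) else "nucleus"


def cell_fate(state: SPhaseState) -> str:
    # Unperturbed S-phase: FOXO1 is kept away from its proapoptotic targets.
    if cdk2_active(state):
        return "continue cycling"
    # Damaged DNA: nuclear FOXO1 triggers apoptosis unless repair wins the race.
    if state.repair_succeeds:
        return "repair, then continue cycling"
    return "apoptosis"


if __name__ == "__main__":
    for damaged in (False, True):
        for repaired in (False, True):
            s = SPhaseState(dna_damaged=damaged, repair_succeeds=repaired)
            print(s, foxo1_location(s), cell_fate(s))
```

The sketch makes explicit that, in this reading of the text, the outcome depends on only two questions: whether CDK2 is still active, and whether repair completes before nuclear FOXO1 acts.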

A positive correlation has been noticed between over-expression of several cell-cycle proteins and unfavorable prognoses and outcomes in several different cancer types (van Diest et al 1995; Fukuse et al 2000). In human lung tumors and soft tissue sarcomas, it was discovered that cyclin A/cdk2 complex expression and kinase activity were reliable predictors of proliferation and unfavorable prognosis, thereby further substantiating the epidemiological factors of cyclin signaling (Dobashi et al 2003; Noguchi et al 2000).


Fig. 1. Gene database of Cyclin-D1; *Source:* PBD website: http://www.dsi.univ-paris5.fr/genatlas/fiche.php?symbol=CCND1

#### **4.2 The p27 and p21 proteins**

The proteins p27 and p21 were reported to be implicated in cyclin regulation and cancer development (**Fig. 2**). Mouse embryonic fibroblasts that were deficient for p27 and p21 were found to contain less cyclin D1 (Hashemolhosseini et al 1998), D2 (Cheng et al 1999) and D3 (Bagui et al 2000) than controls. Similarly, mammary glands of p27-deficient mice were shown to possess decreased cyclin D1 levels (Muraoka et al 2001). It has been demonstrated *in vivo* that p27 is necessary for maintaining proper levels of cyclins D2 and D3, and that this dependency on p27 is common to a wide variety of cells and tissues. Regarding the molecular interaction between p27 and the D-type cyclins, CDK4 is a clear candidate as a mediating molecule (Bryja et al 2004). Cells employ CDK4/6–cyclin D complexes to flexibly titrate p27 away from the complexes containing CDK2, and thereby control their proliferation. However, the mutual dependency between cyclin D and p27 also serves some as yet unidentified function in differentiation-related processes. Thus, loss of p27 not only causes unrestricted growth due to inefficient inhibition of CDK2–cyclin E/A, but may also elicit a decrease in the levels of D-type cyclins, resulting in differentiation defects. Upon ablation of cyclin D, cells lose their ability to titrate p27 from CDK2–cyclin A/E complexes and proliferation is suppressed. However, the defects in differentiation caused by the absence of D-type cyclins are reminiscent of the defects produced by the absence of p27 (Bryja et al 2004). When the levels of p27 and/or D-type cyclins change, the equilibrium between proliferation and differentiation processes could be altered, which may in the end result in tumorigenesis (Bryja et al 2004).

#### **4.3 D1 vs. E-cyclins**

The D-type and E-type cyclins control the **G1 → S** phase transition during normal cell cycling and are important components of steroid- and growth factor-induced mitogenesis in breast epithelial cells (Sutherland and Musgrove, 2004). Cyclin D1 null mice are resistant to breast cancer that is induced by the *neu* and *ras* oncogenes, which suggests a pivotal role for cyclin D1 in the development of some mammary carcinomas (Sutherland and Musgrove, 2004). Cyclin D1 and E1 are usually overexpressed in breast cancer, with some association with adverse outcomes, which is likely due in part to their ability to confer resistance to endocrine therapies. The consequences of cyclin E overexpression in breast cancer are related to cyclin E's role in cell cycle progression, and those of cyclin D1 overexpression may also be a consequence of a role in transcriptional regulation (Sutherland and Musgrove, 2004). One critical pathway determining cell cycle transition rates of the **G1 → S** phase is the cyclin/cyclin-dependent kinase (Cdk)/p16Ink4A/retinoblastoma protein (pRb) pathway (Sutherland and Musgrove, 2004). Alterations of different components of this particular pathway are ubiquitous in human cancer (Malumbres and Barbacid, 2001). There appears to be a certain degree of tissue specificity in the genetic abnormalities within the Rb pathway. A model relating Rb to cyclin control in the overall scheme of pro-apoptotic behavior is shown in **Fig. 2.**

In breast cancer these abnormalities include the over-expression of cyclins D1, D3 and E1, the decreased expression of the p27Kip1 CKI, and p16Ink4A gene silencing through promoter methylation. These aberrations occur with high frequency in breast cancer, as each abnormality occurs in ~40% of primary tumors. This fact implicates a major role for the loss of function of the Rb pathway in breast cancer. Cyclin D1 is the product of the *CCND1* gene and was first connected to breast cancer after localization of the gene to chromosome 11q13, a region commonly amplified in several human carcinomas, including ~15% of breast cancers (Ormandy et al 2003). The fact that cyclin D1 is overexpressed at the mRNA and protein levels in 50% of primary breast cancers has caused cyclin D1 to be considered one of the most commonly over-expressed breast cancer oncogenes (Gillett et al 1994). Although cyclin E1 locus amplification is rare in breast cancer, the protein product is overexpressed in over 40% of breast carcinomas (Loden et al 2002). Cyclin D1 is predominantly overexpressed in ER+ tumors, and cyclin E overexpression is confined to ER− tumors (Gillett et al 1994; Loden et al 2002). The overexpression of several cell cycle regulators, including c-Myc, E2F-1 and HPV, has been strongly associated with apoptotic-like behavior, as well as frank apoptosis, in cancer cells. Apoptosis and its connection to cell cycle-related proteins is of therapeutic interest, as therapies of this type could ultimately lead to cancer cell annihilation *via* apoptosis. Recently, a shift has occurred, changing the focus of chemotherapy from the exploration of agents that cause cell growth arrest to those that favor apoptosis.

#### **5. Biomedical applications of microarrays in clinical trials**

#### **5.1 Microarray applications to gene expression: identifying signaling pathways**

Changes in homeostasis can be followed through various experimental strategies that monitor gene expression profiles, for example by employing high-throughput microarray technology. This section briefly discusses the successful use of microarray technology in RNA expression studies aimed at identifying signaling pathways that are regulated by key genes implicated in carcinogenesis/tumorigenesis. A primary objective of tumor-profiling experiments is to identify transcriptional changes that may be the cause of the transition from the normal to the tumor phenotype. Such changes may, however, also occur as a consequence of neoplastic transformation. More importantly, this approach may allow the identification of molecular fingerprints that can be utilized for the classification of different tumor types, and are therefore valuable diagnostic molecular tools in cancer patients. For example, Alizadeh et al. (2000) have successfully used such an approach to identify molecularly distinct subclasses of diffuse large B-cell lymphoma that could not be distinguished by conventional diagnostic tools. In another study, a molecular fingerprint comprising approximately 50 genes was isolated from a total of over 6,000, and this fingerprint can reliably differentiate between acute myeloid leukemia and acute lymphoblastic leukemia (Golub et al 1999).

Fig. 2. Pro-Apoptotic Cancer Cycling Model: an update based on the previous model of Aguda et al. (2003).


#### **5.1.1 Identification of specific transcriptional targets in cancer**

The approach requires, however, multiple independent experiments with several large groups of samples in order to enable one to reliably and reproducibly separate the biologically relevant changes from false ones that may occur as a result of the genetic heterogeneity between individual samples from the same tumor, for example. The two examples quoted above were able to reproducibly identify tumor type-specific molecular determinants through multiple experiments with various tissue samples.
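To make the idea of a tumor type-specific "molecular fingerprint" more concrete, the following minimal sketch ranks genes by how well they separate two tumor classes, using a signal-to-noise style score (difference of class means divided by the sum of class standard deviations). The function names, the toy gene labels and the toy values are assumptions introduced for illustration; the text does not specify the exact statistics used in the cited studies, and real analyses require many samples and independent validation, as noted above.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Score one gene: (mean_a - mean_b) / (sd_a + sd_b).

    class_a, class_b: expression values of the gene in each tumor class
    (at least two samples per class, so that stdev is defined).
    """
    denom = stdev(class_a) + stdev(class_b)
    if denom == 0:
        return 0.0  # no within-class variation; score left undefined -> 0
    return (mean(class_a) - mean(class_b)) / denom


def rank_genes(expr_a, expr_b, top_n):
    """Return the top_n genes with the largest absolute score.

    expr_a, expr_b: dict mapping gene name -> list of expression values.
    Only genes measured in both classes are scored.
    """
    scores = {
        gene: signal_to_noise(expr_a[gene], expr_b[gene])
        for gene in expr_a
        if gene in expr_b
    }
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical toy data: three genes, three samples per class.
    class_1 = {"g1": [10, 11, 12], "g2": [5.0, 5.5, 6.0], "g3": [1, 9, 5]}
    class_2 = {"g1": [2, 3, 4], "g2": [5.0, 6.0, 5.5], "g3": [2, 8, 5]}
    print(rank_genes(class_1, class_2, top_n=1))
```

Genes whose class means differ strongly relative to their within-class spread rise to the top of the ranking; a fingerprint would then consist of the top-ranked genes, subject to confirmation in independent sample groups.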

A different experimental approach to the one presented above is, however, needed for identifying specific targets, such as defined genes that are implicated in cancer progression; this involves monitoring the changes in transcriptional profile that occur as a result of modulating the expression level of the defined gene, or genes, selected for such studies. The altered expression profile can be viewed as a 'blueprint' by which the defined gene controls its cellular function. The transcriptional profiles are thus employed to define *downstream signaling pathways* that have been previously validated through other techniques such as differential display (Tanaka et al 2000) and serial analysis of gene expression (Yu et al 1999). This approach, combined with microarray technology, allows the simultaneous identification of all potential targets. Its only drawback is the reliance upon prior knowledge of the selected genome for such investigations. The caveat is, however, that the investigator who employs this approach also needs to devise additional experiments in order to confirm that the genes identified with the microarray are indeed *physiologically relevant* targets.
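As a rough illustration of the approach described above, the sketch below compares expression values measured with and without modulation of a defined gene and flags genes whose log2 fold change exceeds a cutoff. Everything in it is an assumption for illustration: the function names, the pseudocount, the two-fold threshold and the toy gene labels are not taken from the cited studies, and genes flagged this way are only candidates that would still need independent confirmation as physiologically relevant targets.

```python
import math


def log2_fold_change(induced, uninduced, pseudocount=1.0):
    """log2 ratio of induced to uninduced signal.

    A pseudocount (assumed value) avoids division by zero for absent signals.
    """
    return math.log2((induced + pseudocount) / (uninduced + pseudocount))


def candidate_targets(induced, uninduced, threshold=1.0):
    """Return {gene: log2 fold change} for genes changing by >= threshold.

    induced, uninduced: dict mapping gene name -> expression value.
    threshold=1.0 corresponds to a two-fold change in either direction.
    """
    hits = {}
    for gene, value in induced.items():
        if gene not in uninduced:
            continue  # gene not measured in both conditions
        lfc = log2_fold_change(value, uninduced[gene])
        if abs(lfc) >= threshold:
            hits[gene] = lfc
    return hits


if __name__ == "__main__":
    # Hypothetical toy measurements for three genes.
    with_gene_on = {"geneA": 7.0, "geneB": 3.0, "geneC": 0.0}
    with_gene_off = {"geneA": 1.0, "geneB": 3.0, "geneC": 3.0}
    print(candidate_targets(with_gene_on, with_gene_off))
```

A positive value marks a gene that is induced when the defined gene is expressed, a negative value one that is repressed; unchanged genes are dropped.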

#### **5.1.2 Identification of downstream transcriptional targets of the BRCA1 tumor-suppressor gene**

The breast and ovarian cancer susceptibility gene BRCA1 is probably the most studied gene in the breast cancer field because of its clinical significance and multiple functions. BRCA1 was shown to be mutated in the germ line of women with a genetic predisposition to either breast or ovarian cancer (Miki et al 1994). Most of the mutations reported result in the premature truncation of the BRCA1 protein. BRCA1 is known to encode a 1863-amino-acid phosphoprotein that is predominantly localized to the nucleus, presumably with a unique function. Protein sequence analysis identified a C-terminal BRCT motif, which was then postulated to play a role in cell cycle checkpoint control in response to DNA damage (Koonin, Altschul and Bork, 1996). Consistent with this postulated role, BRCA1 becomes hyperphosphorylated in response to various agents that damage DNA, such as X-ray irradiation, an effect that was reported to be partially mediated by chk2 kinases (Lee et al. 2000). Furthermore, BRCA1 has been shown to be implicated in at least three functional pathways:

- Mediating the cellular response to DNA damage,
- Acting as a cell cycle checkpoint protein, and
- Functioning in the regulation of transcription.

However, the physiological significance of such BRCA1 actions, as well as their relationships with the function of BRCA1 as a tumor-suppressor gene, still remain to be defined. Further details are presented next.

#### **The BRCA1-BARD1 ubiquitin ligase**

As noted above, the BRCA1 gene encodes a 1863-amino-acid protein (Miki et al 1994) which consists of a RING-finger domain in its N-terminal region, a region that includes a nuclear localization signal and a domain that binds to many cellular proteins, and tandem BRCT domains in its C-terminal region. BRCA1 is associated with a diverse range of biological processes, such as DNA repair, cell cycle control, transcriptional regulation, apoptosis and centrosome duplication. Thus, a specific role has already been postulated for BRCA1 in transcriptional regulation. The C-terminal domain of BRCA1 was reported to contain a potent transactivation domain when it was fused to a heterologous DNA binding motif (Monteiro, August and Hanafusa, 1996). The oligonucleotide array-based expression profiling described above in Section 2.2 was employed by Haber (2000) in collaboration with Affymetrix Co. to identify the downstream transcriptional targets of the BRCA1 tumor-suppressor gene in order to define its function (Harkin et al 1999). A known biochemical function of BRCA1 is its E3 ubiquitin ligase activity. The following reported observations provide only indirect, additional clues to the tumor-suppressor gene function of BRCA1. Germ line mutations of BRCA1 were reported for half of breast-ovarian cancer pedigrees and for approximately 10% of women with early onset of breast cancer, uncorrelated with their family history (Fitzgerald et al 1996). It was also shown in other studies that somatic inactivation of BRCA1 is rare in sporadic breast cancers (Futreal et al 1994), and mutations were reported for approximately 10% of sporadic ovarian cancers, therefore suggesting potentially distinct genetic mechanisms for sporadic breast and ovarian cancers (Berchuk et al 1998). The reduced BRCA1 protein expression reported for the majority of sporadic breast cancers suggests that *epigenetic mechanisms*, such as those described in **Section 6**, play a significant role in regulating BRCA1 expression (Wilson et al 1999).

Furthermore, a defect was reported in the transcription-coupled repair of oxidative-induced DNA damage in mouse embryo fibroblasts with attenuated BRCA1 function (Gowen et al 1998); this observation would suggest that BRCA1 plays a more general role in mediating the cellular response to DNA damage. BRCA1 has also been reported to be involved in cell cycle checkpoint control, becoming hyperphosphorylated during the late **G1** and **S** phases, and then transiently dephosphorylated early after the M phase (Ruffner and Verma, 1997). Moreover, BRCA1 overexpression has been reported to induce a **G1/S** arrest in human colon cancer cells (Somasundaram et al, 1997). By comparison with the cancer regulation model in **Figure 2**, it seems very significant for oncogenesis that BRCA1 is *physically associated* with the transcriptional regulators p53 (Ouichi et al 1998), CtIP (Yu et al 1998) and c-Myc (Wang et al 1998), as well as the histone deacetylases HDAC1 and HDAC2 (Yarden and Brody 1999). The physical association of BRCA1 with c-Myc acquires special significance as c-Myc seems to be involved in controlling telomerase activity, whereas p53 is involved in DNA repair, cell cycling and apoptosis. Therefore, in the simplified model presented in Figure 3, one should add the BRCA1 links to both p53 and c-Myc in order to facilitate an understanding of the possible roles of BRCA1 in oncogenesis.

#### **5.1.3 Selecting gene expression systems**

There are several related problems in studying gene function by expression profiling. For example, it has often been reported to be difficult to generate cell lines that overexpress genes such as BRCA1 or *p*53, because their forced overexpression can lead either to growth suppression or to apoptosis (as shown, for example, in Figure 3 and at the end of the previous section). However, in the case of BRCA1, it was reported that the *tet-off* inducible expression system (Gossen and Bujard 1992) can be utilized to generate cell lines with highly regulated inducible expression of BRCA1 (Harkin et al, 1999). This inducible expression system introduces into the cells a chimeric transactivator, which consists of the *tet* repressor fused to the VP16 transactivation domain. This chimeric transactivator is inactive in the presence of tetracycline, whereas in the absence of tetracycline it can bind to promoters that contain the *tet* operator sequence; the latter sequence is then utilized to drive the expression of BRCA1. This expression system has a major advantage in that it allows a change in just one parameter, the induction of BRCA1. The BRCA1 induction in one population is the only difference between the genetic backgrounds of the two populations that are being compared by oligonucleotide arrays. A number of BRCA1 transcriptional targets can thus be identified with Affymetrix oligonucleotide arrays, and among these, the stress and DNA damage-inducible gene *GADD45* was the gene that exhibited the greatest degree of differential signal intensity (Harkin et al, 1999). The specific target genes thus identified were also verified by Northern blot or quantitative reverse transcriptase-PCR analysis in order to confirm induction in response to the stimulus, that is, the induction of BRCA1 (Harkin et al, 1999).

Total RNA was extracted from cells in which the exogenous BRCA1 was either switched off (+ *tet*) or switched on (– *tet*). Fluorescent images were generated using the Affymetrix human cancer G110 array containing approximately 1,700 genes that were previously reported to be implicated in cancer; such fluorescent images were then scanned and analyzed. Two lanes were present in such images; they corresponded to individual arrays hybridized with biotinylated cRNA probes generated from cells in which exogenous BRCA1 was either repressed (+ *tet*) or induced (– *tet*). Each gene on the array was represented by 16 probe pairs, each pair consisting of one wild-type probe and one probe containing a mismatch at the central nucleotide. In such fluorescent images, two genes, GADD45 and ATF3, were identified (and confirmed by Northern blot analysis) as being *transcriptional targets of the BRCA1 tumor-suppressor gene*. Furthermore, in this BRCA1 study, the induction of GADD45 by BRCA1 was reported to be correlated with the BRCA1-mediated activation of the c-jun *N*-terminal kinase/stress-activated protein kinase (JNK/SAPK) pathway. Significantly, the activation of JNK/SAPK was then shown to be required for the BRCA1-mediated apoptotic cell death in this cell line system. This finding suggests an interesting model for BRCA1-mediated apoptosis, as presented in some detail by Harkin et al (1999). Most significantly, the experimental approach reported by Harkin et al (1999) was indeed able to define *physiologically relevant* target genes.

In another recent report, Yu et al (2001) utilized a modified version of the *tet-off* inducible expression system to define the downstream transcriptional targets of the *p*53 tumor-suppressor gene (Yu et al 1999). A total of 34 genes were identified that exhibited at least a 10-fold upregulation in response to the inducible expression of *p*53. Somewhat surprisingly, there was a marked heterogeneity of the response when it was evaluated in different cell lines derived from the same tissue of origin. Among the 33 genes studied only nine were found to be induced in a panel of five unrelated colorectal cell lines, and 17 were induced in a subset; eight were not induced at all in any of the five cell lines examined. This can be interpreted as being due to a high degree of cell type specificity. Furthermore, for the majority of the genes identified, *p*53 was not absolutely required for induction in response to either adriamycin or 5-FU. Therefore, these agents do not seem to act exclusively through *p*53, suggesting that there is inherent redundancy in the majority of signaling pathways. Such inherent redundancy in the signaling pathways of cancer cells and untransformed cells might be important in understanding the results of clinical trials in cancer treatment with signal transduction modulators that will be discussed in the next subsection **(5.2).**

#### **5.2 Clinical trials with signal transduction inhibitors: novel anticancer drugs active in chemo-resistant tumors**

Recently, an increasing number of reports have suggested that human cancers frequently involve pathogenic mechanisms which give rise to numerous alterations in signal transduction pathways. Therefore, novel therapeutic agents that target specific signal transduction molecules or signaling pathways altered in cancer are currently undergoing clinical trials, often with remarkable results in the treatment of patients in whom chemotherapy- and/or radiotherapy-resistant tumors have become apparent. For example, several new classes of such anti-cancer drugs are:

- monoclonal antibodies Herceptin and C225;
- tyrosine/threonine kinase inhibitors, including STI-571 ('Gleevec', or Imatinib Mesylate), ZD-1839 ('Iressa'), OSI-774, and flavopiridol, which are ATP-site antagonists and have recently completed phase I and phase II trials (see for example, Liu et al, 1999);
- several other kinase antagonists that are currently undergoing clinical evaluations, including UCN-01 and PD184352;
- other strategies for downmodulating kinase-driven signaling, which include 17-allyl-amino-17-demethoxygeldanamycin and rapamycin derivatives; phospholipase-directed signaling may also be modulated by alkylphospholipids;
- farnesyltransferase inhibitors, originally developed as inhibitors of *ras*-driven signals, which may attain activity by affecting other or additional targets (see for example, Zujewski, Horak, Bol, et al., 2000; End, Smets, Todd, et al., 2001).

Signal transduction is an efficient method for fine-tuning the development and modeling of cancer treatments (Ideker et al., 2001, 2002). There is also a detailed NCI report on clinical trials and signal transduction modulators as novel anticancer agents (Sausville, Elsayed, Monga and Kim, 2003).

#### **5.3 Interactome-transcriptome analysis and differential gene expression in cancer**

It has been claimed that high-throughput yeast two-hybrid (HT-Y2H) methods will allow a systematic approach to functional genomics, by placing individual genes in the global context of cellular functions (Mendelsohn and Brent, 1999). High-throughput screening methods such as HT-Y2H have indeed allowed the mapping of the first interactomes for three eukaryotes (Giot et al 2003; Li et al 2004; Uetz et al 2000). Because of the human interactome's much larger size and its very high degree of complexity, there will be quite high costs and labor involved in obtaining the necessary data, for example, for an HT-Y2H mapping of a complete human cell interactome. Furthermore, the complete data analysis, together with the assembly of the complete interactome network, is likely to require both conceptual and computational advances, in addition to a significant amount of time and collective effort by one or several research teams. In view of the high potential importance of the human interactome for cancer therapy, and also for improved diagnosis and 'rational' clinical trials, such an effort should now be a top priority. Such an effort should also be coordinated with an improved mapping of the complete yeast interactome as a model, or test, system. In the meantime, since 2005 there have been a few reports of 'surrogate', or partial, human cancer cell interactomes in the form of predicted maps of human protein interaction networks based on partial data and comparative analysis. Such studies emphasize even further the need and urgency for the complete mapping of several human cancer cell interactomes.

Following the seminal studies of DeRisi et al (1996) that utilized cDNA microarrays to analyze gene expression patterns in human cancer, there have been relatively few attempts at deriving hypothetical gene expression patterns in human cancer. The first claim of such an attempt was recently made by Wachi, Yoneda and Wu (2005) for genes that were differentially expressed in squamous cell lung cancer tissues from five patients who had undergone surgical removal of the tumor(s). cRNA samples were prepared and hybridized to arrays obtained from Affymetrix® (HG-U133A™). These authors were able to carry out paired *t*-test analyses for each *individual* patient in order to distinguish the genes whose expression levels in the squamous lung cancer cells differed from those in the paired normal lung tissue (control samples) obtained from the same five individuals. The authors' prediction methodology will be briefly discussed in the next subsection, as some of the details are relevant for the evaluation of these results, which were the first to be reported for the (hypothetical) interactome-transcriptome analysis of human cancer cell data for a group of five patients with the same diagnosed form of (lung) cancer, and with the same treatment (tumor removal by surgery). Hypothetical human protein interaction maps are a relatively new endeavor (Brown and Jurisica, 2005; Lehner and Fraser, 2004), perhaps because they are likely to have many false positives, as well as to miss a significant fraction of the relevant, real protein-protein interactions. Currently, microarray analysis still suffers inherently from relatively high noise levels and the accompanying information loss (buried in noise); although this inherent noise problem is partially eliminated through multiple replicate analyses, the number of replicates is often limited by the availability of material and its cost. Another significant problem of such microarray projects is the huge amount of data that needs to be processed in order to obtain useable information (Claverie, 1999).
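As an illustration of the paired comparison described above, the following minimal sketch computes a paired *t* statistic for one gene from matched tumor and normal measurements. It is not the procedure of Wachi, Yoneda and Wu (2005): the function name `paired_t_statistic`, the pairing of values across the five patients, and all expression values are assumptions made only for this example.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(tumor, normal):
    """Return the paired t statistic for matched tumor/normal measurements.

    Each position in `tumor` is paired with the same position in `normal`
    (for example, the same patient or the same replicate).
    """
    if len(tumor) != len(normal) or len(tumor) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [t - n for t, n in zip(tumor, normal)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical log2 expression values for a single gene in five patients.
tumor_log2 = [9.1, 8.7, 9.4, 8.9, 9.0]
normal_log2 = [7.9, 8.1, 8.2, 7.8, 8.0]

# The statistic is compared against a t distribution with n - 1 = 4 degrees of freedom.
t_value = paired_t_statistic(tumor_log2, normal_log2)
```

With only a handful of pairs the test has little power, which is one reason why the number of replicates matters for noisy microarray data.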

#### **5.3.1 Analysis of human protein-protein interactions (HPPI) and integration of array data into a predicted protein-protein interaction network (PPIN)**

Wachi, Yoneda and Wu (2005; WYU05) employed for their human cell data analysis a web-presented database (OPHID, April 25, 2005) of predicted interactions between human proteins (Brown and Jurisica, 2005), based on data for human and four other organisms, which included the intensely studied yeast and fruit fly. (OPHID is freely available to academic users at http://ophid.utoronto.ca). This protein interaction database listed 16,034 known human protein interactions obtained from various public protein interaction databases, as well as 23,889 additional, predicted interactions which are evaluated using protein domains, gene co-expression and Gene Ontology terms. The results can be visualized in OPHID using a customized graph visualization program. The data comprise literature-derived human PPI from BIND, HPRD and MINT, "with predictions made from *Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster* and *Mus musculus*". The genes in the WYU05 array were matched to those in OPHID using gene symbols and protein sequences. In this manner, 2,137 genes in the WYU05 microarray experiments were 'matched to the protein network from OPHID'. These predictions should, however, be thought of only as 'hypotheses' until they are experimentally validated. On the other hand, there is increasing evidence that at least certain PPIs may be conserved through evolution (Pagel et al 2004; Wuchty et al 2003). Recently, Sharan et al (2005) claimed that about 50% of the protein-protein interactions predicted by using *interologs* between microorganisms are also experimentally validated. The interologs approach might therefore play a role in the partial validation of the HT-Y2H protein network mapping without, however, necessarily achieving the claimed, global validation of the predicted (hypothetical) interactome. Differentially expressed genes (DEGs) from squamous cell carcinomas (SCCs) were then identified as discussed above, and their connectivity in the network graph was examined to determine their 'topological' properties, such as the edge distribution for DEGs in comparison with the surrounding graph subnetwork.
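The interolog idea mentioned above can be sketched in a few lines: an interaction observed between two proteins in a model organism is transferred to every pair of their human orthologs. This is only a minimal sketch and not the OPHID pipeline; the function name `predict_interologs`, the data layout, and all protein identifiers below are hypothetical.

```python
def predict_interologs(model_interactions, orthologs):
    """Transfer model-organism interactions to human ortholog pairs.

    model_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs: dict mapping a model-organism protein to its human orthologs.
    Returns a set of sorted (human_a, human_b) tuples, i.e. hypotheses only.
    """
    predicted = set()
    for a, b in model_interactions:
        for human_a in orthologs.get(a, ()):
            for human_b in orthologs.get(b, ()):
                if human_a != human_b:  # skip self-interactions in this sketch
                    predicted.add(tuple(sorted((human_a, human_b))))
    return predicted


# Hypothetical yeast interactions and yeast-to-human ortholog assignments.
yeast_interactions = [("yA", "yB"), ("yB", "yC"), ("yC", "yD")]
yeast_to_human = {"yA": ["H1"], "yB": ["H2", "H3"], "yC": ["H4"]}

predicted_human_ppi = predict_interologs(yeast_interactions, yeast_to_human)
```

Interactions involving a protein without a known ortholog (here `yD`) yield no prediction, which is one way such maps can miss real interactions.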

#### **5.3.2 Differentially expressed genes (DEGs): results for SCC of human lung**

The genes that are upregulated in SCC were found to exhibit a positive correlation (Pearson's r-coefficient of 0.82) with the number of edges associated with them (Fig. 1a of Wachi, Yoneda and Wu 2005), which was interpreted as indicating that DEGs that are upregulated in SCC are also highly connected. However, the downregulated genes were also reported to have a positive correlation (r = 0.75) with connectivity, albeit slightly lower (Fig. 1b of Wachi, Yoneda and Wu, 2005). On the other hand, the microarray probesets that matched the genes in the protein network (n = 2,137) had a negligible correlation coefficient (r = 0.06) with link number, which was taken to indicate that the genes on the test microarrays did not contribute to bias in the number of links for DEGs in SCC.
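For readers who wish to reproduce this kind of check on their own data, Pearson's r between an expression measure and the number of network edges per gene can be computed as in the sketch below. The function name `pearson_r` and the numbers are hypothetical; they are not the data of Wachi, Yoneda and Wu (2005).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical values: number of network edges and log2 fold change per gene.
edges_per_gene = [2, 5, 9, 14, 20, 31]
log2_fold_change = [0.4, 0.9, 1.1, 1.8, 2.0, 2.9]

r_up = pearson_r(edges_per_gene, log2_fold_change)
```

A value of r near 1 indicates that the more strongly changed genes also tend to have more edges; it does not by itself exclude bias in which genes are present in the network.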

A *k*-core analysis of DEGs in SCC of the human lung was also carried out (*loc. cit.*), which was reported to measure "*how close are the DEGs to the topological 'center' of the human PPI network*". Based on the *k*-core analysis, it was concluded that: "*the upregulated genes are more centrally located in the protein network than the down-regulated genes*". If duplicated and validated, such studies would be important, as the 'topological centrality' of the genes in the interactome was previously reported to be associated with the *essential* functions of the genes in yeast (Jeong et al 2001). Such essential genes are lethal when mutated, and also tend to have high connectivity. Moreover, other genes that are not essential in this sense, but provide a vital function in toxin metabolism, were reported to have a high number of edges associated with their nodes, but to be less well connected than the essential genes in yeast (Said et al 2004). Furthermore, a *k*-core analysis has also been performed on the yeast essential genes, and they were reported to be global hubs, whereas the non-essential genes were not hubs (Wuchty and Almaas, 2005). It was also claimed that these essential, global hubs are conserved throughout different species; however, one notes that, thus far, there are insufficient data and evidence to prove this claim, or hypothesis. Nevertheless, one may consider as a 'working hypothesis' that "*there should be a core set of genes that needs to be maintained throughout the course of somatic evolution in the tumor microenvironment*" (Wachi, Yoneda and Wu, 2005). This hypothesis is thus consistent with the *somatic evolution model* of cancer. Such conserved genes might be the 'essential genes' in cancer cells, and they may also be somewhat analogous to the global hub, essential genes reported in yeast (Wuchty and Almaas, 2005).
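To make the notion of *k*-core 'centrality' concrete, the sketch below assigns each node of an undirected network its core number by repeatedly removing a node of lowest remaining degree. It is a simple illustration under assumed inputs, not the implementation used in the cited studies; the function name `core_numbers` and the toy network are hypothetical.

```python
def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as edge pairs.

    A node has core number k if it belongs to the k-core (a maximal subgraph
    in which every node has at least k neighbours) but not to the (k+1)-core.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel off a node with the smallest remaining degree.
        node = min(remaining, key=degree.get)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# Hypothetical toy network: a triangle (A, B, C) with one peripheral node D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
toy_cores = core_numbers(toy_edges)
```

In the toy network the triangle nodes receive core number 2 and the peripheral node receives core number 1, so a higher core number corresponds to a more central position in this sense. The quadratic-time loop is adequate for a sketch; large networks would call for a bucket-based implementation.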
DEGs would thus be essential for the survival and proliferation of cancer cells in SCC of the human lung, and the upregulated genes would be centrally located in the protein network as well as have higher connectivity, perhaps suggesting their possible essential role(s) in human (SCC) lung cancer. As this is the first report of a predicted, hypothetical human cancer interactome network, one should definitely consider 'replicating' the reported studies and also evaluating such potentially important findings in the context of a complete human cancer interactome (differential) analysis. The possibility that DEGs might be essential for the survival and proliferation of cancer cells in SCC of the human lung has consequences far too important to be ignored; therefore, it must be thoroughly

Oncogenomics and Cancer Interactomics 495

techniques would need to be combined with epigenomic tools and analysis in order to gain an improved understanding of functional genomics and interactomics. Epigenomic tools and novel techniques begin to address the complex and varied needs of epigenetic studies, as well as their applications to controlling cell division and growth. Such tools are, therefore, potentially very important in medical areas such as cancer research and therapy, as well as for improving 'domestic' animal phenotypes *without* involving genomic modifications of the organism. This raises the interesting question if 'epigenomically controlled-growth organisms' (ECGOs) -- to be produced in the future-- would be still argued against by the same group of people who currently objects to GMOs, even though genetic modifications

**6.2 Novel tools in epigenomics: rapid and ultra-sensitive analyses of nucleic acid –**

Several novel techniques could also be applied for the highly-selective detection of epigenomic changes in mammalian cells related to diseases such as individual types of cancer (Jones and Laird, 1999; Plass, 2002) and Alzheimer disease. Such novel tools are likely to be utilized in a wide range of applications in biotechnology research related to Post-Genomics and Epigenomics. Tumor suppressor genes are transcriptionally silenced by *promoter hypermethylation* that also appears to lead to alterations in chromatin structure- a possible mechanism for such repression of the suppressor genes. In contrast to the genetic mutation or deletion mechanism of tumor suppressor gene inactivation, epigenetic inactivation of tumor suppressor genes would occur *via* methylation of specific DNA regions that could be prevented by DNA methyl-transferase or histone deacetylase inhibitors. Aberrant CpG--island methylation has non-random/tumor-type-specific patterns (Costello et al 2000). Such patterns can be identified by employing methylation- specific PCR (MS-PCR; Herman et al 1996), and can also be employed either for tumor class prediction by microarray-based DNA methylation analysis (Adorjan et al 2002) or for high-throughput microarray-based detection and analysis of methylated CpG islands (Yan et al 2002). Hypermethylation profiling is important for both accurate diagnosis and the development of optimal strategies in cancer therapy. Gene promoter hypermethylation has been reported in both tumors and serum of patients diagnosed with several types of cancer: head and neck cancers (Sanchez-Caspedes et al. 2000), nasopharyngeal carcinoma (Wong et al. 2002), non-small cell lung cancer (Belinsky et al 1998; An et al 2002), gastric carcinoma (Lee et al 2002), liver, prostate, bladder and colorectal cancers (Wong et al 1999; Jeronimo et al 2002). 
Substantial efforts are being made recently for the development of new methods and tools that are capable of sensitive and quantitative DNA methylation analysis, as well as early and accurate diagnosis of cancer. Among such tools are: Fluorescent methylation--specific polymerase chain reaction assay (FMS-PCR; Goessl et al 2000), SNIRF (Mahmood and Weissleder, 2003), indocyanine green-labeling (IGL) for human breast carcinomas (Ntziachristos et al 2000), ConLight-MSP (Rand et al 2002), COBRA (Xlong and Laird, 2002), Methylation-Sensitive *Single Nucleotide* Primer Extension (Ms-SnuPE; Gonzalgo and Jones, 1997), DNA microarray sensitive detection by Metal-Enhanced Fluorescence (MASD/MEF; Lakowicz, 2001; Malicka et al 2003 a, b)), and NIR Fluorescence Micro-Spectroscopy (NIRFMS), single cancer cell detection (Baianu et al 2004a). Specific molecular markers of cancer (Sidransky, 2002) hold the promise to identify those molecular signatures that are *unique* to specific types of cancer, and are essential for the *early accurate diagnosis* and treatment of

would be neither present nor traceable in such ECGO organisms?'

**protein interactions** 

investigated and also tested with sufficiently extensive, translational genomics and transcriptional databases that do not seem to be currently available (Han et al. 2006). **Further supporting analyses for this conjecture made by** Wachi, Yoneda and Wu (2005) are considered in the next section.
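To make the *k*-core measure discussed above more concrete, the following minimal Python sketch computes core numbers by iterative 'peeling' of the lowest-degree node in a small undirected network. It is only an illustration under stated assumptions: the protein names (P1 to P6), the edge list, the 'up-regulated'/'down-regulated' labels and the function name `core_numbers` are all hypothetical, and none of them is taken from Wachi, Yoneda and Wu (2005).

```python
from collections import defaultdict
from statistics import mean


def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as an edge list.

    A node has core number k if it belongs to the k-core (the maximal
    subgraph in which every node has at least k neighbours) but not to
    the (k+1)-core.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            adjacency[u].add(v)
            adjacency[v].add(u)

    degree = {node: len(neigh) for node, neigh in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# Hypothetical toy interactome: P1..P4 form a dense clique, P5 and P6 are peripheral.
TOY_EDGES = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P4"),
    ("P5", "P1"), ("P5", "P2"), ("P6", "P5"),
]
UP_REGULATED = ["P1", "P2", "P3"]   # assumed labels, for illustration only
DOWN_REGULATED = ["P5", "P6"]

cores = core_numbers(TOY_EDGES)
print("mean core number, up-regulated:  ", mean(cores[p] for p in UP_REGULATED))
print("mean core number, down-regulated:", mean(cores[p] for p in DOWN_REGULATED))
```

In this toy network the clique members receive a higher core number than the peripheral nodes, which is the sense in which a group of genes can be described as 'more centrally located'. A real analysis would additionally need a curated PPI network and a statistical test of the difference between groups.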

#### **5.4 Cancer proteins and the global topology of the human interactome network**

In a recent and extensive study (Jonsson and Bates, 2006), both cancer and non-cancer proteins were integrated into a validated protein-protein interaction (PPI) network, or interactome, of human proteins. In their report, the connectivity properties were investigated for all proteins previously shown to be modified as a result of mutations leading to cancer (Futreal et al 2004). A global protein-protein interaction network was then constructed by a homology-based method, which is claimed to predict protein-protein interactions accurately. It was then suggested that human proteins that are involved in cancer, or 'cancer proteins', exhibit a network topology which is substantially different from that of other proteins which are considered not to be involved in cancer. Notably, increased connectivity was pointed out for cancer proteins involved in the following subnetworks: cell growth and apoptosis-related, signal transduction (MAPK, TGF-beta, insulin, T-cell and B-cell receptor, adipocytokine, cytokine-cytokine interaction), cell motility/cytoskeleton, cell communication, adherens junction, focal adhesion, leukocyte migration, antigen processing and folding/sorting/degradation. Furthermore, it was proposed that such observations '*indicate an underlying evolutionary pressure to which cancer genes, as genes of central importance, are subjected.*' Linking these claims with previous proposals by Wuchty and Almaas (2005) that globally central proteins form an *evolutionary backbone* of the proteome and are *essential* to the organism (and also with the conjecture made by Wachi, Yoneda and Wu, 2005, discussed here in Section 5.3), Jonsson and Bates (2006) suggested that cancer proteins may generally be older than the non-cancer ones in evolutionary age.
Furthermore, they also suggested that the somatically mutated cancer proteins may be of somewhat younger evolutionary average age in comparison with those from the germline, as a consequence of the evolutionary selection pressure postulated to affect germline mutated proteins. Note also that the previous study of (SCC) human lung cancer by Wachi, Yoneda and Wu (2005) also reported increased interaction connectivity in differentially expressed proteins in human lung cancer tissues.
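The notion of 'increased connectivity' used in this section can be illustrated with a short Python sketch that compares the average number of interaction partners (the node degree) between two protein sets. This is a hedged illustration only: the interaction list, the protein labels (C1, C2 for 'cancer proteins'; N1 to N3 for the others) and the helper names `degree_table` and `mean_degree` are hypothetical and do not reproduce the data or the method of Jonsson and Bates (2006).

```python
from statistics import mean


def degree_table(interactions):
    """Count distinct interaction partners (the degree) of every protein."""
    # Treat A-B and B-A as the same interaction and drop self-interactions.
    unique_pairs = {frozenset(pair) for pair in interactions if pair[0] != pair[1]}
    degree = {}
    for pair in unique_pairs:
        for protein in pair:
            degree[protein] = degree.get(protein, 0) + 1
    return degree


def mean_degree(degree, proteins):
    """Average degree over a protein set; proteins absent from the network count as 0."""
    return mean(degree.get(p, 0) for p in proteins)


# Hypothetical data: C* stand for 'cancer proteins', N* for all other proteins.
TOY_INTERACTIONS = [
    ("C1", "C2"), ("C2", "C1"),   # the same interaction listed in both directions
    ("C1", "N1"), ("C1", "N2"),
    ("C2", "N1"), ("N2", "N3"),
]
CANCER = ["C1", "C2"]
NON_CANCER = ["N1", "N2", "N3"]

degree = degree_table(TOY_INTERACTIONS)
print("mean degree, cancer proteins:    ", mean_degree(degree, CANCER))
print("mean degree, non-cancer proteins:", mean_degree(degree, NON_CANCER))
```

Deduplicating the pairs before counting matters in practice, because interaction databases frequently list the same pair more than once; without that step the degree of well-studied proteins would be inflated.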

#### **6. Epigenomics in mammalian cells and multi-cellular organisms**

#### **6.1 Epigenetic controls**

Upon completion of the US Human Genome Mapping Project and related studies, it became increasingly evident that a sequence of 30,000 or so 'active' genes that encode and direct the biosynthesis of specific proteins could not possibly exhaust the control mechanisms present in either normal or abnormal cells (such as, for example, cancer cells). This is even more obvious in the case of developing embryos or regenerating organs. Subsequently, more than 120,000 genes were suggested to be active in the human genome (*Nature, 2004*). Furthermore, specific control mechanisms of cellular phenotypes and processes were recently proposed that involve *epigenetic* controls, such as the specific acetylation **<=>** deacetylation reactions of DNA-bound histones (for an overview article on epigenomics see, for example, *Scientific American* 2003, December issue). Such controls intervene from outside the genome, but ultimately they also affect gene expression. Therefore, gene profiling techniques would need to be combined with epigenomic tools and analysis in order to gain an improved understanding of functional genomics and interactomics. Epigenomic tools and novel techniques begin to address the complex and varied needs of epigenetic studies, as well as their applications to controlling cell division and growth. Such tools are, therefore, potentially very important in medical areas such as cancer research and therapy, as well as for improving 'domestic' animal phenotypes *without* involving genomic modifications of the organism. This raises the interesting question of whether 'epigenomically controlled-growth organisms' (ECGOs), to be produced in the future, would still be argued against by the same groups of people who currently object to GMOs, even though genetic modifications would be neither present nor traceable in such ECGOs.

#### **6.2 Novel tools in epigenomics: rapid and ultra-sensitive analyses of nucleic acid – protein interactions**

Several novel techniques could also be applied for the highly selective detection of epigenomic changes in mammalian cells related to diseases such as individual types of cancer (Jones and Laird, 1999; Plass, 2002) and Alzheimer's disease. Such novel tools are likely to be utilized in a wide range of applications in biotechnology research related to Post-Genomics and Epigenomics. Tumor suppressor genes can be transcriptionally silenced by *promoter hypermethylation*, which also appears to lead to alterations in chromatin structure, a possible mechanism for such repression of the suppressor genes. In contrast to the genetic mutation or deletion mechanism of tumor suppressor gene inactivation, epigenetic inactivation of tumor suppressor genes would occur *via* methylation of specific DNA regions, which could be prevented by DNA methyl-transferase or histone deacetylase inhibitors. Aberrant CpG-island methylation has non-random, tumor-type-specific patterns (Costello et al 2000). Such patterns can be identified by employing methylation-specific PCR (MS-PCR; Herman et al 1996), and can also be employed either for tumor class prediction by microarray-based DNA methylation analysis (Adorjan et al 2002) or for high-throughput microarray-based detection and analysis of methylated CpG islands (Yan et al 2002). Hypermethylation profiling is important for both accurate diagnosis and the development of optimal strategies in cancer therapy. Gene promoter hypermethylation has been reported in both tumors and serum of patients diagnosed with several types of cancer: head and neck cancers (Sanchez-Cespedes et al. 2000), nasopharyngeal carcinoma (Wong et al. 2002), non-small cell lung cancer (Belinsky et al 1998; An et al 2002), gastric carcinoma (Lee et al 2002), liver, prostate, bladder and colorectal cancers (Wong et al 1999; Jeronimo et al 2002).
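The passage above refers repeatedly to CpG islands without stating how such regions are recognized in a sequence. As a hedged illustration, the Python sketch below computes the two quantities used in one widely cited sequence-based definition (attributed to Gardiner-Garden and Frommer, and not part of this chapter): the GC fraction and the observed/expected CpG ratio. The default thresholds (at least 200 bp, GC fraction of at least 0.5, ratio of at least 0.6) follow that convention approximately and should be read as assumptions; the function names are illustrative, and the sketch says nothing about methylation status, which must be measured experimentally by techniques such as those cited above.

```python
def cpg_statistics(sequence):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_fraction = (n_c + n_g) / length
    # Expected CpG count if C and G were placed independently: n_c * n_g / length.
    obs_exp = (n_cpg * length) / (n_c * n_g) if n_c and n_g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(sequence, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-fraction and obs/exp thresholds to one window."""
    gc_fraction, obs_exp = cpg_statistics(sequence)
    return (
        len(sequence) >= min_length
        and gc_fraction >= min_gc
        and obs_exp >= min_obs_exp
    )


# Artificial test sequences, for illustration only.
print(cpg_statistics("CGCGCGCG"))
print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

In practice such a test would be applied to sliding windows along a promoter region, and the windows that pass would then be the candidates examined for aberrant methylation.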
Substantial efforts have recently been made to develop new methods and tools capable of sensitive and quantitative DNA methylation analysis, as well as early and accurate diagnosis of cancer. Among such tools are: the fluorescent methylation-specific polymerase chain reaction assay (FMS-PCR; Goessl et al 2000), SNIRF (Mahmood and Weissleder, 2003), indocyanine green-labeling (IGL) for human breast carcinomas (Ntziachristos et al 2000), ConLight-MSP (Rand et al 2002), COBRA (Xiong and Laird, 2002), Methylation-Sensitive *Single Nucleotide* Primer Extension (Ms-SnuPE; Gonzalgo and Jones, 1997), DNA microarray sensitive detection by Metal-Enhanced Fluorescence (MASD/MEF; Lakowicz, 2001; Malicka et al 2003 a, b), and NIR Fluorescence Micro-Spectroscopy (NIRFMS) for single cancer cell detection (Baianu et al 2004a). Specific molecular markers of cancer (Sidransky, 2002) hold the promise of identifying those molecular signatures that are *unique* to specific types of cancer, and are essential for the *early accurate diagnosis* and treatment of cancer. Such novel molecular tools and methodologies could be employed to rapidly and accurately identify molecular signatures of cancer and aging-related diseases in mammalian cells in culture, in order to determine how specific epigenomic mechanisms involved in the control of cell division and apoptosis operate throughout the cell cycle. Among the specific epigenomic control mechanisms that one could investigate with such new tools are: CpG-island methylation, p15 (INK4b) and p16 (INK4a) hyper-methylation (in synchronous hepatic carcinoma cells), GSTP1 methylation in non-neoplastic/synchronous cells, as well as histone deacetylation and its effects on histone-nucleic acid interactions in stable synchronous cell populations in culture. Both cancer and aging were reported to involve DNA methylation of specific genome regions (van Helden & van Helden, 1989; Ahuja et al 1998). Gene expression profiling and epigenomic testing could be carried out with both ultra-sensitive, novel human and mouse microarrays. Powerful spectroscopic and microspectroscopic techniques can then be employed for the analysis and further improvement of such tools for the investigation of nucleic acid-protein interactions.

High-field 2D NMR of protein-protein and protein-nucleic acid interactions

Ms-SnuPE; FMS-PCR; Lux™ Fluorogenic Primers\*/ RT-PCR\*, MyArray™ DNA-Human\*, GeneFilters® Human Regular Arrays\*\*.

Specific Knock-out or silencing shRNAi's (SuperArray™).

NIR Chemical Imaging of protein clusters in cells and single cancer cells in tissue;

MEF and FCS/FCCS/FRET detection of single molecules amplified-ELISA; NASBA

NIR-FMS; SNIRF

\*\*The testing of these new tools can be carried out for example with stable and synchronous mammalian (human HeLa and mouse) cells in culture.

Table 1. Techniques under Development and Related Applications that are commercially supported \*,\*\*.

#### **7. Conclusions and discussion**

Novel translational oncogenomics research is rapidly expanding with a view to the application of new technologies, findings and computational models in both pharmaceutical and clinical areas. Sample analyses in recent clinical studies have shown that gene expression data can be employed to distinguish between tumor types as well as to predict outcomes. Important potential applications of such results are *individualized* human cancer therapy (Pharmacogenomics) and 'personalized medicine'. There is clearly a need for individualized cancer therapy strategies based on high-throughput microarray information recorded for isolated tumor cell lines from stage I through stage III cancer patients. Studies of Differential Gene Expression in human cancer cell lines are clearly required for developing new strategies for efficient cancer therapies for patients whose tumors have developed resistance to existing therapies. Such gene expression profiling, proteomic, interactomic and tissue array data are essential for improving the survival rate of stage III cancer patients undergoing clinical trials with novel signaling pathway inhibitors/blocker medicines, such as those discussed in some detail in **Section 5**. Several technologies aimed at future applications in oncogenesis are currently under development, both in the direction of improved detection sensitivity and increased time resolution of cellular events, with the limits of single molecule detection and picosecond time resolution already being reached (Sections 2.5, 2.6 and 6.2). The urgency for funding and carrying out the complete mapping of a human cancer interactome with the help of such novel, high-efficiency/low-cost and ultra-sensitive techniques is pointed out for the first time in the context of recent findings by translational oncogenomics and human cancer interactome predictions.

#### **8. Acknowledgments**

The author gratefully acknowledges receiving helpful suggestions, pertinent documentation and critical comments from: Dr. Mark Band, Director of the Genotyping/Transcriptomics Unit at the Keck Center, Dr. Lei Liu, Director of BioInformatics Unit at the Keck Center, Professor Schuyler Korban, and Professor James F. Glazebrook of the Mathematics Dept. at UIUC and Eastern Illinois University, respectively. This research was partially supported by Renessen Co., the IMBA Consortium, a USDA Hatch Grant No. ILLU-0995362 and AES at UIUC.

#### **9. References**

[1] Adams J.; Palombella VJ; Sausville EA, et al. (1999). Proteasome inhibitors: a novel class of potent and effective antitumor agents. *Cancer Research,* Vol. 59: 2615-22.

[2] Adjei, A.A.; Erlichman C.; Davis JN; Cutler DL, Sloan JA, et al. (2000). A phase I trial of the farnesyl transferase inhibitor SCH66336: evidence for biological and clinical activity. *Cancer Res.,* 60: 1871-77.

[3] Aghajanian C.; Soignet S, Dizon DS, et al. (2001). A phase I trial of the novel proteasome inhibitor PS341 in advanced solid tumor malignancies. *Proc. Amer. Soc. Clin. Oncol.,* Vol. 20: 338 (*Abstr*.).

[4] Alle KM.; Henshall SM, Field AS, Sutherland RL. (1998). Cyclin D1 protein is overexpressed in hyperplasia and intraductal carcinoma of the breast. *Clin. Cancer Res.,* 4: 847–854.

[5] Alizadeh AA.; Eisen MB, Davis RE et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. *Nature*, Vol. 403: 503-11.

[6] Amundson, S.A.; et al. (2000). An Informatics Approach Identifying Markers of Chemosensitivity in Human Cancer Cell Lines. *Cancer Research*, Vol. 60: 6101-110.

[7] Andersen G, Busso D, Poterszman A, et al. (1997). The structure of cyclin H: common mode of kinase activation and specific features. *EMBO J.,* 16(5): 958–67.

[8] Anbazhagan, R.; Tihan, T, Bornman DM, et al. (1999). Classification of small cell lung cancer and pulmonary carcinoid by gene expression profiles. *Cancer Research,* Vol. 59: 5119-22.

[9] Akinaga S; Sugiyama K, Akiyama T. (2000). UCN-01 (7-hydroxystaurosporine) and other indolocarbazole compounds: a new generation of anti-cancer agents for the new century? *Anticancer Drug Des.,* 15: 43-52.

[10] Akiyama T.; Yoshida T, Tsujita T, et al. (1997). G1 phase accumulation induced by UCN-01 is associated with dephosphorylation of Rb and CDK2 proteins as well as induction of CDK inhibitor p21/Cip1/WAF1/sdi1 in p53-mutated human epidermoid carcinoma A431 cells. *Cancer Res.,* Vol. 57: 1495-501.

[11] An, WG.; Hwang SG, Trepel JB, Blagosklonny MV. (2000). Protease inhibitor-induced apoptosis: accumulation of wt p53, p21WAF1/CIP1, and induction of apoptosis are independent markers of proteasome inhibition. *Leukemia,* 14: 1276-83.

[12] Arguello F.; Alexander M, Sterry JA, et al. (1998). Flavopiridol induces apoptosis of normal lymphoid cells, causes immunosuppression, and has potent antitumor activity in vivo against human leukemia and lymphoma xenografts. *Blood,* 91: 2482-90.

[13] Ashburner M.; et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, *Nature Genetics,* Vol. 25: 25-29.

[14] Bagatolli LA.; Gratton, E. (2000). Two-photon fluorescence microscopy of coexisting lipid domains in giant unilamellar vesicles of binary phospholipid mixtures. *Biophys. J.,* 78: 290-305.

[15] Bagui, TK, Jackson RJ, Agrawal D, and Pledger WJ. (2000). Analysis of cyclin D3-cdk4 complexes in fibroblasts expressing and lacking p27kip1 and p21cip1. *Mol. Cell. Biol*., 20: 8748–57.

[16] Baianu, I. (1969). Theoretical and Experimental Models of Carcinogenesis. Medical Biophysics Dept., School of Medicine & School of Physics, Univ. Bucharest, *M.S. Thesis*. pp. 1-191.

[17] Baianu, I. (1971). Organismic Structures and Qualitative Dynamics of Systems. *Bulletin of Mathematical Biophysics*, Vol. 33: 339-53.

[18] Baianu IC. (1977). A Logical Model of Genetic Activities in Łukasiewicz Algebras: The Non-linear Theory. *Bull. Mathematical Biology*, Vol. 39: 249-58.

[19] Baianu IC. (1980). Natural Transformations of Organismic Structures, *Bull. Math. Biology,* Vol. 42: 431-446.

[20] Baianu I.C. (1983). Natural Transformation Models in Molecular Biology, In: *Proceedings of the SIAM Natl. Meet*., Denver, CO.; *Eprint*: Available from: http://cogprints.org/3675/; http://cogprints.org/3675/01/Naturaltransfmolbionu6.pdf

[21] Baianu IC. (1984). A Molecular-Set-Variable Model of Structural and Regulatory Activities in Metabolic and Genetic Networks. *FASEB Proceedings*, Vol. 43: 917.

[22] Baianu IC. (1987a). Computer Models and Automata Theory in Biology and Medicine. In: M. Witten (ed.), *Mathematical Models in Medicine*, Vol. 7, New York: Pergamon Press. p. 1513-77; *CERN Preprint No. EXT-2004-072*, Available from: http://doc.cern.ch//archive/electronic/other/ext/ext-2004-072.pdf

[23] Baianu IC. (2004a). Interactomics and Cancer Mechanisms, *Bioline Preprint No. 00001978*. p. 1-19. Available from: http://cogprints.org/3810/; http://bioline.utsc.utoronto.ca/archive/00001978/

[24] Baianu I.C. (2004b). Complex Systems Analysis of Cell Cycling Models in Carcinogenesis: II. Cell Genome and Interactome, Neoplastic Non-random Transformation Models in Topoi with Łukasiewicz-Logic and MV Algebras. *CERN Preprint Archive,* EXT-2004-065. p. 1-16. Available from: http://doc.cern.ch/archive/electronic/other/ext/ext--04065/ANeuralGenNetworkLuknTopos\_oknu4.pdf

[25] Baianu IC. (2004c). Molecular Models of Genetic and Organismic Structures. *CERN Preprint Archive, EXT-2004-067,* p. 1-9. Available from: http://doc.cern.ch/archive/electronic/other/ext/ext-2004-067/MolecularModelsICB3.doc

[26] Baianu IC., (Editor). (2006). Complex Systems Biology and Life's Logics, *Axiomathes*, Vol. 16: 1-243. Springer: Dordrecht, The Netherlands.

[27] Baianu IC.; Kumosinski TF, Bechtel P.J, et al. (1988). NMR Studies of Chemical Activity and Protein-Protein Interactions in Solutions and Hydrated Powders. In: *Proceed. 196th National Meeting of the American Chemical Society, Division of Agricultural and Food Chemistry.* American Chemical Society, p. 156.

[28] Baianu IC.; Ozu EM, Wei TC, et al. (1993). Molecular Dynamics and NMR Studies of Ion-Ion Interactions in Concentrated Electrolytes with Dipoles in Water. In: *Molecular Modeling*. ACS Symp. Ser. # 576. Kumosinski TF and Liebman M, Eds. Washington, DC: American Chemical Society. p. 269-324.

[29] Baianu IC.; Costescu D, You T, et al. (2004a). Near Infrared, Fluorescence Microspectroscopy, Infrared Chemical Imaging and High-Resolution NMR Analysis of Soybean Seeds, Somatic Embryos and Single Cancer Cells. Ch. 12 In: *Oil Extraction and Analysis*, Luthria DL, Ed.; AOCS Press: Champaign, Illinois, USA, pp. 241-273.

[30] Baianu IC.; Costescu D, Hoffman NE, et al. (2004). Fourier Transform Near Infrared Microspectroscopy, Infrared Chemical Imaging, High-Resolution Nuclear Magnetic Resonance and Fluorescence Microspectroscopy Detection of Single Cancer Cells and Single Viral Particles. *CERN Preprints Archive, EXT-2004-069*, pp. 1-20. Available from: http://doc.cern.ch//archive/electronic/other/ext/ext-2004-069.pdf

[31] Baianu IC.; Brown R, Georgescu G, and Glazebrook, JF. (2006). Complex Non-Linear Biodynamics in Categories, Higher Dimensional Algebra and LM-Topos: Transformations of Neuronal, Genetic and Neoplastic Networks. *Axiomathes,* Vol. 16: 65-122.

[32] Barabasi AL.; & Oltvai ZN. (2004). Network biology: understanding the cell's functional organization. *Nature Reviews Genetics*, Vol. 5: 101–13.

[33] Barco, A; Alarcon, JM and Kandel, E. R. (2002). Expression of constitutively active CREB protein facilitates the late phase of long-term potentiation by enhancing synaptic capture. *Cell,* Vol. 108: 689-703.

[34] Baselga J.; Herbst R, LoRusso P, et al. (2000). Continuous administration of ZD1839 (*Iressa*), a novel oral epidermal growth factor receptor tyrosine kinase inhibitor (EGFR-TKI) in patients with five selected tumor types: evidence of activity and good tolerability. *Proceedings American Society of Clinical Oncology,* Vol. 19: 686 (Abstr.).

[35] Baselga J.; Tripathy D, Mendelsohn J, et al. (1996). Phase II study of weekly intravenous recombinant humanized anti-p185HER2 monoclonal antibody in patients with HER2/neu-overexpressing metastatic breast cancer. *Journal of Clinical Oncology,* Vol. 14: 737-44.

[36] Becker J. (2004). Signal transduction inhibitors: a work in progress. *Nature Biotechnology*, Vol. 22 (1): 15-18.


Oncogenomics and Cancer Interactomics 501

[55] Claverie JM. (1999). Computational methods for the identification of differential and

[58] Decker T.; Hipp S, Schneller F, et al. (2001). Rapamycin induces G1 arrest and inhibits

[59] DeRisi, JL, et al. (1996). Use of a cDNA microarray to analyse gene expression patterns

[60] DeRisi JL; et al. (1997). Exploring the metabolic and genetic control of gene expression

[61] de Jung H.; Gouze J-L., Hernandez, C, Page M. et al. (2004). Qualitative simulation of

[62] de Jung H.; and Page M. (2000). Qualitative simulation of large and complex genetic

[63] Diaspro A.; & Robello, M. 1999. Multi-photon Excitation Microscopy to Study

[64] Dobashi Y.; Goto A, Fukayama, M, et al. (2004). Overexpression of Cdk4/Cyclin D1, a

[65] Dobashi Y.; Jiang SX, Shoji M, et al. (2003). Diversity in expression and prognostic

[66] Drexler H.C. (1997). Activation of the cell death program by inhibition of proteasome

[67] Drees M.; Dengler WA, Roth T, et al. (1997). Flavopiridol (L86-8275): selective

[68] Dudoit S.; et al. (2003). Open source software for the analysis of microarray data.

[69] Dunker AK.; et al. (2005). Flexible Nets. The roles of intrinsic disorder in protein

[70] Eigen M.; & Rigler R. (1994). Sorting single molecules: Applications to diagnostics and

[71] Elson E.L.; & Magde D. (1974). Fluorescence correlation spectroscopy. I: Conceptual

[72] End D.W.; Smets, G, Todd, AV, et al. (2001). Characterization of the antitumor effects of

[73] Erlichman C.; Adjei AA, Thomas JP, et al. (2001). A phase I trial of the proteasome

the selective farnesyl protein transferase inhibitor R115777 in vivo and in vitro.

inhibitor PS-341 in patients with advanced cancer. *Proc. Am. Soc. Clin. Oncol.,* 20:

p70S6 kinase in proliferating B-CLL cells: cyclin D3 and cyclin E as molecular

genetic regulatory networks using piecewise-linear models. *Bull. Math. Biology*,

regulatory systems. In W. Horn (ed.), *Proc. 14th Europ. Conf. AI. (ECAI 2000)*, pp.141-

possible mediator of apoptosis and an indicator of prognosis in human primary

significance of G1/S cyclins in human primary lung carcinomas. *J. Pathol*, 199: 208-

antitumor activity *in vitro* and activity *in vivo* for prostate carcinoma cells. *Clin.* 

coordinated gene expression. *Human Mol. Genetics*, 8: 1821–32. [56] Compton J. (1991). Nucleic acid sequence-based amplification. *Nature,* 350: 91-92. [57] Costello JF, et al. (2000)*.* Aberrant CpG-island methylation has non-random and

tumour-type-specific patterns. *Nature Genet.,* 24: 132-38.

in human cancer. *Nature Genetics*, Vol. 14: 457–460.

Biosystems. *European Microscopy and Analysis,* 5: 5-7.

lung carcinoma. *Intl. J. Cancer,* 110 : 532-541.

interaction networks. *FEBS J*. 272: 5129-5148.

basis and theory. *Biopolymers,* Vol*.* 13: 1.

evolutionary biotechnology. *PNAS-USA,* 91: 5740-43.

function. *PNAS-USA*, 94: 855-60.

*Cancer Res.* Vol. 3: 273-79.

*Biotechniques*, Suppl, 45–51.

*Cancer Res.,* 61: 131-37.

337 (Abstr.).

on a genomic scale. *Science,* 278: 680-686.

targets. *Blood,* 98: 632 (Abstr.).

66(2): 301-340.

145, IOS Press.

220.


[37] Belov L.; de la Vega, O, dos Remedios CG, et al. (2001). Immunophenotyping of

[38] Berchuck A.; Heron KA, Carney ME et al. 1998. Frequency of germline and somatic

[39] Bishop WR, Bond R, Petrin J, et al. (1995). Biochemical characterization and inhibition of *Ras* modification in transfected Cos cells. *J. Biol. Chem.,* 270: 30611-18. [40] Bittner, W. et al.(2000 a,b). Gene-Expression Profiles in Hereditary Breast Cancer. *N.* 

[41] Blaschek, H. P. (1996). Recent Develpoments in the Genetic Manipulation of

[42] Brown, PA; & Botstein, D. (1999). Exploring the new world of the genome with DNA

[43] Brown, P.; &Wouters, BG. (1999). Apoptosis, p53, and Tumor Cell Sensitivity to

[44] Brown, K.R, and Jurisica, I. (2005). Online predicted human interaction database-

[45] Bryja V, Pachernik J, Faldikova L, et al. (2004). The role of p27(Kip1) in maintaining the

[46] Bubendorf L, Kononen J, Koivisto P, et al. (1999). Survey of gene amplifications during

[47] Cheng M, Olivier P, Diehl JA, et al. (1999). The p21Cip1 and p27kip1 'inhibitors' are

[48] Bunch RT, Eastman A. (1996). Enhancement of cisplatin-induced cytotoxicity by 7-

[49] Carter P, Presta L, Gorman CM, et al. (1992). Humanization of an anti-p185HER2

[50] Carlson BA, Dubay MM, Sausville EA, et al. (1996). Flavopiridol induces G1 arrest with

[51] Chee M, et al. (1996). Accessing genetic information with high-density DNA arrays.

[52] Chen X, Lowe M, Keyomarsi K. (1999). UCN-01 mediated G1 arrest in normal but not tumor breast cells is pRb-dependent and p53-independent. *Oncogene,* 18: 5691-702. [53] Ciardiello F, Caputo R, Bianco R, et al. (2000). Antitumor effect and potentiation of

[54] Clarke FC, Jee DR, Moffat AC, Hammond SV. (2001). Effective sample volume for measurements by NIR Microscopy. (Abstract), *British Pharmaceutical Conference*.

prostate cancer progression by high-throughput fluorescence *in situ* hybridization

essential activators of cyclin D-dependent kinases in murine fibroblasts. *EMBO J*,

hydroxystaurosporine (UCN-01), a new G2-checkpoint inhibitor. *Clin. Cancer Res.,*

inhibition of cyclin-dependent kinase (CDK) 2 and CDK4 in human breast

cytotoxic drugs activity in human cancer cells by ZD-1839 (Iressa), an epidermal growth factor receptor-selective tyrosine kinase inhibitor. *Clin. Cancer Res.,* 6: 2053-

levels of D-type cyclins *in vivo*. *Biochim Biophys Acta*, 3: 1691-96.

antibody for human cancer therapy. *PNAS*-*USA,* 89: 4285-89.

Microorganims for biotechnology applications. *In*: Baianu I.C.; Pessen H, and Kumosinski TF, Eds. *Physical Chemistry of Food Processes*. Vol 2. New York: Van

BRCA1 mutations in ovarian cancer. *Clin. Cancer Res*, 4: 2433-37.

*Engl. J. Med.*, 344 (26): 2028-29; (2001) Vol. 345 (8): 628.

microarrays. *Nature Genetics*, 21 (Suppl.): 33-37.

(OPHID). *Bioinformatics*, 21: 2076–2082.

on tissue microarrays. *Cancer Res*., 59: 803-6.

carcinoma cells. *Cancer Res.,* 56: 2973-78.

Anticancer Agents*. Cancer Res*., Vol. 59: 1391-1399.

Nostrand Reinhold. p. 459-74.

61: 4483-89.

18: 1571– 83.

*Science,* 274: 610-614.

2:791-97.

63.

leukemias using a cluster of differentiation antibody microarray. *Cancer Res., Vol.*


Oncogenomics and Cancer Interactomics 503

[93] Han J-D, et al. (2004). Evidence for dynamically organized modularity in the yeast

[94] Harkin PD. (2002). Uncovering Functionally Relevant Signaling Pathways Using Microarray-Based Expression Profiling. *The Oncologist*, 5(6): 501-507. [95] Harkin DP, Bean JM, Miklos D et al. (1999). Induction of GADD45 and JNK/SAPKdependent apoptosis following inducible expression of BRCA1. *Cell*, 97: 575- 86. [96] Hashemolhosseini S, Nagamine Y, Morley SJ, et al. (1998). Rapamycin inhibition of the

[97] Herman JG, et al. (1996). Methylation-specific PCR: a novel PCR assay for methylation

[98] Ho Y, et al. (2002). Systematic identification of protein complexes in *Saccharomyces*

[99] Hughes TR, et al. (2000). Functional discovery *via* a compendium of expression profiles.

[100] Ideker T, et al. (2001). A new approach to decoding life: systems biology. *Annu. Rev.* 

[101] Ideker T, et al. (2002). Discovering regulatory and signaling circuits in molecular

[102] Irizarry RA, et al. (2003). Summaries of Affymetrix GeneChip probe level data. *Nucleic* 

[103] Ito T, et al. (2001). A comprehensive two-hybrid analysis to explore the yeast protein

[104] Jain KK. (2000). Applications of proteomics in oncology. *Pharmacogenomics*, 1: 385-93. [105] Jeong H, et al. (2001). Lethality and centrality in protein networks. *Nature*, 415: 180-3. [106] Jones PA, Laird PW. (1999). Cancer epigenetics comes of age. *Nature Genet.,* 21: 163-

[107] Jonsson, P.F. and Bates, P.A. (2006). Global topological features of cancer proteins in

[108] Jonsson, P.F. et al. (2006). Cluster analysis of networks generated through homology:

[109] Johnston SR, Ellis PA, Houston S, Hickish T, Howes AJ, et al. (2000). A phase II study

[110] Jung CP, Motwani MV, Schwartz GK. (2001). Flavopiridol increases sensitization to

[112] Kaur G, Stetler-Stevenson M, Sebers S, et al.(1992). Growth inhibition with reversible

automatic identification of important protein communities involved in cancer

of the farnesyl transferase inhibitor R115777 in patients with advanced breast

gemcitabine in human gastrointestinal cancer cell lines and correlates with downregulation of ribonucleotide reductase M2 subunit. *Clin. Cancer Res.,* 7: 2527-36. [111] Kabelka EA.; Diers BW, Fehr WR, LeRoy AR, Baianu IC, et al. (2003). Identification of

putative yield enhancing quantitative trait loci from exotic soybean germplasm,

cell cycle arrest of carcinoma cells by flavone L86-8275. *J. Natl. Cancer Inst.,* Vol. 84

G1 to S transition is mediated by effects on cyclin D1 mRNA and protein stability *J.* 

protein-protein interaction network. *Nature*, 430: 88-93.

status of CpG islands. *PNAS-USA,* 93: 9821-26.

*cerevisiae* by mass spectrometry. *Nature*, 415: 180-183.

interaction networks. *Bioinformatics*, 18: S233-S240.

Interactome., *Proc. Natl. Acad. Sci. USA*., 98: 4569-4574.

the human Interactome. *Bioinformatics*, 22 (18): 2291-97.

*Biol. Chem.*, 273: 14424-29.

*Genomics Human Genet*., 2: 343–72.

metastasis. *BMC Bioinformatics*, 7: 2.

*Crop Sci.,* 42*:* 149-162.

1736-40.

cancer. *Proc. Amer.. Soc. Clin. Oncol.,* 19: 318.

*Cell*, 102: 109-126.

*Acids Res.*, 31: e15.

167.


[74] Esteller M.; et al. (2002). CpG island hypermethylation and tumor suppressor genes: a

[75] Ferry D, Hammond L, Ranson M, et al. (2000). Intermittent oral ZD1839 (Iressa), a novel

[76] Formstecher E, et al. (2005). Protein interaction mapping: a *Drosophila* case study.

[77] Fraser HB et al.. (2005). Evolutionary rate in the protein interaction network. *Science*,

[78] Fukuse T, Hirata T, Naiki H, et al. (2000). Prognostic significance of cyclin E overexpression in resected non-small cell lung cancer. *Cancer Res*., 60: 242-4. [79] Futreal P, Liu Q, Shattuck-Eidens D et al. (1994). BRCA1 mutations in primary breast

[80] Furteal PA, et al.(2004). A census of human cancer genes. *Nature Rev. Cancer*, 4:177-183. [81] Galfalvy HC, et al. (2003). Sex genes for genomic analysis in human brain: internal controls for comparison of probe level data extraction. *BMC Bioinformatics*, 4: 37. [82] Gavin AC, et al. (2002). Functional organization of the yeast proteome by systematic

[83] Georgescu, G.; (2006). N-valued Logics and Łukasiewicz--Moisil Algebras. *Axiomathes*,

[84] Gillett C.; Fantl V, Smith R, Fisher C, et al*.* (1994). Amplification and over-expression of

[85] Giot L.; et al. (2002). A Protein Interaction Map of *Drosophila melanogaster*., *Science*, 302:

[86] Glass L and Kauffman, S.A. (1973). The logical analysis of continuous non-linear

[87] Golub TR.; Slonim, DK, Tamayo P, et al*.* (1999). Molecular classification of cancer: Class

[88] Gonzalgo, M. & Jones, P. (1997). Rapid quantitation of methylation differences at

[89] Gowen L, Avrutskaya AV, Latour AM et al. (1998). BRCA1 required for transcription-

[90] Gray, J.W. et al (1998): High-resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. *Nature Genetics*, 20: 207-211. [91] Hamilton AL, Eder JP, Pavlick AC, et al. (2001). PS-341: phase I study of a novel

[92] Han J-DJ, Dupuy D, Bertin N, et al. (2005). Effect of sampling on the topology

coupled repair of oxidative DNA damage. *Science*, 281: 1009-12.

biochemical control networks. *J. Theor. Biology*, 39: 103-129.

SnuPE). *Nucleic Acids Res.,* 25: 2529-31.

*Oncol.,* 20:336 (Abstr.).

cyclin D1 in breast cancer detected by immunohistochemical staining. *Cancer Res.,* 

discovery and class prediction by gene expression monitoring. *Science*, Vol. 286:

specific sites using methylation-sensitive single nucleotide primer extension Ms-

proteasome inhibitor with pharmacodynamic endpoints. *Proc. Amer. Soc. Clin.* 

predictions of protein-protein interaction networks. *Nature Biotechnology*, 23(7): 839-

epidermal growth factor receptor tyrosine kinase inhibitor (EGFR-TKI), shows evidence of good tolerability and activity: final results from phase I study. *Proc.* 

booming present, a brighter future. *Oncogene,* 21: 5427- 40.

*Amer. Soc. Clin. Oncol.,* 19: 5 (Abstr.).

and ovarian carcinomas. *Science*, 266: 120-2.

analysis of protein complexes. *Nature,* 415: 141-147.

*Genome Res*., Vol.15: 376-384.

Vol. 296: 750-52.

Vol.16:123-136.

54: 1812–17.

1727-36.

531-537.

844.


Oncogenomics and Cancer Interactomics 505

[132] Loden M., Sighall M, Nielsen NH, et al*.* (2002). The cyclin D1 high and cyclin E high

[133] Malicka J et al. (2003). DNA hybridization assays using metal-enhanced fluorescence.

[134] Malumbres M, and Barbacid M. (2001). To cycle or not to cycle: A critical decision in

[135] Matthews LR, et al. (2001) Identification of potential interaction networks using

[136] Mendel DB, Schreck RE, West DC, et al. (2000). The angiogenesis inhibitor SU5416 has

[137] Mendelsohn, AR and Brent R. (1999). Protein interaction methods—toward an

[138] Miettinen HE, Jarvinen TA, Kellner U, et al. (2000). High topoisomerase II-alpha

[139] Mohammadi M, McMahon G, Sun L, et al. (1997). Structures of the tyrosine kinase

[140] Mohr S.; Leikauf GD, Keith G and Rihn BH. (2002). Microarrays as Cancer Keys: An

[141] Mollinedo F, Martinez-Dalmau R, Modolell M. (1993). Early and selective induction of

[142] Monteiro ANA, August A, Hanafusa H. (1996). Evidence for a transcriptional

[144] Motwani M, Delohery TM, Schwartz GK. (1999). Sequential dependent enhancement

[145] Motwani M, Jung C, Sirotnak FM, et al. (2001). Augmentation of apoptosis and tumor

[146] Moyer JD, Barbacci EG, Iwata KK, et al. (1997). Induction of apoptosis and cell cycle

[147] Muraoka RS, Lenferink AEG, Simpson J, et al. (2001). Cyclin-dependent kinase

[148] Noguchi T, Dobashi Y, Minehara H, et al. (2000). Involvement of cyclins in cell

oligodendrogliomas. *Neuropathol. Appl. Neurobiol*., 26: 504-12.

Array of Possibilities., *J. Clinical Oncol.*, 20(14): 3165-75.

[143] Morgan, DO. (1995). Principles of CDK regulation. *Nature*, 374:131-4.

gastric and breast cancer cells. *Clin. Cancer Res.,* 5: 1876-83.

monolayers and xenografts. *Clin. Cancer Res.,* 7: 4209-19.

*Biochem. Biophys. Res. Commun.*, 192: 603-9.

kinase. *Cancer Res.,* 57: 4838-48.

function. *J. Cell Biol*, 153: 917–931.

*Amer J. Pathol*., 156: 2135–47.

*BBRC*, 306: 213-218.

*Genome Res*., 11: 2120–26.

276: 955-60.

93:13595- 599.

cancer. *Nat. Rev. Cancer,* 1: 222–31.

endgame. *Science*, 284: 1948–1950.

and function. *Clin. Cancer Res.,* 6: 4848-58.

subgroups of breast cancer: Separate pathways in tumorigenesis based on pattern of genetic aberrations and inactivation of the pRb node. *Oncogene,* Vol*.* 21: 4680– 90.

sequence-based searches for conserved protein–protein interactions or '*interologs*',

long-lasting effects on vascular endothelial growth factor receptor phosphorylation

expression associates with high proliferation rate and poor prognosis in

domain of fibroblast growth factor receptor in complex with inhibitors. *Science,* Vol*.* 

apoptosis in human leukemic cells by the alkyl-lysophospholipid ET-18-OCH3.

activation function of BRCA1 C-terminal region. *Proc Natl Acad Sci USA*, Vol.

of caspase activation and apoptosis by flavopiridol on paclitaxel-treated human

regression by flavopiridol in the presence of CPT-11 in Hct116 colon cancer

arrest by CP-358,774, an inhibitor of epidermal growth factor receptor tyrosine

inhibitor p27kip1 is required for mouse mammary gland morphogenesis and

proliferation and their clinical implications in soft tissue smooth muscle tumors.


[113] Kawamata S., Sakaida H, Hori T, et al. (1998). The upregulation of p27Kip1 by

[114] King R.W.; Deshaies RJ, Peters JM, Kirschner MW. (1996). How proteolysis drives the

[115] Kettling, U., Koltermann, A., Schwille, P., and Eigen, M. (1998). Real-time enzyme

[117] Klint P.; and Claesson-Welsh L. (1999). Signal transduction by fibroblast growth factor

[118] Kodadek T. (2001). Protein Microarrays: Prospects and problems. *Chem. Biol*., Vol.

[119] Koltermann, A., Kettling, U., Bieschke, J., Winkler, T., and Eigen, M.(1998). Rapid

[120] Koonin EV, Altschul SF, Bork P. BRCA1 protein products: functional motifs. (1996).

[121] Koziczak M, Holbro T, and Hynes NE.(2004). Blocking of FGFR signaling inhibits

[122] Kuenen BC.; Rosen L, Smit EF, et al. (2002). Dose-finding and pharmacokinetics study

[124] Lee J.S.; Collins KM, Brown AL et al. 2000. hCds1-mediated phosphorylation of

[125] Lehner B and Fraser AG. (2004). A first-draft human protein-interaction map. *Genome*

[126] Lewis TS, Shapiro PS, Ahn NG. 1998. Signal transduction through MAP kinase

[127] Li, S. et al.(2004). A Map of the Interactome Network of the Metazoan *C. elegans.,* 

[128] Li, E. (2002). Chromatin modification and epigenetic reprogramming in mammalian

[129] Liu M, Bryant MS, Chen J, Lee S, et al. (1999). Effects of SCH 59228, an orally

[130] Lo YM. et al*.*(1999). Quantitative analysis of aberrant pl6 methylation using real-time

[131] Lockhart, D.J., et al. (1996). Expression monitoring by hybridization to high-density

bioavailable farnesyl protein transferase inhibitor, on the growth of oncogenetransformed fibroblasts and a human colon carcinoma xenograft in nude mice.

quantitative methylation-specific polymerase chain reaction. *Cancer Res.,* Vol*.* 59:

BRCA1 regulates the DNA damage response. *Nature*, 404: 201-4.

[116] Kitano, H. (2002). Systems biology: a brief overview. *Science*, 295, 1662–1664.

561-69.

8:105- 115.

1421-26.

20: 1657-67.

*Biol*., 5: R63.

3899-3903.

*Science*, 303: 540-543.

cell cycle. *Science,* 274: 1652-59.

receptors. *Front. Biosci*., 4: D165–D177.

*Nature Genetics*, Vol. 13: 266-68.

[123] Lakowicz JR. (2001). *Anal. Biochem*., 298: 1-24.

cascades. *Adv. Cancer Res.,* 74: 49-139.

development. *Nature Rev. Genet.,* 3: 662-73.

*Cancer Chemother. Pharmacol.,* 43:50-58.

oligonucleotide arrays. *Nat. Biotechnol.*, 14: 1675–80.

*Oncogene*, 23: 3501–08.

*PNAS-USA,* 95: 1416- 20.

rapamycin results in G1 arrest in exponentially growing T-cell lines. *Blood,* Vol. 91:

kinetics monitored by dual-color fluorescence cross-correlation spectroscopy.

assay processing by integration of dual-color fluorescence cross-correlation spectroscopy: High throughput screening for enzyme activity. *PNAS*-*USA,* 95:

breast cancer cell proliferation through downregulation of D-type cyclins.

of cisplatin, gemcitabine, and SU5416 in patients with solid tumors. *J. Clin. Oncol.,*


Oncogenomics and Cancer Interactomics 507

[166] Rand M et al. (2002). Conversion-specific detection of DNA methylation using real-

[167] Rigler R. and Widengren J. 1990. Ultrasensitive detection of single molecules by fluorescence correlation spectroscopy, *BioScience (Ed. Klinge & Owman*). p.180. [168] Rigler R., Mets Ü., Widengren J. and Kask P. (1993). Fluorescence correlation

[169] Rippe K. (2000). Simultaneous Binding of Two DNA Duplexes to the NtrC- Enhancer

[170] Rosen L, Mulay M, Mayers A, et al. (1999). Phase I dose-escalating trial of SU5416, a

[171] Ross DT, Scherf U, Eisen MB, et al. (2000). Systematic variation in gene expression

[172] Ruffner H, Verma IM. BRCA1 is a cell cycle-regulated nuclear phosphoprotein. 1997.

[173] Saeed M.R. et al. (2006). Protein-protein interactions, evolutionary rate, abundance

[174] Said MR, et al. (2004). Global network analysis of phenotypic effects: protein networks and toxicity modulation in *Saccharomyces cerevisiae*. *PNAS*-*USA*, 101:18006–11. [175] Salwinsky, L. et al. (2004). The Database of Interacting Proteins: 2004 update. *Nucleic* 

[176] Schellens JH, de Klerk G, Swart M, et al. (2000). Phase I and pharmacologic study with the novel farnesyltransferase inhibitor R115777. *Proc. Am. Soc. Clin. Oncol.,* 19: 715. [177] Schena, M. et al. (1995). Quantitative monitoring of gene expression patterns with a

[178] Schwille, P. (2001). Fluorescence Correlation Spectroscopy. Theory and applications.

[179] Schwille, P., Bieschke, J. and Oehlenschläger F. (1997). Kinetic investigations by

[180] Schwille P, Meyer-Almes F-J, and Rigler R. (1997). Dual-color fluorescence cross-

[181] Schwille P, Oehlenschläger F and Walter, NG. (1997). Comparative hybridization

[182] Schwille P, Oehlenschläger F and Walter N. (1996). Analysis of RNA-DNA

[183] Schwille P, Haupts U, Maiti S, and Webb W. (1999). Molecular dynamics in living cells

fluorescence correlation spectroscopy: The analytical and diagnostic potential of

correlation spectroscopy for multicomponent diffusional analysis in solution,

kinetics of DNA-oligonucleotides to a folded RNA target in solution. *Biophys.* 

hybridization kinetics by fluorescence correlation spectroscopy, *Biochemistry,* Vol.

observed by fluorescence correlation spectroscopy with one- and two-photon

patterns in human cancer cell lines. *Nature Genetics*, 24: 227-235.

complementary DNA microarray. *Science,* 270: 467-70.

diffusion studies, *Biophys. Chem.,* 66*:* 211-228.

excitation. *Biophysical Journal,* 77*(10):* 2251-65.

Rigler R and Elson ES. eds, Berlin: Springer Verlag. p. 360.

27:114-20.

diffusion, *Eur. Biophys J.,* 22: 69.

*Biochemistry, 39* (9): 2131-2139.

*Proc Natl Acad. Sci USA,* 94: 7138-43*.* 

and age. *BMC Bioinformatics*, 7: 128.

*acids Res*., 32: D449-D451.

*Biophys. J.,* 72: 1878-80.

*Chem*., Vol. 66: 211-228.

35: 10182.

*Clin. Oncol.,* 18: 618.

time polymerase chain reaction (ConLight-MSP) to avoid false positives. *Methods,* 

spectroscopy with high-count rate and low background: Analysis of translational

Complex Studied by Two-Color Fluorescence Cross-Correlation Spectroscopy.

novel angiogenesis inhibitor in patients with advanced malignancies. *Proc. Am. Soc.* 


[152] Ouichi T, Monteiro ANA, August A et al. (1998). BRCA1 regulates p53-dependent

[153] Pagel, P.; et al*.* (2005). The MIPS mammalian protein-protein interaction database.

[154] Pandey A, Mann M. (2000). Proteomics to study genes and genomes., *Nature*, Vol. 405:

[155] Pasini P, Musiani, M, Russo C, et al. (1998). Chemiluminescence imaging in bioanalysis. *Journal of Pharmacology and Biomedical Analysis,* 18: 555-64. [156] Patel V, Lahusen T, Sy T, et al.(2002). Perifosine, a novel alkylphospholipid, induces

[157] Paweletz CP, Charnoneau L, Bichsel VE, et al. (2001). Reverse phase protein

[158] Pendergast, G.C., Orliff, A. (2000). Farnesyltransferase inhibitors: antineoplastic

[159] Perou, C.M.; Sorlie, T., Eisen, M.B., et al. (2000). Molecular portraits of human breast

[160] Peri S. et al. (2003). Development of human protein reference database as an initial platform for approaching systems biology in humans, *Genome Res*., 13: 2363-2371. [161] Pinkel D, Gray JW, et al. (1998). High resolution analysis of DNA copy number

[162] Peng D, Fan Z, Lu Y, et al. (1996). Anti-epidermal growth factor receptor monoclonal

[164] Prisecaru V, and Baianu IC. (2004a). Cell Cycling Models of Carcinogenesis: A

[165] Prisecaru V., and Baianu IC. (2004b). Complex Biological Systems Analysis of Cell

pathways at the cancer invasion front, *Oncogene*, 20: 1981-89.

p21(WAF1) expression in squamous carcinoma cells through a p53- independent pathway, leading to loss in cyclin-dependent kinase activity and cell cycle arrest.

microarrays which capture disease progression show activation of pro-survival

properties, mechanisms of action, and clinical prospects. *Semin. Cancer Biol.,* Vol. 10:

variation using comparative genomic hybridization to microarrays. *Nature Genetics*,

antibody 225 up-regulates p27KIP1 and induces G1 arrest in prostatic cancer cell

Complex Systems Analysis. *q-bio.MN/0406046 Archive*, p.1-22. Available from:

Cycling Models in Carcinogenesis: I. The essential roles of modifications in the c-Myc, TP53/p53, p27 and hTERT modules in Cancer Initiation and Progression.

nucleic acid sequence-based amplification combined with fluorescence correlation

[149] Ohta T, Fukuda M. (2004). Ubiquitin and breast cancer. *Oncogene*, 23(11): 2079-88. [150] Ormandy CJ, Musgrove EA, Hui R, et al. (2003). Cyclin D1, EMS1 and 11q13 amplification in human breast cancers. *Breast Cancer Res. Treat.,* 78: 323–335. [151] Oehlenschläger F.; Schwille P, and Eigen M. (1996). Detection of HIV-1 RNA by

spectroscopy. *PNAS-USA,* 93: 1281.

*Bioinformatics*, 21: 821-34.

*Cancer Res.,* 62: 1401-9.

tumors. *Nature*, 406: 747-752.

line DU145. *Cancer Res.,* 56: 3666-69.

Cancersignaling\_ICBval.pdf

[163] Plass C. (2002). Cancer epigenomics. *Hum. Mol. Genet.,* 11: 2479-88.

*CERN Archive EXT-2004-057*., pp.1- 17. Available from:

http://doc.cern.ch/archive/electronic/other/ext/ext-2004-057/

http://lanl.arxiv.org/ftp/q-bio/papers/0406/0406046.pdf

837-46.

443-52.

20: 207-11.

gene expression. *Proc Natl Acad Sci USA*, 95: 2302-06.


Oncogenomics and Cancer Interactomics 509

[202] Tortora G, Caputo R, Pomatico G, et al. (1999). Cooperative inhibitory effect of novel

[203] Uberall F, Oberhuber H, Maly K, et al. (1991). Hexadecylphosphocholine inhibits inositol phosphate formation and protein kinase C activity. *Cancer Res.,* 51: 807-12. [204] van Diest PJ, Michalides RJ, Jannink L, et al. (1995). Cyclin D1 expression in invasive breast cancer: Correlation and prognostic value. *Amer. J. Pathol*., 150:705-11. [205] Velicescu, M. et al*.* (2002). Cell division is required for *de novo* methylation of CpG

[206] Velculescu VE.; Zhang L, Vogelstein B, et al. (1995). Serial analysis of gene expression.

[207] Velculescu, VE. (1999). Tantalizing Transcriptomes—SAGE and Its Use in Global Gene

[208] von Eggeling F.; Davies H, Lomas L, et al. (2000). Tissue-specific microdissection

[209] Wachi S, et al. (2005). Interactome—transcriptome analysis reveals the high centrality

[210] Walter N.; Schwille P. and Eigen M. (1996). Fluorescence correlation analysis of probe

[211] Wang Q, Fan S, Eastman A, et al. (1996). UCN-01: a potent abrogator of G2 checkpoint function in cancer cells with disrupted p53. *J. Natl. Cancer Inst.* 88: 956-965. [212] Wang Q, Zhang H, Kajino K et al. (1998). BRCA1 binds c-Myc and inhibits its transcriptional and transforming activity in cells. *Oncogene*, 17: 1939-48. [213] Weinstein J.N.; et al. (1997). An Information-Intensive Approach to the Molecular

[214] Weinstein, J.N. (2000). Pharmacogenomics-Teaching Old Drugs New Tricks., *New* 

[215] Wilson CA, Ramos L, Villasenor MR et al. (1999). Localization of human BRCA1 and its loss in high-grade non-inherited breast carcinoma. *Nature Genet ics*, 21: 236-40. [216] Winkler T, Kettling U, Koltermann, A, Eigen M. (1999). Confocal fluorescence

[217] Winkler T, Bieschke J, Schwille P. (1997). Development of a dual-color cross--

[218] Winkler T, Schwille P, Oehlenschläger F. (1998). Detection of HIV-1 RNA by NASBA-

[219] Wodicka L, Dong H, Mittmann M, et al. (1997). Genome-wide expression monitoring

in *Saccharomyces cerevisiae*., *Nature Biotechnol*., 15: 1359-1367.

coincidence analysis: An approach to ultra high-throughput screening. *PNAS-*

correlation system for FCS. Available from: http://www.mpibpc.gwdg.de

FCS: available from: www.mpibpc.gwdg.de/abteilungen/081/fcs/nasba/english

coupled with ProteinChip array technologies: Applications in cancer research.

of genes differentially expressed in lung cancer tissues. *Bioinformatics*, 21: 4205-

diffusion simplifies quantitative pathogen detection by PCR., *Proc. Natl. Acad. Sci.* 

cancer cell growth. *Clin. Cancer Res.,* 5: 875-81.

Expression Analysis. *Science*, 286 (5444): 1491-2.

Pharmacology of Cancer. *Science*, 275: 343-349.

/abteilungen/081/fcs/correlation/english.

[220] Wong, IH, et al. (1999)*. Cancer Res.,* 59: 71-73.

*Science*, 270: 484-487.

*USA,* 93*:* 12805-08.

4208.

*Biotechniques,* 29: 1066-1070.

*Engl. J. Med.,* 343:1408-1409.

*USA,* 96: 1375-1378.

islands in bladder cancer cells. *Cancer Res.* 62: 2378-2384.

mixed backbone oligonucleotide targeting protein kinase A in combination with docetaxel and anti-epidermal growth factor-receptor antibody on human breast


[184] Sebolt-Leopold JS, Dudley DT, Herrera R, et al. (1999). Blockade of the MAP kinase pathway suppresses growth of colon tumors in vivo. *Nat. Med.* 5:810-16. [185] Senderowicz AM, Sausville EA. (2000). Preclinical and clinical development of cyclin-

[186] Senior K. (1999). Fingerprinting disease with protein chip arrays. *Mol. Med. Today*, 5:

[187] Sekulic A, Hudson CC, Homme JL, et al. 2000. A direct linkage between the

Herceptin Multinational Investigator Study Group. *Semin. Oncol.,* 26:71-77. [189] Shao R, Cao C, Shimiu T, O'Connor PM, et al. (1997). Abrogation of an S-phase

[190] Shapiro GI, Supko JG, Patterson A, et al.(2001). A phase II trial of the cyclin-

[191] Sharan R et al. (2005). Conserved patterns of protein interactions in multiple species.

[192] Sidransky, D. (2002). Emerging molecular markers of cancer. *Nature Rev. Cancer,* 2:

[193] Silverman, L., R. Campbell, and J. R. Broach.(1998). New assay technologies for high throughput screening. *Current Opinion in Chemical Biology,* 2: 397-403. 11:825-28. [194] Snijders, A.M. et al. (2001). Assembly of microarrays for genome-wide measurement

[195] Somasundaram K, Zhang H, Zeng YX et al. 1997. Arrest of the cell cycle by the

[196] Sorlie, T., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. *PNAS* -*USA*, 98: 10869–74. [197] Staunton, JE, et al.(2001). Chemosensitivity prediction by transcriptional profiling.

[198] Sutherland RL, Musgrove EA. (2004). Cyclins and breast cancer. *J. Mammary Gland* 

[199] Tanaka H, Arakawa H, Yamaguchi T et al. (2000). A ribonucleotide reductase gene

[200] Terwogt JM, Mandjes IA, Sindermann H, et al. (1999). Phase II trial of topically

[201] Thompson NL. (1991). in *Topics of Fluorescence Spectroscopy*., Lakowicz, J.R. ed., New

involved in a p53-dependent cell-cycle checkpoint for DNA damage. *Nature,* Vol.

applied miltefosine solution in patients with skin-metastasized breast cancer. *Br. J.* 

tumour-suppressor BRCA1 requires the CDK-inhibitor p21 WAF1/CiP1. *Nature*,

iv non- small cell lung cancer. *Clin. Cancer Res.,* 7: 1590-99.

of DNA copy number. *Nature Genetics*, 29: 263-4.

York and London: Plenum Press, Vol.1. p. 337.

phosphoinositide 3-kinase-AKT signaling pathway and the mammalian target of rapamycin in mitogen-stimulated and transformed cells. *Cancer Res.,* 60: 3504-13. [188] Shak S. (1999). Overview of the trastuzumab (Herceptin) anti-HER2 monoclonal

antibody clinical program in HER2-overexpressing metastatic breast cancer.

checkpoint and potentiation of camptothecin cytotoxicity by 7 hydroxystaurosporine (UCN-01) in human cancer cell lines, possibly influenced by

dependent kinase inhibitor flavopiridol in patients with previously untreated stage

dependent kinase modulators. *J. Natl. Cancer Inst.,* 92:376-87.

p53 function. *Cancer Res.,* 57: 4029-35.

*PNAS-USA*, Vol. 102: 1974-1979.

*PNAS*-USA, 98 (19): 10787-92.

*Biol. Neoplasia*, 9(1): 95-104.

326-327.

210-19.

389: 187-90.

404: 42-49.

*Cancer,* 79:1158-61.


## **Part 7**

**Transcriptional Analysis** 

510 Bioinformatics – Trends and Methodologies


**23** 


### *In-silico* **Approaches for RNAi Post-Transcriptional Gene Regulation: Optimizing siRNA Design and Selection**

Mahmoud ElHefnawi1 and Mohamed Mysara2 *1National Research Center 2The University of Nottingham 1Egypt 2UK* 

#### **1. Introduction**

RNA interference (RNAi) is a naturally occurring, endogenous, post-transcriptional cellular mechanism that defends against foreign genetic elements such as viruses and inserted gene transcripts, and also regulates the cell's own gene expression. Small interfering RNA (siRNA) molecules utilize this mechanism to promote homology-dependent messenger RNA (mRNA) degradation.

siRNA has been used extensively as a research tool in functional genomics to silence gene expression. The unprecedented advantage of siRNA molecules, namely their ability to inhibit disease-causing genes effectively and specifically, has raised great expectations for therapeutic applications and drug discovery. The potential of siRNAs as drugs has been investigated in viral and cancer models, with successful results for diseases such as HIV, HCV and several types of cancer, most of which have no cure. One advantage of siRNA-based drugs is their feasibility in clinical trials following approval of phase 1. Moreover, they do not rely on an intact immune system, which gives them an advantage over long double-stranded RNA (dsRNA). However, several factors challenge the design of selective siRNA molecules with reliably high silencing efficiency. Therefore, careful selection of siRNAs complying with all necessary properties is crucial for efficient functional performance.

This chapter discusses RNA interference using small interfering RNA (siRNA), starting with the biological nature of mRNA and siRNA. It then tackles the factors contributing to siRNA-mRNA silencing, from both biological and bioinformatics perspectives, that affect siRNA effectiveness. Next, it presents a stepwise workflow for rational siRNA design using state-of-the-art tools and algorithms. Finally, various tools for the siRNA evaluation phases, which are used to predict siRNA efficiency and efficacy, are presented together with a practical example applying the proposed methodology.

#### **2. Small interfering RNA**

Small interfering RNAs (siRNAs) are one of the cell's defence mechanisms; they act not only against exogenous genetic material such as viral genes but also against the cell's endogenous genes, as one of the post-transcriptional regulation methods (Ullu et al. 2002). These naturally occurring siRNAs target mRNAs (whether over-expressed or abnormal) in a manner so selective and potent that they have become a core interest of many biologists in the last decade. Although siRNAs are not the only layer responsible for post-transcriptional regulation, they have the advantage of hardly invoking the innate immune response (interferon response), in contrast to long double-stranded RNA (Stark et al. 1998). In addition, siRNAs have been shown to be very promising new therapeutic agents for various diseases, especially cancer, AIDS and neurodegenerative disorders, most of which have no cure (Hutvágner & Zamore 2002; Surabhi & Gaynor 2002; Xia et al. 2004). Accordingly, siRNA has been used as a drug in human cancer clinical trials, producing the efficient and specific effect that was expected (Davis et al. 2010).

#### **2.1 siRNA mechanism of action**

The siRNA pathway is as follows: long dsRNA is cleaved by Dicer, a ribonuclease III-type enzyme, into short double-stranded siRNA duplexes (ds-siRNA), as illustrated in [Fig. 1]. Being homologous to the mRNA targeted for silencing, the siRNA triggers the formation of the RNA-induced silencing complex (RISC), into which the double-stranded siRNA is incorporated. The duplex is then unwound, giving a single-stranded siRNA that binds to the target mRNA sequence, resulting in its cleavage. Depending on the type of RISC complex, the RNAi action is directed through mRNA degradation, translational arrest or chromatin modification [5]. These mechanisms are detailed below:

Fig. 1. Small interfering RNA formed of two short RNA strands complementary to each other.

Due to the homology (similarity) between the double-stranded siRNA (ds-siRNA) and the targeted messenger RNA (mRNA), the assembly of a complex called the RNA-induced silencing complex (RISC) is triggered. After binding the ds-siRNA, RISC separates (unwinds) the duplex into the sense and antisense strands (the passenger and guide strands, respectively). Once unwound into a short single strand, the siRNA can act through three different mechanisms [Fig 2].
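The relationship between the passenger (sense) and guide (antisense) strands can be illustrated with a minimal Python sketch. The function name and the example sequence are hypothetical and not taken from any tool discussed in this chapter; the sketch also assumes a perfectly complementary duplex and ignores 3' overhangs.

```python
# Illustrative helper (an assumption, not part of any published tool):
# derive the guide (antisense) strand from the passenger (sense) strand.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def guide_from_passenger(passenger: str) -> str:
    """Return the reverse complement of an RNA strand written 5'->3'."""
    rna = passenger.upper().replace("T", "U")  # accept DNA-style input
    return rna.translate(COMPLEMENT)[::-1]


passenger = "GCAUGCUAGCUAGCUAGCA"  # made-up 19-nt sense strand
guide = guide_from_passenger(passenger)
print(guide)
```

Applying the function twice returns the original strand, which is a quick sanity check on any reverse-complement routine.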

#### **2.1.1 Direct cleavage method**

The single-stranded siRNA, together with RISC, binds to the targeted mRNA and induces its degradation by Ago-2 (a protein that, triggered by the RISC-siRNA complex, acts to break the targeted mRNA). The degraded mRNA is finally digested by what are called cellular lysosomes. This is the main mechanism by which siRNA causes selective and potent gene silencing, but it occurs only when there is a high level of similarity between the siRNA and the targeted mRNA region (Birmingham et al. 2006).

#### **2.1.2 Seed-mediated translational attenuation**


Complementarity between the siRNA seed region hexamer (positions 2 to 7) and the 3'UTR (untranslated region) of the mature mRNA has been found capable of inhibiting that mRNA's translation and causing its degradation (E. M. Anderson et al. 2008; Birmingham et al. 2006).
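As a rough illustration of seed matching, the sketch below extracts the hexamer at positions 2 to 7 and lists every site in a 3'UTR that is perfectly complementary to it. It assumes the seed is read from the 5' end of the guide (antisense) strand; the function names and both sequences are made up for the example.

```python
# Hypothetical seed-match scan; sequences are written 5'->3'.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    return rna.translate(COMPLEMENT)[::-1]


def seed_hexamer(guide: str) -> str:
    """Positions 2-7 of the guide strand (1-based), i.e. indices 1..6."""
    return guide[1:7]


def seed_match_positions(guide: str, utr3: str) -> list[int]:
    """0-based start positions in the 3'UTR that pair with the seed."""
    site = reverse_complement(seed_hexamer(guide))
    positions = []
    start = utr3.find(site)
    while start != -1:
        positions.append(start)
        start = utr3.find(site, start + 1)  # allow overlapping sites
    return positions


guide = "UACGGAUCAUGCUAGCUAGCA"    # made-up guide strand
utr3 = "GGGAUCCGUAAACCCAUCCGUUU"   # made-up 3'UTR fragment
print(seed_match_positions(guide, utr3))  # [3, 15]
```

A candidate whose seed matches the 3'UTRs of many unintended transcripts would be a poor choice on these grounds.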

Fig. 2. Naturally occurring siRNA synthesis pathway and its three possible mechanisms of action. Endogenous (naturally occurring) siRNAs are produced from either microRNA or long double-stranded RNA, which the Dicer enzyme cleaves into double-stranded siRNA. Both endogenous and exogenous (introduced by researchers) ds-siRNA pass through an activation process in which the RNA-induced silencing complex (RISC) unwinds the duplex, yielding the lead single-stranded siRNA. The RISC-single-stranded siRNA complex then silences the targeted gene by one of three mechanisms: 1) binding to the mRNA, leading to its cleavage through the Ago2 mechanism; 2) binding to the 3' end and mediating translational attenuation of the mRNA; 3) gene silencing through chromatin modification [Figure from the work of (Dorsett & Thomas Tuschl 2004)].

#### **2.1.3 Chromatin modification**

siRNA has another mechanism of interference, chromatin modification, as illustrated by Dorsett and Tuschl in their description of Scherer's work, in which siRNA is one of the three major nucleic-acid-based gene silencing mechanisms (Dorsett & Thomas Tuschl 2004).


### **3. Factors that affect siRNA design**

Several factors are known to affect the design of effective and specific siRNA, and they help in understanding the interaction between siRNA and the targeted mRNA. These factors can be classified into four major classes, as illustrated by Birmingham (Birmingham et al. 2007). **Firstly**, the targeted region, or what is called the "sequence space": this class handles the identification of the regions in the mRNA to be targeted by the designed siRNA. This step is highly critical, as targeting the wrong region would abolish the effect of all designed siRNAs. The sequence space is affected by several factors: transcript region, transcript size, mRNA multiple splicing, orthologs consensus and single nucleotide polymorphisms. **Secondly**, siRNA sequence space preparation: here internal repeats, positional preferences and other desirable/undesirable words/motifs are discussed. **Thirdly**, siRNA thermodynamic properties and the target accessibility of both siRNA and mRNA: this class includes parameters such as GC content and palindromes, in addition to thermodynamic stability and differential end stability, which have been identified as highly important factors in siRNA selection. **Fourthly**, siRNA specificity, which describes the mechanisms through which siRNA could invoke an immune reaction or have an off-target effect. Each of these factors can greatly affect siRNA selection and should therefore be studied thoroughly.

#### **3.1 Target sequence space [Targeted region preprocessing]**

Targeted regions (or what is called the "sequence space") are the areas of the mRNA that should be assigned for targeting by the designed siRNA. Five factors affect the selection of the proper sequence space, as summarized in (Birmingham et al. 2007).

#### **3.1.1 Transcript regions and size**

siRNA should target regions of the mRNA that are not affected by the maturation process, hence the 3'UTR, the 5'UTR and (most importantly) the open reading frame (ORF) [Fig 3]. Normally both the 3'UTR and the 5'UTR can be excluded from the targeted sequence space, unless the sequence space needs to be widened. If the mRNA length is < 500 nucleotides, the 3'UTR and 5'UTR should be included in target space selection.

Fig. 3. Maturation process from premature to mature mRNA. This figure illustrates the regions that change through omission and insertion during maturation: the introns (non-coding areas) are omitted, and the 5' cap and 3' tail are added (Mysara 2010).
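The rule for transcript regions and size in Section 3.1.1 can be sketched in a few lines of Python. The `Transcript` class, the `target_space` function and the `widen` flag are illustrative assumptions; only the 500-nucleotide cut-off comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Mature mRNA split into its three regions (hypothetical container)."""
    utr5: str
    orf: str
    utr3: str


def target_space(t: Transcript, min_length: int = 500, widen: bool = False) -> str:
    """Return the sequence in which siRNA target sites are searched.

    The ORF is used by default; the UTRs are added when the whole mRNA is
    shorter than `min_length` or when the caller asks to widen the space.
    """
    full = t.utr5 + t.orf + t.utr3
    if widen or len(full) < min_length:
        return full
    return t.orf


short_mrna = Transcript(utr5="AA", orf="AUG" + "C" * 10, utr3="UU")
long_mrna = Transcript(utr5="A" * 50, orf="G" * 600, utr3="U" * 50)
print(len(target_space(short_mrna)), len(target_space(long_mrna)))  # 17 600
```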


#### **3.1.2 Multiple splicing and orthologs consensus**

One pre-mRNA can code for several proteins, because the splicing process is accompanied by rearrangement of exons. There are several mechanisms of alternative (differential) splicing, such as exon insertion or deletion, but the main mechanism, as described in the work of Black, is exon skipping (Black 2003). This phenomenon forms a huge obstacle if all the mRNA transcripts need to be targeted; therefore, regions common to all of them should be recognized and targeted [Fig 4]. All of the gene's mRNA transcripts should be included in target space selection. When handling multiple organisms (as in global vaccines, or rapidly mutating species such as viruses), the consensus between the different targeted mRNAs should be considered in the target space selection.
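One simple way to find regions common to all transcripts is to intersect their sets of fixed-length subsequences. The sketch below is an illustration under that assumption, not the method of any particular design tool; the exon sequences are made up, and a short word length is used only to keep the example readable (a real design would use the siRNA length, for example 19).

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_target_sites(transcripts: list[str], k: int = 19) -> set[str]:
    """Candidate sites of length k present in every transcript."""
    if not transcripts:
        return set()
    common = kmers(transcripts[0], k)
    for seq in transcripts[1:]:
        common &= kmers(seq, k)
    return common


# Made-up exons; isoform B skips exon 2 (exon skipping).
exon1, exon2, exon3 = "AUGGCC", "AAAUUU", "GGGCCC"
isoform_a = exon1 + exon2 + exon3
isoform_b = exon1 + exon3
print(sorted(shared_target_sites([isoform_a, isoform_b], k=4)))
```

Sites spanning an exon-exon junction that exists in only one isoform drop out of the intersection, which is the desired behaviour when all transcripts must be silenced. The same intersection can be applied to orthologous mRNAs when a cross-species consensus is needed.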

#### **3.1.3 Single Nucleotide Polymorphism (SNPs)**

Single nucleotide polymorphisms (SNPs) are crucial in siRNA design: a difference of one (or several) nucleotide(s) can cause a dramatic shift in the produced protein (or in its regulation), or can have no noticeable effect, in which case it is called a silent polymorphism. SNPs occur in two main locations, non-coding and coding regions [Fig 5]. The first is the **non-coding region**: an SNP in an intron will not affect the mature mRNA, and thus will not affect the siRNA targeting it. However, if the SNP is located in the 3' UTR or 5' UTR, caution should be taken when the siRNA is designed to target these regions. The second is the **coding region**: when an SNP lies in the protein-coding region (ORF or exons), there are two possibilities: either the SNP does not affect the produced protein, owing to the degeneracy of the genetic code, or it changes the produced protein; hence siRNAs targeting this region should be excluded.
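Excluding SNP-containing sites amounts to discarding every candidate window that overlaps a known SNP position. A minimal sketch follows, assuming 0-based SNP coordinates on the mature mRNA; the function name and the default window width of 19 nt are illustrative choices.

```python
def snp_free_windows(length: int, snp_positions: list[int], window: int = 19) -> list[int]:
    """0-based start indices of windows of the given width that contain no SNP."""
    snps = sorted(set(snp_positions))
    starts = []
    for start in range(length - window + 1):
        end = start + window  # exclusive end of the window
        if not any(start <= pos < end for pos in snps):
            starts.append(start)
    return starts


# A 10-nt toy transcript with one SNP at index 4 and 3-nt windows.
print(snp_free_windows(10, [4], window=3))  # [0, 1, 5, 6, 7]
```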

Fig. 4. mRNA alternative splicing results in several transcripts from the same gene. Each of these transcripts is later translated into a different protein. The functions of these proteins may or may not be similar to each other.


#### **3.2 SiRNA sequence space [Positional/word preferences]**

Positional/word preferences in the sense/antisense strand of the siRNA are a crucial determinant of siRNA functionality. Several position-dependent preferences have been identified from the analysis of experimental siRNA datasets, and these can affect the siRNA selection process. Among these preferences are the following.

(Ui-Tei et al. 2004): (i) A/U at the 5' end of the antisense strand; (ii) G/C at the 5' end of the sense strand; (iii) at least five A/U residues in the 5' terminal one-third of the antisense strand; and (iv) the absence of any GC stretch of more than 9 nt in length.

(Reynolds et al. 2004): (I) At least 3 'A/U' bases at positions 15–19 (sense strand). (II) Absence of internal repeats. (III) An 'A' base at position 19 (sense strand). (IV) An 'A' base at position 3 (sense strand). (V) A 'U' base at position 10 (sense strand). (VI) A base other than 'G' or 'C' at position 19 (sense strand). (VII) A base other than 'G' at position 13 (sense strand).

(Mohammed Amarzguioui & Prydz 2004): asymmetry in the stability of the duplex ends (measured as the A/U differential of the three terminal base pairs at either end of the duplex) and the motifs S1, A6, and W19. The presence of the motifs U1 or G19 was associated with lack of functionality.
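The Ui-Tei criteria and the position-based Reynolds criteria listed above translate directly into boolean checks. The sketch below is an illustration only: the function names are hypothetical, the "5' terminal one-third" is taken as the first `len // 3` bases of the antisense strand (an assumption), the Reynolds internal-repeat criterion (II) is omitted because it needs a thermodynamic calculation, and no weighting of the criteria is implied.

```python
import re

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense_of(sense: str) -> str:
    """Reverse complement of the sense strand (overhangs ignored)."""
    return sense.translate(COMPLEMENT)[::-1]


def ui_tei_checks(sense: str) -> dict[str, bool]:
    anti = antisense_of(sense)
    third = len(anti) // 3  # assumed reading of "one-third"
    return {
        "antisense_5p_is_AU": anti[0] in "AU",
        "sense_5p_is_GC": sense[0] in "GC",
        "five_AU_in_antisense_5p_third": sum(b in "AU" for b in anti[:third]) >= 5,
        "no_GC_stretch_over_9": re.search(r"[GC]{10,}", sense) is None,
    }


def reynolds_positional_checks(sense: str) -> dict[str, bool]:
    """Criteria I and III-VII for a 19-nt sense strand (1-based positions)."""
    if len(sense) != 19:
        raise ValueError("expected a 19-nt sense strand")
    return {
        "three_AU_at_15_to_19": sum(b in "AU" for b in sense[14:19]) >= 3,
        "A_at_19": sense[18] == "A",
        "A_at_3": sense[2] == "A",
        "U_at_10": sense[9] == "U",
        "not_GC_at_19": sense[18] not in "GC",
        "not_G_at_13": sense[12] != "G",
    }


candidate = "GCACGAUCCUGCAUUAAUA"  # made-up 19-nt sense strand
print(ui_tei_checks(candidate))
print(reynolds_positional_checks(candidate))
```

Returning the individual checks rather than a single score keeps the sketch neutral about how a given tool combines them.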

Fig. 5. Classification of SNPs according to region of occurrence in the mRNA.

Several sequence features of the siRNA duplex can affect its efficiency. As in (Birmingham et al. 2007), candidate duplexes with five or more of any single base in a row should be removed. Although less detrimental than G/C stretches, repeated bases have also been shown to reduce functionality. Sequences containing stretches of a repeated base are less selective, and A or U/T stretches may additionally target regulatory motifs. Moreover, candidate duplexes with more than six consecutive G's and/or C's should be removed, since stretches of G's and C's have been shown to be one of the strongest negative determinants of siRNA activity. Such regions have pronounced local stability, greatly inhibiting duplex dissociation. In addition, GC-rich stretches are not compatible with some synthetic nucleic acid chemistries utilized in vector-based expression.
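Both stretch filters can be expressed as regular expressions. The following is a minimal sketch with hypothetical names; the thresholds (five or more identical bases, more than six consecutive G/C) are the ones stated above.

```python
import re

# Five or more of any single base in a row.
SINGLE_BASE_RUN = re.compile(r"A{5,}|C{5,}|G{5,}|U{5,}")
# More than six consecutive G's and/or C's, i.e. seven or more.
GC_RUN = re.compile(r"[GC]{7,}")


def passes_stretch_filters(seq: str) -> bool:
    """True if the candidate has neither a long single-base run nor a long G/C run."""
    rna = seq.upper().replace("T", "U")  # accept DNA-style input
    return SINGLE_BASE_RUN.search(rna) is None and GC_RUN.search(rna) is None


print(passes_stretch_filters("GCAUAAAAAGCUAGCUAGC"))  # False: run of five A's
print(passes_stretch_filters("AUGCGCGCGAUAUAUAUAU"))  # False: seven G/C in a row
print(passes_stretch_filters("AUGCGCGCAUAUAUAUAUA"))  # True
```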

#### **3.3 The target accessibility evaluation**

Several studies have illustrated the structural and sequence features affecting siRNA functionality; all of these aspects affect siRNA and mRNA accessibility (Patzel et al. 2005; Ladunga 2007). Target accessibility evaluation is crucial for the proper design of efficient siRNA, because mRNA tends to form secondary structure that affects its accessibility and hence reduces the ability to design siRNAs targeting certain regions of the mRNA. Target accessibility evaluation therefore indicates where the mRNA is more likely to be accessed by short oligomers such as siRNAs; it involves not only evaluation of the mRNA secondary structure but also energetic calculations for the siRNA and the mRNA. For an interaction between two RNA sequences (siRNA and mRNA), two types of energy are involved: first, the energy required to open the binding site (in both the siRNA duplex and the mRNA), and second, the energy gained by hybridization; the sum of these energies is defined as the interaction energy. The energy required for opening the siRNA duplex and the mRNA should be less than the hybridization energy between the siRNA and the mRNA. There is an evidence-based correlation between siRNA inhibition efficiency and siRNA-mRNA binding energy (Mückstein et al. 2006), which strengthens the findings of Ladunga, in which target accessibility information was found to provide the most predictive feature among the 142 features studied and to improve the prediction of highly efficient siRNA (Ladunga 2007). Other parameters affecting target accessibility are presented below:
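The energy balance described above can be written compactly. The notation below is an assumption introduced here for illustration (the chapter itself gives no symbols): opening energies are taken as non-negative costs and the hybridization energy as a non-positive gain.

```latex
\Delta G_{\mathrm{interaction}}
  = \Delta G_{\mathrm{open}}^{\mathrm{siRNA}}
  + \Delta G_{\mathrm{open}}^{\mathrm{mRNA}}
  + \Delta G_{\mathrm{hybrid}},
\qquad
\Delta G_{\mathrm{open}}^{\mathrm{siRNA}}, \Delta G_{\mathrm{open}}^{\mathrm{mRNA}} \ge 0,
\quad
\Delta G_{\mathrm{hybrid}} \le 0 .
```

Under this convention, binding is favourable when $\Delta G_{\mathrm{open}}^{\mathrm{siRNA}} + \Delta G_{\mathrm{open}}^{\mathrm{mRNA}} < |\Delta G_{\mathrm{hybrid}}|$, that is, when the interaction energy is negative.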

#### **3.3.1 GC content**


GC content represent the percentage of Guanine and Cytosine (two of the four nucleotides types that build the mRNA) should not be too high in order not to impair the double strand siRNA unwinding and enables the ease of RISC protein entrance.
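The GC-content measure and the base-stretch filters described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any published tool: the function names are hypothetical, the run-length defaults (five identical bases, more than six consecutive G/C) are taken from the text, and no GC-content cut-off is hard-coded because the text does not give one.

```python
import re


def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def has_single_base_run(seq: str, min_run: int = 5) -> bool:
    """True if any single base occurs min_run or more times in a row."""
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return re.search(pattern, seq.upper()) is not None


def has_gc_stretch(seq: str, max_run: int = 6) -> bool:
    """True if more than max_run consecutive G/C bases are present."""
    pattern = r"[GC]{%d,}" % (max_run + 1)
    return re.search(pattern, seq.upper()) is not None
```

A candidate would be rejected when either stretch test returns `True`; the acceptable GC-content range is left to the user.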

#### **3.3.2 Palindrome**

Palindromes should be addressed in target accessibility evaluation. They arise where a region in one strand binds to another region in the same strand due to reverse complementarity. Palindromes should therefore be avoided in siRNA design, as they tend to form intramolecular (secondary) structure, which impairs RISC binding [Fig 6].

Fig. 6. Palindrome patterns and their effect on siRNA binding to RISC and the targeted mRNA. Palindromes change the secondary structure of the double-stranded siRNA, which in turn affects its ability to bind to RISC and the targeted mRNA (Mysara 2010).
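As a rough illustration of how such palindromic regions might be detected, the sketch below looks for pairs of non-overlapping k-mers within one strand that are reverse complements of each other. It is a simplification under stated assumptions: only Watson-Crick pairs are considered (no G-U wobble), the window size `k` is a hypothetical parameter, and no folding energy is computed.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def self_complementary_regions(rna: str, k: int = 4):
    """Return (i, j) start positions, i < j, of non-overlapping k-mers
    where the k-mer at j is the reverse complement of the k-mer at i."""
    rna = rna.upper()
    hits = []
    for i in range(len(rna) - k + 1):
        partner = reverse_complement(rna[i:i + k])
        j = rna.find(partner, i + k)  # search only downstream, no overlap
        while j != -1:
            hits.append((i, j))
            j = rna.find(partner, j + 1)
    return hits
```

A strand for which this returns hits at larger values of `k` is a candidate for intramolecular folding and could be flagged for closer inspection.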

#### **3.3.3 Thermodynamic stability**

It is important to keep or introduce relative thermodynamic stability at both ends of the siRNA (the 3' and 5' ends) and low stability in the central zone, as these facilitate ds-siRNA cleavage.

#### **3.3.4 Differential end stability**

Differential end stability is considered one of the most important features affecting siRNA functionality (Schwarz et al. 2003) [Fig 7]. RISC binds to either the sense or the antisense strand, but in different ratios. This ratio depends on the "differential stability" between the first couple of bases at the 5' end of each strand. These bases determine what is called the thermodynamic stability (TDS): the lower the stability, the better the strand binds to RISC. It has been found that only the antisense (leading) strand is capable of causing gene silencing (Dorsett & Thomas Tuschl 2004). Therefore, it is essential to design siRNA with a lower TDS at the 5' end of the antisense strand than at its 3' end, to obtain better binding with RISC and better efficiency in silencing the target mRNA.

Fig. 7. RISC and differential end stability. This figure illustrates the effect of differential end stability on RISC annealing with ds-siRNA. It is very important to ensure that the 5' end of the siRNA lead strand is less stable than the 5' end of the passenger strand. This way RISC forms a complex only with the lead strand, which is designed to bind to the targeted mRNA.
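A crude proxy for differential end stability is the A/U differential of the terminal base pairs mentioned in section 3.2. The sketch below assumes the sense strand is given 5' to 3' without overhangs, so that the 5' end of the antisense strand pairs with the 3' end of the sense strand. The function name and the default of three terminal base pairs are illustrative; a real TDS calculation would use nearest-neighbour free energies instead of base counts.

```python
def au_count(seq: str) -> int:
    """Number of A or U bases in a sequence."""
    return sum(base in "AU" for base in seq.upper())


def au_differential(sense: str, n_terminal: int = 3) -> int:
    """A/U pairs at the antisense 5' end minus A/U pairs at the sense 5' end.

    A positive value means the duplex end holding the antisense 5' terminus
    is richer in (weaker) A/U pairs, which favours antisense loading.
    """
    sense = sense.upper().replace("T", "U")
    if len(sense) < 2 * n_terminal:
        raise ValueError("sequence too short for the requested terminal window")
    antisense_5p_end = sense[-n_terminal:]  # pairs with the antisense 5' end
    sense_5p_end = sense[:n_terminal]
    return au_count(antisense_5p_end) - au_count(sense_5p_end)
```

Candidates with a positive differential would be preferred under this heuristic.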

#### **3.3.5 The number of single-stranded base pairs at the 5' and 3' ends of the target mRNA**

The number of single-stranded bases at the ends of the target mRNA has recently been shown by Patzel and Kaufmann to contribute significantly to the effectiveness of siRNAs (S. H. E. Kaufmann & Patzel 2008; Patzel, Rutz et al. 2005). The same conclusion was reached by Gredell and co-workers in July 2008.

#### **3.4 siRNA specificity [The off-targeting effect]**

*"Ideally, the siRNA must not cause any effects other than those related to the knock down of the target gene"* (Semizarov et al. 2003). It is essential that the designed siRNA affects only the targeted mRNA. In other words, the siRNA should neither invoke innate immunity nor affect any off-target mRNA.

#### **3.4.1 Innate immunity effect**

Concerning the innate immunity effect, rational selection of an appropriate siRNA length (21-23 nucleotides) avoids triggering innate immunity (Birmingham et al. 2006). However, although duplexes of less than 30 nt are short enough to evade immunorecognition by cytosolic double-stranded RNA (dsRNA) receptors, they are long enough to trigger sequence-dependent recognition by Toll-like receptor 7 (Patzel et al. 2005). Recognizing motifs such as 5'-GUCCUUCAA-3', 5'-UGUGU-3' and tetrad-forming poly(G) stretches, and avoiding their presence in the synthesized siRNA, helps to overcome Toll-like receptor recognition. Several studies have used chemical modification to mask the innate immune response (interferon response) (Patzel 2007).
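The motif screen described above can be sketched as a simple string search. The two sequence motifs come from the text; the minimum length of a poly(G) stretch is not given there, so the default of four consecutive G's is an assumption, as is the function name.

```python
import re

# Motifs named in the text as Toll-like receptor recognition signals.
SEQUENCE_MOTIFS = ("GUCCUUCAA", "UGUGU")


def immune_motif_hits(strand: str, min_g_run: int = 4):
    """Return the immunostimulatory motifs found in one siRNA strand.

    min_g_run is an assumed threshold for a tetrad-forming poly(G) stretch.
    """
    rna = strand.upper().replace("T", "U")
    hits = [motif for motif in SEQUENCE_MOTIFS if motif in rna]
    if re.search("G{%d,}" % min_g_run, rna):
        hits.append("poly(G)")
    return hits
```

Both the sense and the antisense strand would be checked, and a candidate kept only if both calls return an empty list.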

#### **3.4.2 Off-target effect**

Apart from that comes the problem of off-targets, which is one of the most important factors in siRNA selection and filtration. An "siRNA off-target" is any target, other than the assigned one, that is affected by the siRNA. It is very common for an siRNA to have multiple targets, as siRNAs are only 21-23 nucleotides in length; there is therefore a good chance that an siRNA could match more than one mRNA. In fact, as observed in the work of Jackson et al., both the sense and the antisense strands are known to have off-target effects on several mRNA transcripts (Jackson et al. 2003; Jackson & Linsley 2010). There are different mechanisms through which siRNA can trigger off-targeting actions:

#### **I) Complete or near complete off-target matches (siRNA-like effects)**

This mechanism is triggered whenever the designed siRNA is completely identical (or identical apart from one mismatch) to a region in the off-targeted mRNA. Such complete or near-complete matches between siRNA and mRNA lead to the destruction of that mRNA by the same mechanism through which siRNA silences the targeted mRNA, as described before.

#### **II) Partial off-target (miRNA-like effects through Seed matching off-target)**

If the seed region of the designed siRNA (second to seventh position) matches the 3'UTR of an off-targeted mRNA, translation of the off-target is affected, as illustrated by (E. M. Anderson et al. 2008). Such siRNAs are therefore considered to have partial off-targets and should be excluded. Chemical modifications have been applied here to reduce off-targeting and increase specificity (Birmingham et al. 2007). These off-target effects (complete, near complete or partial) are responsible for loss of specificity, as they cause unwanted silencing of other protein-synthesising genes. Moreover, they also cause loss of siRNA potency, as the unwanted targeting of other mRNAs can make these siRNAs unavailable at their original targets (Semizarov et al. 2003; Vert et al. 2006).
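The seed-matching test can be illustrated as follows. This sketch assumes the antisense (guide) strand is written 5' to 3', takes positions 2 to 7 as the seed as stated above, and searches a 3'UTR sequence for the exactly complementary site. The function names are hypothetical, and a practical screen would run against all 3'UTRs of the transcriptome.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(antisense: str, start: int = 2, end: int = 7) -> str:
    """mRNA site complementary to guide positions start..end (1-based, inclusive)."""
    guide = antisense.upper().replace("T", "U")
    seed = guide[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]  # reverse complement of the seed


def seed_matches_in_utr(antisense: str, utr3: str):
    """Return 0-based start positions of seed-complementary sites in a 3'UTR."""
    site = seed_site(antisense)
    utr = utr3.upper().replace("T", "U")
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)
    return positions
```

For example, a guide beginning `UGAGGUAG...` has the seed `GAGGUA`, so any 3'UTR containing `UACCUC` would be reported as a partial off-target site.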

In addition to those types of off-target effects, there is protein interaction: siRNAs are known to bind to different cellular proteins and alter them, which is known as the "aptamer effect", as described in (Semizarov et al. 2003). Moreover, sequence motifs that interfere with RNA synthesis and purification should be avoided, such as guanine-rich RNA sequences and sequences containing consecutive stretches of more than three G bases (Patzel et al. 2005).

Fig. 8. Different phases for designing siRNA with high efficiency and sensitivity. There are seven distinct phases of siRNA design: 1st, choosing the targeted gene for silencing; 2nd, identifying the proper target sequence space that represents all of the gene's transcripts and does not contain any unstable regions; 3rd, designing all possible siRNAs of nineteen nucleotides in length, with both sense and antisense strands; 4th, scoring and evaluating these potential siRNAs according to several scoring mechanisms and criteria, then filtering them according to the resulting scores; 5th, filtering the siRNAs according to target accessibility; 6th, off-target filtration of the remaining siRNAs, excluding those with unwanted off-target effects; 7th, selecting the best designed siRNAs that pass all the previous filtration phases and achieve the highest predicted efficiency (Mysara 2010).

#### **3.5 siRNA duplex chemical modification**

Several chemical modifications can be introduced into the designed siRNA with the aim of enhancing its tolerability, improving its stability, limiting its off-target effects and conjugating it properly with a tracking agent. Multiple types of chemical modification are typically introduced into siRNAs, as summarized in the work of (Birmingham et al. 2007):

I) Sense strand disabling: this is done to increase the specificity and efficiency of the designed siRNA. Various approaches have been used, such as 2' ribose modifications, including 2'-OR where R = fluoro, alkyl or O-alkyl, and LNA modifications at the 5' end of the sense strand.

II) Stabilization: Chemical modifications of the phosphate backbone (e.g. phosphorothioate linkages), the ribose (e.g. locked nucleic acids, 2'-deoxy-2'-fluorouridine, 2'-O-ethyl), and/or the base (e.g. 2'-fluoropyrimidines) increase the resistance of siRNA to nucleases. Stability of siRNAs in biological fluids requires various modifications, such as 2'-halogen, 2'-alkyl and/or 2'-O-alkyl modifications of one or both strands of the siRNA, as well as stabilizing internucleotide modifications of the overhangs. Care should be taken that these modifications do not interfere with siRNA efficiency.

III) Specificity: Chemical modifications aimed at increasing siRNA specificity and decreasing its off-target activity include 2'-O-alkyl modification of unique positions of the sense and/or antisense strand. These modification patterns severely limit sense and antisense off-target effects by disrupting seed-mediated off-target activity.

IV) Conjugations: siRNAs have been conjugated with lipophilic derivatives of cholesterol, lauric acid or lithocholic acid to enhance their cellular uptake and specificity (Lorenz et al. 2004). The safest sites for conjugation are the 5' and 3' termini of the sense strand.

#### **4. Guidelines for siRNA rational design**

Having discussed the factors influencing siRNA efficacy, we present here our methodology and phases for efficient rational siRNA design. This methodology was originally inspired by the repeated influenza pandemics and our attempts to design a novel siRNA therapy that would work for any new pandemic (ElHefnawi, Alaidi et al. 2011). There are seven phases that should be considered when designing siRNA with high specificity and efficiency [Fig 8].

#### **4.1 Targeted gene selection**

"Targeted gene" selection is extremely critical for siRNA design, as the purpose of gene silencing is to stop the expression of specific abnormal proteins, most commonly proteins involved in biological pathways such as "cancer pathways". The protein of interest should therefore play a key role in the pathway in order for the silencing process to produce the desired therapeutic effect. Consequently, there is a need to search for the key regulatory proteins annotated in biological pathway databases such as "Reactome" and "KEGG" (http://www.reactome.org/ & http://www.genome.jp/kegg/pathway.html) and to design siRNA capable of targeting them.

#### **4.2 Targeted sequence specification and filtration**

After selecting the gene of interest, all of its available transcripts should be located, since it is the transcript(s) rather than the gene itself that is targeted. In some instances only one transcript should be targeted; in this case all other transcripts should be excluded, as targeting them is considered a lack of specificity (off-target). On the other hand, if there is a need to silence all of the gene's transcripts (which is very common), several options are available for handling this situation [Fig 9]: either mapping the transcripts on the genome, as proposed in the work of (Y.-kyu Park et al. 2008), or using multiple sequence alignment (MSA). With multiple sequence alignment, one approach is to align the different gene transcripts and select the transcript with the highest identity to the alignment profile, in other words the transcript that best represents all the other transcripts. Another approach is to consider only the regions that all transcripts have in common (conserved regions).

Fig. 9. Different approaches to handling multiple gene transcripts. Two approaches are proposed: the 1st is to align the different gene transcripts in order to obtain the alignment profile and to choose the transcript closest to that profile; the 2nd is to obtain the gapped consensus of these transcripts and to choose the regions that are 100% conserved between them (Mysara 2010).

The latter method ensures that the designed siRNA targets all of the gene's transcripts. This is very important, as a single mismatch between the target mRNA (transcript) and the siRNA can dramatically affect siRNA efficiency: (Czauderna et al. 2003) observed a noticeable decrease in efficiency when a central single-nucleotide variation was introduced between the siRNA and the targeted mRNA, in agreement with the findings of (M. Amarzguioui 2003; Sayda M Elbashir et al. 2002). However, one of the main disadvantages of this approach is that it narrows the target sequence space, and it is possible that no active siRNA will pass the multi-filtration phases described in [Fig 8]. In this case, the sequence space should be widened by including the 3'UTR and 5'UTR or by using the first approach.

After locating the targeted sequence space, both SNPs and unstable (highly variable) positions should be identified (if any), and any designed siRNAs targeting these residues or regions should be rejected. This way the targeted sequence space will be limited to the mRNA (or the conserved region among the different gene transcripts), representing either the ORF or the ORF + 3'UTR + 5'UTR, free from any SNPs or unstable regions.
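The conserved-region approach combined with SNP exclusion can be sketched as below. The example assumes the transcripts have already been aligned by an external MSA tool into equal-length strings with `-` for gaps, and that SNP or unstable positions are supplied as alignment column indices. The function name, the input format and the `min_len` default of 19 are illustrative assumptions.

```python
def conserved_segments(aligned, min_len=19, excluded=()):
    """Return half-open (start, end) column ranges that are identical in all
    aligned transcripts, contain no gap and no excluded (e.g. SNP) column."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal aligned length")
    reference = aligned[0].upper()
    others = [s.upper() for s in aligned[1:]]
    excluded = set(excluded)
    segments, start = [], None
    # One extra iteration past the last column closes a trailing run.
    for col in range(len(reference) + 1):
        conserved = (
            col < len(reference)
            and col not in excluded
            and reference[col] != "-"
            and all(s[col] == reference[col] for s in others)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                segments.append((start, col))
            start = None
    return segments
```

Only segments at least as long as the intended siRNA are returned, since shorter conserved stretches cannot host a full-length candidate.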

#### **4.3 Designing all possible siRNA targeting the selected regions**

This section describes the proper siRNA length, which ensures high efficiency and stability while having a neutral effect on host innate immunity. It then discusses how to select siRNAs from the mRNA sequence space.

#### **4.3.1 Selection of the appropriate siRNA length**

Short siRNAs have advantages over long double-stranded RNA: they do not trigger an immune response and they silence the targeted mRNA more efficiently. However, siRNAs with a length of thirty nucleotides were found to be inactive (S M Elbashir, Lendeckel, et al. 2001). The selection of the proper siRNA length was subsequently studied heavily, and upper and lower limits were established. Shortening the length from nineteen to seventeen nucleotides impaired the siRNA's capability to silence the targeted gene, as at least nineteen nucleotides are required for RISC binding (Czauderna et al. 2003). For the upper limit, siRNAs with lengths from 18 to 23 nucleotides were found to be at least eight-fold more effective than other lengths, and siRNAs of 24-25 nucleotides were completely inactive (S M Elbashir, J Martinez, et al. 2001). Therefore, a length of 19-21 nucleotides plus a 2-nt overhang is appropriate for siRNA design, and any deviation above or below this range will have a direct effect on siRNA activity. It was also demonstrated that using 2-3 nucleotide dT overhangs at the 3' end increases the efficiency of antisense strand loading into the RISC complex.

#### **4.3.2 Picking up siRNA from sequence space**

After establishing the desired siRNA length (19 nucleotides), all possible siRNA molecules should be considered by shifting one nucleotide at a time [Fig 10] until the end of the sequence space is reached. Although in the ideal case each gene would have only one transcript with no SNPs or highly variable regions, for the vast majority of genes the sequence space consists of separate pieces (not intact), as shown in [Fig. 9]. Therefore, selection of the 19-nucleotide siRNAs should be restricted to those stretches of the sequence space that are free from any gaps.


Fig. 10. Designing all possible siRNAs using a fixed frame shift. After choosing the appropriate length (most probably 19 nucleotides + 2 overhangs), the target sequence space is scanned and all possible siRNAs are generated with a one-nucleotide shift at a time (Mysara 2010).
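The one-nucleotide sliding window can be sketched as follows. The example assumes that each gap-free piece of the sequence space is passed in separately as an RNA string, that the sense strand is identical to the target site, and that the antisense strand is its reverse complement; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def enumerate_sirnas(segment: str, length: int = 19):
    """Yield (offset, sense, antisense) for every window of the given length,
    moving one nucleotide at a time along a gap-free target segment."""
    segment = segment.upper().replace("T", "U")
    for offset in range(len(segment) - length + 1):
        sense = segment[offset:offset + length]
        yield offset, sense, reverse_complement(sense)
```

A gap-free segment of n nucleotides yields n - 18 candidates of 19 nucleotides, and a segment shorter than 19 nucleotides yields none.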

the presence of partial off-targets by excluding siRNAs with matches between their seed regions (second to seventh position) and the 3' UTR of the off-targeted mRNAs. This way the selected siRNA candidates would have the required specificity [Fig 11].

Fig. 11. Off-target filtration workflow describing the decision-making process for siRNA off-target filtration.

#### **4.7 Selecting the best designed siRNA**

The final step is sorting the acceptable siRNA candidates according to the predicted inhibition score. The top 10-50 candidates (if applicable) are then ordered for synthesis, adding UU or dTdT to the 3' ends. The final result is a double-stranded siRNA containing a sense strand and an antisense (leading) strand, each with a two-nucleotide 3' overhang. Several chemical modifications can also be applied to the ds-siRNA to increase its stability or efficiency or to neutralize the immune response (Birmingham et al. 2007).

#### **4.4 siRNA scoring and scores filtration**

This is the most important stage of siRNA design, as proper scoring and evaluation of siRNA activity reduces time and cost. Developing siRNA scoring tools with enhanced specificity and sensitivity would help further in that regard. As a single mRNA would normally produce thousands of potential siRNAs, these siRNAs need to be evaluated in order to filter them down to a smaller number suitable for experimental testing. Several tools have been developed to predict siRNA activity; they differ considerably in their prediction capabilities in terms of specificity and sensitivity. They use different rules and were trained on various datasets; therefore, careful evaluation and selection of the right tool is essential for the siRNA scoring phase. siRNA scoring is explained in more detail in the next section of the chapter.

#### **4.5 siRNAs target accessibility filtration**

The interaction between two RNA sequences (siRNA and mRNA) involves two types of energy: the energy required to open the binding site, and the energy of hybridization. Several programs calculate each energy; among them, *RNAduplex* calculates the duplex energy and *RNAplfold* calculates the opening energy for the ds-siRNA and the targeted mRNA (target site accessibility energy). Both *RNAduplex* and *RNAplfold* belong to the Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA/). Two more tools offer further advantages: *RNAup* and *RNAxs*.

#### **4.5.1 RNAup**

*RNAup* (also part of the Vienna RNA package) calculates all three energies needed to assess the interaction energy (Mückstein et al. 2006). *RNAup* first calculates the probability that sequence intervals (obtained by splitting the sequence into small subsequences) are unpaired. It then computes the interaction energy and selects the interactions with the lowest free energy (i.e. the highest stability). However, it cannot handle sequences longer than 5000 nucleotides, as it needs a lot of memory.

#### **4.5.2 RNAxs**

The *RNAxs* program (a modification of the older *RNAplfold*) evaluates siRNA efficiency on the basis of target accessibility; it combines *RNAplfold*, *RNAfold* and *RNAduplex* (Tafer et al. 2008). *RNAxs* offers two major advantages: reduced run time and a single-phase process. These were shown by (Hofacker & Tafer 2010) in a comparative experiment between *RNAxs* and two other target-accessibility-based programs (*OligoWalk* and *Sirna*). Only *RNAxs* was able to identify siRNAs with inhibition efficiency >50% and to classify 50% of the experimental siRNAs, giving it a higher prediction capability than the other two programs.

#### **4.6 siRNAs off-target filtration**

All siRNAs that pass the assigned thresholds of each scoring tool are then filtered by their tendency to trigger off-target effects. As described in a later section (under siRNA specificity), siRNAs with complete or near-complete matches to an off-target mRNA (19/19, 18/19 or 18/18) should be excluded first. Next, the remaining siRNAs are filtered for partial off-targets by excluding siRNAs whose seed regions (second to seventh position) match the 3' UTR of off-target mRNAs. This way the selected siRNA candidates have the required specificity [Fig 11].

Fig. 11. Off-target filtration workflow describing the decision-making process for siRNA off-target filtration.
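The two-stage decision can be illustrated with a minimal Python sketch. It is not the implementation of any published tool. The function names are hypothetical, the siRNA is represented by its antisense (guide) strand written 5' to 3', near-complete homology is approximated as at most one mismatch across the full target site, and a seed match is taken to be an exact occurrence of the site complementary to guide positions 2-7 in the off-target 3' UTR.

```python
from typing import Iterable, Tuple

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq: str) -> str:
    """Upper-case the sequence and write it in the RNA alphabet."""
    return seq.upper().replace("T", "U")


def reverse_complement(seq: str) -> str:
    return to_rna(seq).translate(COMPLEMENT)[::-1]


def has_near_full_match(guide: str, mrna: str, max_mismatches: int = 1) -> bool:
    """True if the mRNA holds a site pairing with the whole guide (<= max_mismatches)."""
    site = reverse_complement(guide)
    mrna = to_rna(mrna)
    n = len(site)
    for i in range(len(mrna) - n + 1):
        mismatches = sum(1 for a, b in zip(site, mrna[i:i + n]) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def has_seed_match(guide: str, utr3: str) -> bool:
    """True if the 3' UTR contains the site complementary to guide positions 2-7."""
    seed_site = reverse_complement(to_rna(guide)[1:7])
    return seed_site in to_rna(utr3)


def is_off_target_free(guide: str, off_targets: Iterable[Tuple[str, str]]) -> bool:
    """off_targets holds (full mRNA, 3' UTR) pairs for every non-target transcript."""
    for mrna, utr3 in off_targets:
        if has_near_full_match(guide, mrna) or has_seed_match(guide, utr3):
            return False
    return True
```

A real pipeline would normally run the full-homology stage with an alignment tool such as BLAST rather than the quadratic scan used here; the scan is kept only to make the decision logic explicit.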

#### **4.7 Selecting the best designed siRNA**

The final step is sorting the acceptable siRNA candidates by predicted inhibition score, taking the top 10-50 (if applicable) and ordering them for synthesis with UU or dTdT added to the 3' ends. The final result is a double-stranded siRNA containing a leading strand and an antisense strand with two 3'-end overhangs. Several chemical modifications can also be applied to the ds-siRNA to increase its stability or efficiency, or to neutralize the immune response (Birmingham et al. 2007).
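As a rough sketch of this last step, the snippet below ranks candidates by predicted score and appends the overhang to both strands. The `Candidate` class and the function `select_for_synthesis` are hypothetical names introduced only for illustration; the default of ten candidates and the `dTdT` overhang follow the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    sense: str       # 19-nt leading (sense) strand, 5' to 3'
    antisense: str   # 19-nt antisense strand, 5' to 3'
    score: float     # predicted inhibition score from the scoring stage


def select_for_synthesis(candidates: List[Candidate],
                         top_n: int = 10,
                         overhang: str = "dTdT") -> List[Tuple[str, str, float]]:
    """Return the top_n candidates, best score first, with 3' overhangs added."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]
    return [(c.sense + overhang, c.antisense + overhang, c.score) for c in ranked]
```

Passing `overhang="UU"` gives the ribonucleotide variant mentioned in the text.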


#### **4.8 Automation of siRNA design**

This stepwise approach for designing siRNAs with acceptable target accessibility properties that pass the predicted score, SNP and off-target filtration can be automated using various programs and tools. The extent to which those steps are considered varies from one program to another, according to the algorithm used and the state-of-the-art protocol available at the time. In our previous work, we developed MysiRNA-Designer, an siRNA design tool that implements all of the steps presented above [**Table 1**] (Mysara, J. Garibaldi, et al. 2011).


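| Tool name | Multi-transcripts consideration | Conserved region analysis | SNPs evaluation | Multi-algorithms scoring | 2ry structure evaluation | Target accessibility | Full homology off-target | Seed region off-target | Server based |
|---|---|---|---|---|---|---|---|---|---|
| MysiRNA-Designer | + | + | + | + | + | + | + | + | - |
| siDESIGN Center \*1 | + | + | + | - | - | - | + | + | + |
| Asi-Designer \*2 | + | - | + | - | + | - | + | - | + |
| RNAxs \*3 | - | - | - | - | + | + | - | - | + |
| siDRM \*4 | - | - | - | - | - | - | + | + | + |

Table 1. Comparison between MysiRNA-Designer and several programs used for fully automated siRNA design. The comparison covers each tool's ability to align different transcripts and consider conserved regions, to evaluate siRNA candidates using several algorithms and target accessibility, and to filter siRNAs by the presence of Single Nucleotide Polymorphisms and off-targets (both full homology and seed regions) (Mysara, J. Garibaldi, et al. 2011, submitted). \*1 siDESIGN Center at http://www.dharmacon.com/designcenter/DesignCenterPage.aspx. \*2 Asi-Designer available at http://sysbio.kribb.re.kr:8080/AsiDesigner/menuDesigner.jsf. \*3 RNAxs available at http://rna.tbi.univie.ac.at/cgi-bin/RNAxs. \*4 siDRM available at http://sidrm.biolead.org/index.php.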

#### **5. Models used for predicting siRNA activity**

There are several methods for scoring and predicting the activity of designed siRNAs, some more accurate than others. They are classified into two groups (Ichihara et al. 2007): (i) Huesken dataset non-dependent [first generation], and (ii) Huesken dataset dependent [second generation].

#### **5.1 Huesken dataset non-dependent [First Generation]**

These tools were developed to select the most efficient siRNAs. They depend on the differential thermodynamic stability of the duplex ends, mRNA secondary structure, base preferences at specific positions, and target uniqueness. Examples of these rules are Reynolds (Reynolds et al. 2004), Amarzguioui (Mohammed Amarzguioui & Prydz 2004), Takasaki (Takasaki et al. 2004), Katoh (Katoh & Suzuki 2007), Ui-Tei (Ui-Tei et al. 2004) and Hsieh (Hsieh et al. 2004). However, these first generation scoring techniques have been shown to have low accuracy: up to 65% of the siRNAs predicted as active by these tools failed to achieve 90% inhibition when tested experimentally, and up to 20% of them were false positives, as described by (Ren et al. 2006). Therefore, there was a need for another approach that not only takes position-specific features into consideration but also applies data mining techniques to interpret the experimentally obtained data.

#### **5.2 Huesken dataset dependent [Second Generation]**

This class was developed mainly through observation of experimental data. A dataset of fully annotated, experimentally tested siRNAs with their different efficiencies, which enables sophisticated data mining, was not available until the Novartis dataset introduced by Huesken (Huesken et al. 2006). That dataset was used to train several scoring tools, such as *Biopredsi* (Huesken et al. 2006), *DSIR* (Vert et al. 2006), *ThermoComposition21* (S.A. Shabalina et al. 2006), *i-Score* (Ichihara et al. 2007) and *Scales* (Matveeva et al. 2007).

These scoring techniques predict siRNA efficiency more accurately than the older tools. Although they use completely different algorithms to evaluate siRNA efficiency, the second generation algorithms have very similar accuracy to one another, as described by (Ichihara et al. 2007). In the comparative study in Ichihara's work, all second generation tools (except Scales) and only Reynolds and Katoh from the first generation achieved 90% successful prediction. Moreover, the sensitivity of Reynolds and Katoh (which appear to have similar accuracy) was at least 8-fold lower than that of the second generation tools (this also supports Ren's findings mentioned earlier). Here, we present the basic information about each member of this group of tools and provide a comparison between them [Table 2].

#### **5.2.1 Biopred**


In Biopred, an artificial neural network (ANN) was trained on a large number of records (2,182 training and 259 test), considering not only single nucleotide residues but also certain patterns (such as di-nucleotides). This work is considered the start of the second generation siRNA approaches and marked a noticeable shift in scoring accuracy. Although the ANN made the model opaque and hindered further development (due to the complexity of the model), it was considered at that time the best way to handle all these different parameters. The server-based Biopred model was later simulated and released as the Excel-based Biopredsi tool together with i-Score, which is described later on (Ichihara et al. 2007).

#### **5.2.2 DSIR**

In DSIR, the authors used exactly the same training and test data as Biopredsi but with a simplified linear regression model that gives predictions based on two main sequence features and three main parameters, with a Pearson correlation coefficient of 0.67. The main sequence features are the presence of A/U at the first position of the 5' end of the guide strand and the absence of cytosine from both positions seven and eleven. The main parameters used to build the model are sparse21, spectra21 and the composition representation. These three parameters divide the siRNA in different ways and calculate the total score of all of them, providing a very representative and interpretable method to evaluate a siRNA sequence; see Table 2.
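The two sequence features named above are simple to test for. The sketch below only checks whether they are present; it is not the DSIR regression model and produces no score. The function name is hypothetical, and positions are assumed to be counted from the 5' end of the guide strand starting at 1.

```python
from typing import Dict


def preferred_sequence_features(guide: str) -> Dict[str, bool]:
    """Report the position-specific features described in the text for one guide strand."""
    g = guide.upper().replace("T", "U")
    if len(g) < 11:
        raise ValueError("guide strand must cover at least positions 1-11")
    return {
        "au_at_position_1": g[0] in "AU",    # A/U at the first 5' position
        "no_c_at_position_7": g[6] != "C",   # cytosine absent at position 7
        "no_c_at_position_11": g[10] != "C", # cytosine absent at position 11
    }


if __name__ == "__main__":
    print(preferred_sequence_features("UACGUAGACUAGCUAGCUA"))
```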


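| Parameter | Description |
|---|---|
| Position dependent consensus | Several positions have been identified as conserved among effective siRNAs, so siRNAs are scored out of 11 for the presence of desirable residues and out of 10 (as a negative score) for the presence of undesirable residues at specific positions. |
| Dinucleotide Content Index | The occurrence of some dinucleotide combinations exceeds the random distribution; combining these unique pairs with the level of effectiveness of the siRNAs showed (or, more precisely, confirmed) that a low frequency of G/C dinucleotide pairs accompanies high siRNA efficacy. |
| Thermodynamic Profile & Free Energy (∆G) | The difference in free energy (or its opposite, "stability") between the 5' and 3' ends, especially over the last 2, 3, 4 and 5 positions from each side, plays a crucial role not only in distinguishing the sense and antisense strands but also in efficiency evaluation. |

Table 2. Description of parameters considered by ThermoComposition (Mysara 2010).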

#### **5.2.3 Thermo composition**

Here, a small number of parameters were used to train a neural network with 653 siRNA records as the training set. Three parameters were carefully selected from 18 candidates; this small number gave the model the advantage of simplicity over other neural networks, since no huge training dataset is required, and opened the door to further development. The uniqueness of this work is that it combines "position dependent features" with "thermodynamic features" [Table 2].

#### **5.2.4 i-Score**

In i-Score, a linear regression model was built by identifying the nucleotide preferred at each position, and the inhibition score (i-Score) was calculated on the Huesken dataset (2,431 records) with a Pearson correlation coefficient of 0.635. This work also pointed out a very important threshold: excluding thermostable siRNAs (with whole stacking energy ∆G < -34.6 kcal/mol) improved the score accuracy of not only i-Score but also DSIR, Biopredsi and ThermoComposition21 (Ichihara et al. 2007).
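Applied as a filter, the thermostability threshold amounts to a single comparison. The sketch below assumes the whole-duplex ∆G of each siRNA has already been computed by another tool; the function name is hypothetical, and treating a value exactly at the cutoff as acceptable is an assumption.

```python
from typing import Iterable, List, Tuple

WHOLE_DG_CUTOFF = -34.6  # kcal/mol, the threshold quoted in the text


def exclude_thermostable(scored: Iterable[Tuple[str, float]],
                         cutoff: float = WHOLE_DG_CUTOFF) -> List[Tuple[str, float]]:
    """Keep (siRNA, whole dG) pairs whose dG is not below the cutoff."""
    return [(sirna, dg) for sirna, dg in scored if dg >= cutoff]
```

Because more negative ∆G means a more stable duplex, the filter discards values below the cutoff and keeps the rest.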

#### **5.2.5 Scales**


Matveeva's team scored siRNAs in "siRNA Scales" by fitting a linear regression model on the local stability of the siRNA duplex and other parameters, using the Huesken dataset for training and three other datasets from various pharmaceutical companies for validation. Linear regression provided additional advantages over neural networks, as it allowed different weights to be assigned to the same parameter at different positions, which cannot be done for a single node parameter in a neural network. In "Scales", the linear regression was built on two sets of parameters: the first covers the stability of the siRNA ends, especially the first and last two base pairs of the siRNA; the second depends on the evaluation of certain nucleotides at specific positions. A comparison between all second generation tools is provided in [**Table 3**] (Mysara 2010).

However, all these models have limitations in performance. Recent efforts aim to enhance siRNA scoring by applying a second artificial intelligence layer that depends on the scores predicted by other second generation tools, as in the MysiRNA model. MysiRNA is an siRNA functionality/efficacy prediction model developed by combining two existing scoring algorithms (ThermoComposition21 and i-Score), together with the whole stacking energy (∆G), in a multi-layer artificial neural network. This combination was found to increase the correlation coefficient of the prediction from 0.5 for Scales to 0.6 for the MysiRNA model (Mysara, M. Elhefnawi, et al. 2011, submitted).

#### **6. Experimental section**

Here we present an example of applying the protocol described above to design siRNAs targeting the human TP53 gene, which has been identified as an oncogene. We start by finding the P53 mRNA by searching the NCBI Nucleotide database, which gives the mRNA RefSeq ID "NM\_000546.4" for Homo sapiens tumor protein p53 (TP53), transcript variant 1. Since we need to target all of the gene's transcripts, we should find all available transcripts. One way to do that is to BLAST the mRNA RefSeq database, searching for mRNAs sharing the same name and organism, using NCBI remote BLAST. Seven different mRNAs were identified: NM\_000546, NM\_001126112, NM\_001126115, NM\_001126117, NM\_001126116, NM\_001126114 and NM\_001126113. All of these transcripts were then aligned together in order to identify conserved regions. We used ClustalW to align the 7 transcripts, whose lengths are 2586, 2583, 2271, 2331, 2404, 2719 and 2646 respectively. The resulting alignment file was then processed with the "cons" tool to find the consensus between those transcripts, using 100% conservation.
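The idea behind a 100% conservation consensus can be shown with a short Python sketch. It is an illustration only and does not reproduce the "cons" tool or its options: the function `strict_consensus` is a hypothetical name, the input is assumed to be a list of already aligned sequences of equal length with `-` as the gap character, and every column that is not identical across all sequences is masked with `N`.

```python
from typing import List


def strict_consensus(aligned: List[str], gap: str = "-") -> str:
    """Keep a base only where every aligned sequence agrees; mask the rest with N."""
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        bases = {base.upper() for base in column}
        if len(bases) == 1 and gap not in bases:
            consensus.append(bases.pop())
        else:
            consensus.append("N")  # variable or gapped column
    return "".join(consensus)


if __name__ == "__main__":
    print(strict_consensus(["ACGT-A", "ACCTTA", "ACGTTA"]))
```

Runs of `N` in such a consensus mark the regions that are not shared by all transcripts and are therefore unsuitable as target sites.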


*In-silico* Approaches for RNAiPost-Transcriptional

CTTTGCTGCCACCTGTGTGTCTGAGGGGTG

model, were accepted.

Gene Regulation: Optimizing siRNA Design and Selection 533

CTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTAC CACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGA GGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGG AACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGA AGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACT AAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCAC TGGATGGAGAATATTTCACCCTTCAGNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGC CTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCAC TCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCAT GTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTG ACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAAC CCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCG GGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGG GAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGA TGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAG GGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCACTGTTGAATTTTCTCTA ACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCAT TGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGG GGTCTAGAACTTGACCCCCTTGAGGGTGCTTGTTCCCTCTCCCTGTTGGTCGGTGGGT TGGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTCTGCT GGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTGAACCTTAGTACCTAAAAGGAA ATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATC CACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCT TTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCT TGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAG TAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAG AGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATC CACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTCCAGC TGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTT CCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAA

Then, we used this consensus to evaluate its target accessibility using RNAxs, finding all possible regions to be targeted by siRNA. 1033 possible siRNA were designed using RNAxs. Those siRNAs were evaluated using 10 siRNA efficiency prediction tools as Reynolds (Reynolds et al. 2004), Amarzguioui (Mohammed Amarzguioui & Prydz 2004), Takasaki (Takasaki et al. 2004), Katoh (Katoh & Suzuki 2007), Ui-Tei (Ui-Tei et al. 2004), Hsieh (Hsieh et al. 2004), *Biopredsi* (Huesken et al. 2006), *DSIR* (Vert et al. 2006), *ThermoComposition21* (S.A. Shabalina et al. 2006) and *i-Score* (Ichihara et al. 2007). Selecting siRNA passing 90% or 0.90 predicted score. 111 siRNAs passed these filtration processes, those siRNAs were searched to identify SNPs occurrence residues. All of those 111 siRNAs were found to be targeting SNPs free regions. The last step was to filter those siRNAs against mRNA dataset, to identify those having off-targets. Any siRNA with either complete or partial off-target should be excluded. 85 siRNAs were found to be off-target free candidates. Finally they were filtered and only siRNA with inhibition efficiency above 90%, according to MysiRNA


| Tool | Technique | Training dataset used | Disadvantages | Tool available at |
|---|---|---|---|---|
| **DSIR** (Vert et al. 2006) | Linear regression | 2,431 records from Huesken dataset | Too many parameters, resulting in slow performance | http://biodev.extra.cea.fr/DSIR/DSIR.html |
| **ThermoComposition\*** (S.A. Shabalina et al. 2006) | Neural network | 653 records from different publications | Small number of parameters for ANN training. Slow due to secondary structure calculation | ftp://ftp.ncbi.nlm.nih.gov/pub/shabalin/siRNA/ThermoComposition |
| **i-Score** (Ichihara et al. 2007) | Linear regression | 2,431 records from Huesken dataset | Relatively low statistical accuracy | http://www.med.nagoya-u.ac.jp/neurogenetics/i\_Score/i\_score.html |
| **Scales** (Matveeva et al. 2007) | Linear regression | 2,431 records from Huesken dataset | Relatively low statistical accuracy | http://gesteland.genetics.utah.edu/siRNA\_scales/ |

Table 3. Comparison between different second generation scoring techniques (Mysara 2010). \*This applies to ThermoComposition19; for ThermoComposition21, however, the Huesken dataset was used for training and the 653 siRNAs were used for validation.

The resulting consensus is shown below:

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNCCANGNNATGGANGNTNNNNTGNNCTGTNNNCNNACNAN NNTGNNCNANGNNTNANTGANAGANCNAGNNCCNNNNNNANNNNNNANAAN NNNANAGNNNNNNNCNCCNGNGGNNNNNNCNCCANCNNNNNCTANNNCGNN NGNNNNNTGCANNAGNNNCNTNCNNNNNNNTGTNNNNTTNCTGNCNNNTNCCA GNNNNNNTANCNGNNCANNTNNGNNNTNNNTNTNNNCTNNNTGNNTNCTNNN NNNNNNNNNTCNNNNNCNTNCANGTACTCCCCTGCCCTCAACAAGATGTTTTGCC AACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGC ACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGT GAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCA GCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACA

CTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTAC CACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGA GGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGG AACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGA AGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACT AAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCAC TGGATGGAGAATATTTCACCCTTCAGNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGC CTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCAC TCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCAT GTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTG ACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAAC CCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCG GGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGG GAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGA TGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAG GGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCACTGTTGAATTTTCTCTA ACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCAT TGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGG GGTCTAGAACTTGACCCCCTTGAGGGTGCTTGTTCCCTCTCCCTGTTGGTCGGTGGGT TGGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTCTGCT GGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTGAACCTTAGTACCTAAAAGGAA ATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATC CACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCT TTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCT TGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAG TAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAG AGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATC CACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTCCAGC TGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTT CCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAA CTTTGCTGCCACCTGTGTGTCTGAGGGGTG

We then used this consensus to evaluate target accessibility with RNAxs, identifying all regions that could be targeted by siRNA; RNAxs designed 1,033 possible siRNAs. These siRNAs were evaluated using 10 siRNA efficiency prediction tools: Reynolds (Reynolds et al. 2004), Amarzguioui (Amarzguioui & Prydz 2004), Takasaki (Takasaki et al. 2004), Katoh (Katoh & Suzuki 2007), Ui-Tei (Ui-Tei et al. 2004), Hsieh (Hsieh et al. 2004), *Biopredsi* (Huesken et al. 2006), *DSIR* (Vert et al. 2006), *ThermoComposition21* (S.A. Shabalina et al. 2006) and *i-Score* (Ichihara et al. 2007). Only siRNAs with a predicted score of at least 90% (0.90) were selected. The 111 siRNAs that passed this filtration were then checked for overlap with SNP positions; all 111 were found to target SNP-free regions. The next step was to screen these siRNAs against the mRNA dataset to identify those with off-targets, excluding any siRNA with a complete or partial off-target match; 85 siRNAs were found to be free of off-targets. Finally, only siRNAs with a predicted inhibition efficiency above 90% according to the MysiRNA model were accepted.
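The selection steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the chapter's implementation: the `Candidate` record, the function names and the rule that every tool score must reach the threshold are hypothetical, and the scores, SNP flags and off-target flags would have to come from the external tools (RNAxs, the scoring tools, an off-target search), none of which is called here. The two helper functions only show how a 19-nt sense strand can be read from the DNA consensus and how the core of its antisense strand follows as the reverse complement.

```python
from dataclasses import dataclass

# Watson-Crick complements for RNA bases.
_RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def sense_from_consensus(consensus_dna: str, position: int, length: int = 19) -> str:
    """Return the RNA sense strand starting at a 1-based position of a DNA consensus."""
    window = consensus_dna[position - 1 : position - 1 + length].upper()
    if len(window) < length or "N" in window:
        raise ValueError("window is truncated or overlaps a masked (N) region")
    return window.replace("T", "U")


def antisense_core(sense_rna: str) -> str:
    """Reverse complement of the sense strand (the 3' overhang is not included)."""
    return "".join(_RNA_COMPLEMENT[base] for base in reversed(sense_rna))


@dataclass
class Candidate:
    """Hypothetical record for one designed siRNA."""
    position: int
    sense: str
    scores: dict          # tool name -> predicted efficiency in [0, 1]
    overlaps_snp: bool    # True if the target site covers a known SNP
    has_off_target: bool  # True if a complete or partial off-target was found


def passes_filters(candidate: Candidate, threshold: float = 0.90) -> bool:
    """Assumed rule: every tool score reaches the threshold, and the
    candidate is free of SNP overlaps and off-target matches."""
    return (
        all(score >= threshold for score in candidate.scores.values())
        and not candidate.overlaps_snp
        and not candidate.has_off_target
    )


# Short fragment taken from the consensus sequence shown above.
fragment = "GCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATA"
sense = sense_from_consensus(fragment, 1)
print(sense, antisense_core(sense))
# prints "GCGUGUGGAGUAUUUGGAU AUCCAAAUACUCCACACGC"
```

Rejecting windows that contain `N` keeps candidates out of the masked, non-conserved parts of the consensus. Whether a candidate must pass all tools or only a subset is a design choice the text leaves open, so the `all(...)` rule is only one possible reading.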

| siRNA position | Sense | Antisense | Predicted efficiency |
|---|---|---|---|
| 800 | GCGUGUGGAGUAUUUGGAU | AUCCAAAUACUCCACACGCaa | 90.6% |
| 822 | AGAAACACUUUUCGACAUA | UAUGUCGAAAAGUGUUUCUgu | 92.6% |
| 883 | GUACCACCAUCCACUACAA | UUGUAGUGGAUGGUGGUACag | 91.5% |
| 1330 | CCCGCCAUAAAAAACUCAU | AUGAGUUUUUUAUGGCGGGag | 91.4% |
| 1842 | GAAACCACCUUUUAUUACA | UGUAAUAAAAGGUGGUUUCaa | 92.3% |
| 1915 | GGUGGGUUGGUAGUUUCUA | UAGAAACUACCAACCCACCga | 92.1% |
| 1919 | GGUUGGUAGUUUCUACAGU | ACUGUAGAAACUACCAACCca | 90% |
| 2016 | CCUUAGUACCUAAAAGGAA | UUCCUUUUAGGUACUAAGGuu | 95.4% |
| 2111 | GCUCAGGGUCAAUUUCUUU | AAAGAAAUUGACCCUGAGCau | 92.9% |
| 2499 | CCCUCCUUCUCCCUUUUUA | UAAAAAGGGAGAAGGAGGGga | 92% |
| 2530 | CUCCCUUUUUAUAUCCCAU | AUGGGAUAUAAAAAGGGAGaa | 91.5% |
| 25030 | AUAUCGAUCUCUUAUUUUA | UAAAAUAAGAGAUCGAUAUaa | 93.1% |

Table 4. Final siRNA candidates after all stages of design and filtration.

In other work by ElHefnawi et al. (2011), optimal siRNAs were designed and selected as silencers for difficult targets such as the hepatitis C virus (HCV) and the influenza A virus. These designs have been experimentally tested to verify the methodology, and the results are under publication.

#### **7. Conclusion**

In this chapter we provide a comprehensive foundation for the bioinformatics methodology underlying the optimal design and selection of siRNA molecules. We address the factors affecting siRNA interference on both the siRNA and the mRNA side. These factors fall into four major classes. **The first class of factors**, the "**targeted region**" or "target sequence space", concerns how to identify the regions of the mRNA that the designed siRNA should target; it comprises five factors: transcript region, transcript size, multiple splicing of the mRNA, single nucleotide polymorphisms and ortholog consensus. **The second class of factors**, the "**siRNA sequence space**", concerns positional and word preferences in the sense and antisense strands of the siRNA. It is affected by several factors, including nucleotide positional preferences, GC content and palindromes. In addition, thermodynamic stability and differential end instability have been identified as highly important factors in siRNA functionality. **The third class of factors** is "**target accessibility**": targeted mRNAs tend to form secondary structures that limit their accessibility and hence reduce the ability of the designed siRNA to target certain regions of the mRNA. Target accessibility is considered as the sum of the energy required to open the mRNA and the siRNA duplex and the energy required to stabilize the siRNA-mRNA duplex. **The fourth class of factors**, "**off-target matches**", influences siRNA specificity through perfect-match and partial off-targets and through sequence motifs that provoke an immune reaction. Each of these classes can greatly affect siRNA selection, and each is therefore studied thoroughly in this chapter.

We present a stepwise protocol for designing siRNAs with the highest specificity and sensitivity in seven phases: targeted gene assignment; targeted sequence specification and filtration; design of all possible siRNAs targeting the selected regions; siRNA scoring and score filtration; siRNA target accessibility filtration; siRNA off-target filtration; and selection of the best designed siRNA. We cover state-of-the-art tools for siRNA efficiency prediction in two generations. **The first generation tools** select the most efficient siRNAs based on differential end thermodynamic stability, mRNA secondary structure, position-specific base preferences and target uniqueness. **The second generation tools** were developed by applying sophisticated data mining techniques to large annotated sets of siRNAs with experimentally measured inhibition, as in the artificial neural network models of *Biopredsi*, *ThermoComposition21* and *Scales* and the linear regression models of *DSIR* and *i-Score*. At the end of the chapter, we design siRNAs targeting the human P53 protein as a practical example of the proposed protocol. Future directions include finding additional factors that affect shRNAs (siRNAs inserted into expression vectors) and further decrease the efficacy of the siRNAs expressed from them, and later extending this methodology to miRNA target recognition prediction.

#### **8. References**

Amarzguioui, M., 2003. Tolerance for mutations and chemical modifications in a siRNA. *Nucleic Acids Research*, 31(2), pp.589-595.

Amarzguioui, M. & Prydz, H., 2004. An algorithm for selection of functional siRNA sequences. *Biochemical and biophysical research communications*, 316(4), pp.1050-8.

Anderson, E.M. et al., 2008. Experimental validation of the importance of seed complement frequency to siRNA specificity. *RNA (New York, N.Y.)*, 14(5), pp.853-61.

Birmingham, A. et al., 2006. 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. *Nature methods*, 3(3), pp.199-204.

Birmingham, A. et al., 2007. A protocol for designing siRNAs with high functionality and specificity. *Nature protocols*, 2(9), pp.2068-78.

Black, D.L., 2003. Mechanisms of alternative pre-messenger RNA splicing. *Annual review of biochemistry*, 72, pp.291-336.

Czauderna, F. et al., 2003. Structural variations and stabilising modifications of synthetic siRNAs in mammalian cells. *Nucleic acids research*, 31(11), pp.2705-16.

Davis, M.E. et al., 2010. Evidence of RNAi in humans from systemically administered siRNA via targeted nanoparticles. *Nature*, 464(7291), pp.1067-70.

Dorsett, Y. & Tuschl, T., 2004. siRNAs: applications in functional genomics and potential as therapeutics. *Nature reviews. Drug discovery*, 3(4), pp.318-29.

Elbashir, S.M., Lendeckel, W. & Tuschl, T., 2001. RNA interference is mediated by 21- and 22-nucleotide RNAs. *Genes & development*, 15(2), pp.188-200.

Elbashir, S.M. et al., 2001. Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. *The EMBO journal*, 20(23), pp.6877-88.

Elbashir, S.M. et al., 2002. Analysis of gene function in somatic mammalian cells using small interfering RNAs. *Methods (San Diego, Calif.)*, 26(2), pp.199-213.

Hofacker, I.L. & Tafer, H., 2010. Designing optimal siRNA based on target site accessibility. *Methods in molecular biology (Clifton, N.J.)*, 623, pp.137-54.

Hsieh, A.C. et al., 2004. A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens. *Nucleic acids research*, 32(3), pp.893-901.

Huesken, D. et al., 2006. Design of a genome-wide siRNA library using an artificial neural network. *Nature Biotechnology*, 23(8), pp.995-1002.


Hutvágner, G. & Zamore, P.D., 2002. A microRNA in a multiple-turnover RNAi enzyme complex. *Science (New York, N.Y.)*, 297(5589), pp.2056-60.

Ichihara, M. et al., 2007. Thermodynamic instability of siRNA duplex is a prerequisite for dependable prediction of siRNA activities. *Nucleic Acids Research*, pp.1-10.

Jackson, A.L. et al., 2003. Expression profiling reveals off-target gene regulation by RNAi. *Nature biotechnology*, 21(6), pp.635-7.

Jackson, A.L. & Linsley, P.S., 2010. Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. *Nature reviews. Drug discovery*, 9(1), pp.57-67.

Katoh, T. & Suzuki, T., 2007. Specific residues at every third position of siRNA shape its efficient RNAi activity. *Nucleic acids research*, 35(4), p.e27.

Kaufmann, S.H.E. & Patzel, V., 2008. Structures of Active Guide RNA Molecules and Method of Selection.

Ladunga, I., 2007. More complete gene silencing by fewer siRNAs: transparent optimized design and biophysical signature. *Nucleic acids research*, 35(2), pp.433-40.

Lorenz, C. et al., 2004. Steroid and lipid conjugates of siRNAs to enhance cellular uptake and gene silencing in liver cells. *Bioorganic & medicinal chemistry letters*, 14(19), pp.4975-7.

Matveeva, O. et al., 2007. Comparison of approaches for rational siRNA design leading to a new efficient and transparent method. *Nucleic acids research*, 35(8), p.e63.

Mysara, M., 2010. *MysiRNA: Automation of siRNA Design Considering Multi-score Filtration*.

Mysara, M. et al., 2011. MysiRNA: Improving siRNA Efficacy Prediction Using a Machine-Learning Model Combining Multi-tools and Whole Stacking Energy (∆G). *Journal of Biomedical Informatics*, pp.1-23.

Mysara, M., Garibaldi, J. & Elhefnawi, M., 2011. MysiRNA-Designer: a Workflow for Efficient siRNA Design. *PLoS One*, pp.1-14.

Mückstein, U. et al., 2006. Thermodynamics of RNA-RNA binding. *Bioinformatics (Oxford, England)*, 22(10), pp.1177-82.

Park, Y.-K. et al., 2008. AsiDesigner: exon-based siRNA design server considering alternative splicing. *Nucleic acids research*, 36(Web Server issue), pp.W97-103.

Patzel, V., 2007. In silico selection of active siRNA. *Drug Discovery Today*, 12(3-4), pp.139-48.

Patzel, V. et al., 2005. Design of siRNAs producing unstructured guide-RNAs results in improved RNA interference efficiency. *Nature biotechnology*, 23(11), pp.1440-4.

Ren, Y. et al., 2006. siRecords: an extensive database of mammalian siRNAs with efficacy ratings. *Bioinformatics*, 22(8), pp.1027-8.

Reynolds, A. et al., 2004. Rational siRNA design for RNA interference. *Nature biotechnology*, 22(3), pp.326-30.

Schwarz, D.S. et al., 2003. Asymmetry in the Assembly of the RNAi Enzyme Complex. *Cell*, 115(2), pp.199-208.

Semizarov, D. et al., 2003. Specificity of short interfering RNA determined through gene expression signatures. *Proceedings of the National Academy of Sciences of the United States of America*, 100(11), pp.6347-52.

Shabalina, S.A., Spiridonov, A.N. & Ogurtsov, A.Y., 2006. Computational models with thermodynamic and composition features improve siRNA design. *BMC bioinformatics*, 7(1), p.65.

Stark, G.R. et al., 1998. How cells respond to interferons. *Annual review of biochemistry*, 67, pp.227-64.

Surabhi, R.M. & Gaynor, R.B., 2002. RNA interference directed against viral and cellular targets inhibits human immunodeficiency Virus Type 1 replication. *Journal of virology*, 76(24), pp.12963-73.

Tafer, H. et al., 2008. The impact of target site accessibility on the design of effective siRNAs. *Nature biotechnology*, 26(5), pp.578-83.

Takasaki, S., Kotani, S. & Konagaya, A., 2004. An effective method for selecting siRNA target sequences in mammalian cells. *Cell cycle (Georgetown, Tex.)*, 3(6), pp.790-5.

Ui-Tei, K. et al., 2004. *Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference.*

Ullu, E. et al., 2002. RNA interference: advances and questions. *Philosophical transactions of the Royal Society of London. Series B, Biological sciences*, 357(1417), pp.65-70.

Vert, J.-P. et al., 2006. An accurate and interpretable model for siRNA efficacy prediction. *BMC bioinformatics*, 7(1), p.520.

Xia, H. et al., 2004. RNAi suppresses polyglutamine-induced neurodegeneration in a model of spinocerebellar ataxia. *Nature medicine*, 10(8), pp.816-20.

ElHefnawi, M., Alaidi, O. et al., 2011. Identification of novel conserved functional motifs across most Influenza A viral strains. *Virol J*, 8, p.44.

ElHefnawi, M., Siam, R., Hassan, N., Kamar, M., Sgarbanti, M., Rimoli, A., El-Azab, I., AlAidy, O. & Marsili, G., 2011. The design of optimal therapeutic small interfering RNA molecules targeting diverse strains of influenza A virus. *Bioinformatics*, OUP, under revision.




ElHefnawi, M., Kim, T., Kamar, M.A., Hassan, N.M., El-Azab, I.A., Zada, S. & Windisch, M.P., 2011. Novel design and selection of efficient specific universal small interfering RNA molecules tested in hepatitis C virus replicon cell lines. *PLoS ONE*, submitted.

### **MicroRNA Targeting in Heart: A Theoretical Analysis**

Zhiguo Wang

*Research Centre, Montreal Heart Institute, Montreal, and Department of Medicine, Universite de Montreal, Montreal, Canada* 

#### **1. Introduction**

Cardiovascular disease remains the major cause of morbidity and mortality. Heart failure, the syndrome that results from many diseases of the cardiovascular system, is estimated to have a prevalence of 1–2% and an annual incidence of 5–10 per 1,000 in developed countries, where it is the leading cause of hospitalization in the population over 50 years of age. Worse still, the prevalence of cardiovascular disease is clearly increasing worldwide, particularly in developing nations. This problem poses an enormous health concern and a costly socioeconomic burden.

We entered the post-genome era with the completion of the human genome project years ago. Less than 2% of all transcribed bases of the human genome encode proteins; the remaining 98%, accounting for ~70% of all genes, carry sequences for RNAs that do not encode a polypeptide chain. For many years these sequences were considered "junk DNA" with no physiological function, and proteins were generally assumed to be the sole biopolymers capable of regulatory function. Intriguingly, the proportion of transcribed non-protein-coding sequences increases with developmental complexity and is a better indicator of phylogenetic level than the number of protein-coding genes of an organism. It is now known that "junk DNA" encodes non-protein-coding RNAs (ncRNAs) that help determine the expression of protein-coding genes by regulating the activity of that 2% of the genome. These ncRNAs include microRNAs (miRNAs), which were discovered over a decade ago and were once ignored or overlooked as cellular detritus, but have recently taken many by surprise because of their widespread expression and diverse functions.

miRNAs are endogenous small RNAs of ~22 nucleotides in length that act primarily to repress gene expression at the post-transcriptional level. To date, ~6400 mature vertebrate miRNAs have been registered in miRBase, an online repository for miRNAs; ~5100 of them are found in mammals, including >850 human miRNAs. These miRNAs are predicted to regulate >60% of protein-coding genes. The discovery of miRNAs challenges the central dogma of molecular biology, which has held since the latter half of the 20th century. With the recent rapid evolution of miRNA research, researchers have begun to appreciate the roles of these small non-protein-coding RNAs in the cardiovascular system. Based on published studies, it is now clear that miRNAs are involved in nearly all aspects of cardiac function and pathogenesis (Wang, 2010; Wang et al., 2008; Yang et al., 2008).

our technologies and approaches to conduct thorough characterization of miRNA targeting. To work around these limitations, we conducted a rationally designed bioinformatics analysis in conjunction with experimental approaches to identify the miRNAs with the potential to regulate human cardiac ion channel genes, and validated the analysis in several pathological settings associated with deregulated miRNAs and ion channel genes in the heart (Luo et al., 2010). In this way, we identified an array of miRNAs that are expressed in cardiac cells and have the potential to regulate the genes encoding cardiac ion channels, transporters and intracellular Ca2+ handling proteins. Our data explain, at the level of miRNA, the ionic remodelling processes occurring in hypertrophy/heart failure, myocardial ischemia and atrial fibrillation; the changes in miRNAs appear to be anti-correlated with the changes in many of the genes encoding cardiac ion channels under these pathological conditions.

#### **2.1 *In silico* analysis of miRNA targets**

We used the miRecords miRNA database and target-prediction website for our initial analysis. miRecords is a resource for animal miRNA-target interactions developed at the University of Minnesota (Xiao et al., 2009). It consists of two separate databases. The Validated Targets database contains experimentally validated miRNA targets and is updated through meticulous literature curation. The Predicted Targets database integrates the miRNA targets predicted by 11 established miRNA target prediction programs: DIANA-microT, MicroInspector, miRanda, MirTarget2, miTarget, NBmiRTar, PicTar, PITA, RNA22, RNAhybrid, and TargetScan/TargetScanS.

As an initial "screening" process, we performed miRNA target prediction through the miRecords database (Xiao et al., 2009), which integrates the target predictions of 11 algorithms. Four of the 11 algorithms (MicroInspector, miTarget, NBmiRTar, and RNA22) were removed from our data analysis because they failed to return predictions; their websites require manual input of the 3'UTR sequences of the genes. Our data analysis was therefore based on the predictions of seven algorithms (TargetScan, DIANA-microT 3.0, miRanda, PicTar, PITA, RNAhybrid, and MirTarget2) (Enright et al., 2003; Kertesz et al., 2007; Kiriakidou et al., 2004; Krek et al., 2005; Lewis et al., 2003 & 2005; Rehmsmeier et al., 2004; Wang & El Naqa., 2008). These prediction techniques are based on algorithms with different parameters, such as miRNA seed:mRNA 3'UTR complementarity, thermodynamic stability of base-pairing (assessed by free energy), evolutionary conservation across orthologous 3'UTRs in multiple species, structural accessibility of the binding sites, nucleotide composition beyond the seed sequence, number of binding sites in the 3'UTR, and anti-correlation between miRNAs and their target mRNAs. A then-current set of RefSeq genes and their annotations was used to define a set of human 3'UTRs. Orthologous UTRs (based on whole-genome alignments) were obtained for 22 other species from UCSC Genome Bioinformatics. Conservation of each miRNA site was evaluated using the phylogenetic branch lengths of all species containing the site, following the methods of Friedman et al. (2009). For all highly conserved miRNAs, the probability of preferentially conserved targeting was estimated for each site as described (Friedman et al., 2009). Predicted consequential pairing to the 3' end of miRNAs was included only if the raw 3' pairing score (Grimson et al., 2007) was at least 3.0.

It is known that an individual miRNA has the potential to target multiple protein-coding genes and, *vice versa*, that a single protein-coding gene may be regulated by multiple miRNAs, implying that the action of miRNAs is not gene-specific. This fact creates an obstacle to a thorough understanding of miRNA functions and the mechanisms underlying these functions. For example, we have recently found that *miR-125-5p* targets GJA1 (encoding the gap junction channel protein connexin43) and GJC1 (encoding another gap junction channel protein, connexin45) to cause slowing of both ventricular and atrial conduction, promoting arrhythmogenesis in the failing heart, where its expression is upregulated (Wang, 2010). However, based upon computational prediction it can also target SCN5A (encoding the Nav1.5 Na+ channel α-subunit). An immediate question is whether the potential repression of SCN5A also plays a role in *miR-125-5p*-induced conduction slowing. On the other hand, GJA1, GJC1 and SCN5A are predicted to be regulated by other miRNAs in addition to *miR-125-5p* (such as *miR-101, miR-125, miR-130, miR-19, miR-23*, *miR-26* and *miR-30*); whether these miRNAs are also involved in the deregulation of these genes in heart failure remains unknown. The same uncertainty may exist in the interactions between virtually all miRNAs and protein-coding genes. Proper experimental approaches are ultimately required to clarify these issues. At present, however, it is not feasible to elucidate the complete set of target genes of a given miRNA or the complete array of miRNAs that regulate a given protein-coding gene; computational prediction remains the best alternative for rapid identification of miRNA target genes.

This chapter aims to shed light on how to make rational use of bioinformatics analysis to identify, from the currently available miRNA databases, the miRNAs that have the potential to regulate human genes related to cardiac function and pathology, and to validate the analysis against several pathological settings associated with deregulated miRNAs in the heart. The pathological conditions to be discussed include arrhythmogenesis, apoptosis and fibrogenesis, which are known to be critical to the adverse electrical, cellular and structural remodelling processes in various diseased states of the heart such as myocardial infarction, cardiac hypertrophy and heart failure.

#### **2. Ion channel genes as targets for miRNAs**

Cardiac cells are excitable cells that can generate and propagate excitations. At the cellular level, excitability is reflected by the cardiac action potential, which is orchestrated by transmembrane proteins such as ion channels and transporters and by intracellular proteins for Ca2+ handling (Wang, 2010; Wang et al., 2008; Yang et al., 2008). Deregulation of miRNA expression has been implicated in a variety of diseased conditions of the heart, and aberrant expression of miRNAs can deregulate the expression of ion channel genes, resulting in channelopathies and arrhythmogenesis leading to sudden cardiac death. Indeed, we and others have shown the critical involvement of miRNAs, particularly the muscle-specific miRNAs *miR-1* and *miR-133*, in regulating every aspect of cardiac excitability to affect arrhythmogenesis under various pathological conditions including myocardial infarction, cardiac hypertrophy and diabetic cardiomyopathy. This property of miRNAs is conferred by their ability to target ion channels and intracellular Ca2+ handling proteins at the posttranscriptional level, as already revealed by numerous studies. Even so, our current knowledge about miRNA regulation of cardiac ion channels is still rather preliminary, with limited experimental data available in the literature. This is largely due to the limitations of our technologies and approaches for thorough characterization of miRNA targeting. To work around these limitations, we have conducted a rationally designed bioinformatics analysis in conjunction with experimental approaches to identify the miRNAs that have the potential to regulate human cardiac ion channel genes, and to validate the analysis against several pathological settings associated with deregulated miRNAs and ion channel genes in the heart (Luo et al., 2010). In this way, we have been able to identify an array of miRNAs that are expressed in cardiac cells and have the potential to regulate the genes encoding cardiac ion channels, transporters and intracellular Ca2+ handling proteins. Our data explain well, at the level of miRNA, the ionic remodelling processes occurring in hypertrophy/heart failure, myocardial ischemia and atrial fibrillation; the changes of miRNAs appear to be anti-correlated with the changes of many of the genes encoding cardiac ion channels under these pathological conditions.

#### **2.1** *In Silico* **analysis of miRNA targets**

We used the miRecords miRNA database and target-prediction website for our initial analysis. miRecords is a resource for animal miRNA-target interactions developed at the University of Minnesota (Xiao et al., 2009). It consists of two separate databases. The Validated Targets database contains experimentally validated miRNA targets and is updated through meticulous literature curation. The Predicted Targets database is an integration of predicted miRNA targets produced by 11 established miRNA target prediction programs: DIANA-microT, MicroInspector, miRanda, MirTarget2, miTarget, NBmiRTar, PicTar, PITA, RNA22, RNAhybrid, and TargetScan/TargetScanS.

As an initial "screening" process, we performed miRNA target prediction through the miRecords database (Xiao et al., 2009), which integrates miRNA target predictions by 11 algorithms. Four of the 11 algorithms (MicroInspector, miTarget, NBmiRTar, and RNA22) were removed from our data analysis because they failed to return predictions; their websites require manual input of the 3'UTR sequences of the genes. Thus, our data analysis was based upon the predictions from seven algorithms (TargetScan, DIANA-microT 3.0, miRanda, PicTar, PITA, RNAhybrid, and MirTarget2) (Enright et al., 2003; Kertesz et al., 2007; Kiriakidou et al., 2004; Krek et al., 2005; Lewis et al., 2003, 2005; Rehmsmeier et al., 2004; Wang & El Naqa, 2008). These prediction techniques are based on algorithms with different parameters, such as miRNA seed:mRNA 3'UTR complementarity, thermodynamic stability of base-pairing (assessed by free energy), evolutionary conservation across orthologous 3'UTRs in multiple species, structural accessibility of the binding sites, nucleotide composition beyond the seed sequence, number of binding sites in the 3'UTR, and anti-correlation between miRNAs and their target mRNAs. A then-updated set of RefSeq genes and their annotations was used to define a set of human 3'UTRs. Orthologous UTRs (based on whole-genome alignments) were obtained for 22 other species from UCSC Genome Bioinformatics. Conservation of each miRNA site was evaluated using the phylogenetic branch lengths of all species containing the site, based on the methods of Friedman et al. (2009). For all highly conserved miRNAs, the probability of preferentially conserved targeting for each site was estimated as described (Friedman et al., 2009). Predicted consequential pairing to the 3' end of miRNAs was included only if the raw 3' pairing score (Grimson et al., 2007) was at least 3.0.
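The screening logic of this analysis, as laid out in the flow chart of Fig. 1, keeps a miRNA-gene pair only when at least four of the seven algorithms predict it, and then restricts the result to miRNAs expressed in the heart. The following is a minimal sketch of that filtering, not the procedure actually run through miRecords. The function names, the data layout, and the placeholder miRNA and gene names (`miR-A`, `GENE1`, and so on) are all assumptions made for illustration; none of the pairs shown is a result from this chapter.

```python
from collections import defaultdict

# Hypothetical per-algorithm predictions: algorithm name -> set of (miRNA, gene) pairs.
# Placeholder names only; these are NOT predictions reported in the chapter.
predictions = {
    "TargetScan": {("miR-A", "GENE1"), ("miR-B", "GENE2"), ("miR-C", "GENE3")},
    "DIANA-microT": {("miR-A", "GENE1"), ("miR-B", "GENE2")},
    "miRanda": {("miR-A", "GENE1"), ("miR-B", "GENE2"), ("miR-C", "GENE3")},
    "PicTar": {("miR-A", "GENE1"), ("miR-C", "GENE3")},
    "PITA": {("miR-B", "GENE2")},
    "RNAhybrid": {("miR-A", "GENE1"), ("miR-B", "GENE2")},
    "MirTarget2": set(),
}


def consensus_pairs(predictions, min_votes=4):
    """Keep the (miRNA, gene) pairs predicted by at least `min_votes` algorithms."""
    votes = defaultdict(int)
    for pairs in predictions.values():
        for pair in pairs:
            votes[pair] += 1
    return {pair for pair, count in votes.items() if count >= min_votes}


def restrict_to_expressed(pairs, expressed_mirnas):
    """Keep only the pairs whose miRNA is in the set of expressed miRNAs."""
    return {(mirna, gene) for mirna, gene in pairs if mirna in expressed_mirnas}


# "First Dataset": consensus of at least 4 of the 7 algorithms.
first_dataset = consensus_pairs(predictions, min_votes=4)

# "Second Dataset": limited to miRNAs assumed (hypothetically) to be expressed in the heart.
second_dataset = restrict_to_expressed(first_dataset, {"miR-A", "miR-C"})
```

In this toy input, the `miR-C` pair is predicted by only three algorithms and so never enters the first dataset, while the `miR-B` pair passes the vote but is dropped by the expression filter.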

MicroRNA Targeting in Heart: A Theoretical Analysis 543

Fig. 1. Flow chart illustrating the procedures of our analysis. The First Dataset includes all miRNAs in the then-updated database that have the potential to target at least one ion channel-coding gene as predicted by at least 4 of the 7 algorithms. The Second Dataset limits the miRNAs from the First Dataset to only those that are expressed in cardiac cells. These miRNAs represent those that most likely play a role in regulating ion channel expression in the heart. The Third Datasets are the lists of miRNAs that have been shown to be deregulated in the pathological conditions under analysis, e.g. myocardial infarction, heart failure, etc. These miRNAs represent those that likely play an important role in defining ion channel expression under a given diseased state.

Each of the seven algorithms provides a unique dataset. Some of the algorithms have higher sensitivity of prediction but lower accuracy, whereas the others favour accuracy at the expense of sensitivity. We collected all miRNAs predicted by at least four of the seven algorithms to have the potential to target any one of the selected cardiac ion channel and ion transporter genes. Meanwhile, we also collected all ion channel and ion transporter genes that contain potential target site(s) (binding site(s) with favourable free energy profiles) for at least one of the then-registered 718 human miRNAs in the miRNA database (miRBase).

Expression of miRNAs is tightly controlled by the genetic programme to ensure certain spatial (depending on cell, tissue, or organ type) and temporal (depending on developmental stage) patterns. The expression profile under a defined condition is considered the miRNA expression signature or miRNA transcriptome of a particular tissue. One way to minimize the possibility of false positive predictions and to narrow down the list of putative miRNA targets is to compare the *in silico* target predictions with the miRNA transcriptome signatures in the biological system of interest. We therefore conducted a miRNA microarray analysis covering all 718 human miRNAs for their expression in left ventricular tissues of five healthy human individuals. Using this set of cardiac miRNA expression profiling data in conjunction with published data obtained by real-time RT-PCR by Liang et al. (2007), we refined the miRNA-target prediction by filtering out the miRNAs that are not expressed in the heart and focusing on the top 20 abundant miRNAs in human heart (*miR-1, miR-133a/b, miR-16, miR-100, miR-125a/b, miR-126, miR-145, miR-195, miR-199\*, miR-20a/b, miR-21, miR-26a/b, miR-24, miR-23, miR-29a/b, miR-27a/b, miR-30a/b/c, miR-92a/b, miR-99, and let-7a/c/f/g*). In this way, we generated the modified datasets for subsequent analyses and obtained an overall picture of the control of expression of ion channel genes by miRNAs in the heart under normal conditions.

Next, we applied the theoretical prediction to explain some established observations of electrical remodelling related to deregulation of both miRNAs and the genes for ion channels and transporters. Three pathological conditions, cardiac hypertrophy/heart failure, ischemic myocardial injuries, and atrial fibrillation, were studied based on the expression profiles and participation of miRNAs in these conditions as previously reported. The results are presented in the section following the next.

#### **2.2 Control of expression of ion channel genes by miRNAs under normal conditions**

The above analyses allow us to reach the following notes.

1. One hundred ninety-three out of 718 human miRNAs or out of 222 miRNAs expressed in the heart have the potential to target the genes encoding human cardiac ion channels and transporters.

2. Only two genes, CLCN2 and KCNE2, were predicted not to contain a target site for miRNAs expressed in the heart.

3. It appears that the most fundamental and critical ion channels governing cardiac excitability have the largest numbers of miRNAs as their regulators. These include SCN5A for *I*Na (responsible for the upstroke of the cardiac action potential and thereby the conduction of excitations), CACNA1C/CACNB2 for *I*Ca,L (accounting for the characteristic long plateau of the cardiac action potential and excitation-contraction coupling), KCNJ2 for *I*K1 (which sets and maintains the cardiac membrane potential), SLC8A1 for NCX1 (an antiporter membrane protein which removes Ca2+ from cells), GJA1/GJC1 (gap junction channels responsible for intercellular conduction of excitation), and ATP1B1 for the Na+/K+ pump (establishing and maintaining the normal electrochemical gradients of Na+ and K+ across the plasma membrane). Each of these genes is theoretically regulated by >30 miRNAs.

4. The atrium-specific ion channels, including Kir3.4 for *I*KACh, Kv1.5 for *I*Kur, and CACNA1G for *I*Ca,T, seem to be rare targets for miRNAs (<5 miRNAs).

5. All four genes for K+ channel auxiliary β-subunits, KCNE1, KCNE2, KCHiP, and KCNAB2, were also found to have fewer regulator miRNAs (<10).

6. Intriguingly, 16 of these top 20 miRNAs are included in the list of the predicted miRNA-target dataset; the other four cardiac-abundant miRNAs, miR-21, miR-99, miR-100 and miR-126, are predicted to be unable to regulate the genes for human cardiac ion channels and transporters.

7. There is a rough correlation between the number of predicted targets and the abundance of miRNAs in the heart. It appears that the top 8 miRNAs are set apart from the remaining 12 less abundant miRNAs by their number of target genes. The muscle-specific miRNA miR-1 was predicted to have the largest number of target genes (9 genes) among all miRNAs most abundantly expressed in the heart, followed by miR-30a/b/c, miR-24 and miR-125a/b, which have 6 target genes each. The muscle-specific miRNA miR-133 has four target genes, and three of them (KCNH2, KCNQ1 and HCN2) have been experimentally verified (Luo et al., 2008; J Xiao et al., 2007; L Xiao et al., 2008).


8. Comparison of the target genes of the three muscle-specific miRNAs miR-1, miR-133 and miR-208 revealed that they might play different roles in regulating cardiac excitability. It appears that miR-1 may be involved in all aspects of cardiac excitability: cardiac conduction by targeting GJA1 and KCNJ2, cardiac automaticity by targeting HCN2 and HCN4, cardiac repolarization by targeting KCNA5, KCND2 and KCNE1, and Ca2+ handling by targeting SLC8A1. By comparison, miR-133a/b mainly controls cardiac repolarization through targeting KCNH2 (encoding HERG/*I*Kr) and KCNQ1 (encoding KvLQT1/*I*Ks), the two major repolarizing K+ channels in the heart. miR-208 was predicted to target only KCNJ2 (encoding Kir2.1 for *I*K1). The non-muscle-specific let-7 seed family members seem to regulate mainly cardiac conduction by targeting SCN5A (Nav1.5 for intracellular conduction) and GJC1 (Cx45 for intercellular conduction). miR-30a/b/c, miR-26a/b, miR-125a/b, miR-16, and miR-27a/b were predicted to be L-type Ca2+ channel "blockers" through repressing the α1c- and/or β1/β2-subunits.

#### **2.3 Application of our bioinformatics analysis to heart failure**

The mechanisms for arrhythmogenesis in the failing heart involve (Nattel et al., 2007): (1) abnormalities in spontaneous pacemaking function (enhanced cardiac automaticity), in which increases in atrial and ventricular *I*f due to increased expression of the HCN4 channel may contribute to ectopic beat formation in CHF; (2) slowing of cardiac repolarization and thereby prolongation of APD due to reductions of repolarizing K+ currents (including *I*K1, *I*Ks, and *I*to1), which provides the condition for the occurrence of early afterdepolarizations (EADs) leading to triggered activities; (3) delayed afterdepolarizations (DADs) due to enhanced Na+-Ca2+ exchanger (NCX1) activity, a consistent finding of numerous studies in cardiac hypertrophy/CHF, with upregulation of NCX1 expression being the major cause of the enhancement; and (4) re-entrant activity due to slowing of cardiac conduction velocity.

To date, there have been seven published studies on the role of miRNAs in cardiac hypertrophy (Carè et al., 2007; Cheng et al., 2007; Sayed et al., 2008; Tatsuguchi et al., 2007; Thum et al., 2007; van Rooij et al., 2006, 2007). The common finding of these studies is that an array of miRNAs is significantly altered in expression, either up- or downregulated, and that single miRNAs can critically determine the generation and progression of cardiac hypertrophy. The most consistent changes reported by these studies are upregulation of miR-21 (6 of 6 studies), miR-23a (4 of 6), miR-125b (5 of 6), miR-214 (4 of 6), miR-24 (3 of 6), miR-29 (3 of 6) and miR-195 (3 of 6), and downregulation of miR-1, miR-133, miR-150 (5 of 6 studies) and miR-30 (5 of 6). These miRNAs were therefore included in our analysis of target genes encoding ion channel and transporter proteins. Our analyses suggest the following.

1. It is known that hypertrophic and failing cardiac myocytes are characterized by re-expression of the funny current (or pacemaker current) *I*f, which is carried by the HCN2 channel in cardiac muscle and may underlie the increased risk of arrhythmogenesis (Luo et al., 2008). We have previously verified that downregulation of miR-1 and miR-133 caused upregulation of HCN2 in cardiac hypertrophy (Luo et al., 2008). This may contribute to the enhanced abnormal cardiac automaticity and the associated arrhythmias in CHF.

2. NCX1 is upregulated in cardiac hypertrophy, ischemia, and failure. This upregulation can affect Ca2+ transients and possibly contribute to diastolic dysfunction and an increased risk of arrhythmias (Flesch et al., 1996; Nattel et al., 2007; Pogwizd & Bers, 2002). Our target prediction indicates that SLC8A1, the gene encoding the NCX1 protein, is a potential target for both miR-1 and miR-30a/b/c. The downregulation of miR-1 and miR-30a/b/c in hypertrophy/failure is expected to relieve the repression of SLC8A1/NCX1, since a strong tonic repression by miR-1 and miR-30a/b/c is anticipated given the high abundance of these miRNAs. On the other hand, upregulation of miR-214 tends to repress NCX1, but the expression level of miR-214 is far below those of miR-1 and miR-30a/b/c, so its offsetting effect should be minimal. Our prediction thus provides a plausible explanation for the upregulation of NCX1 through the miRNA mechanism.

3. A variety of Na+ channel abnormalities have been demonstrated in heart failure. Several studies suggest that peak *I*Na is reduced, which can cause slowing of cardiac conduction and promote re-entrant arrhythmias (Zicha et al., 2004). It has been speculated that post-transcriptional reduction of the cardiac *I*Na α-subunit protein Nav1.5 may account for the reduction of peak *I*Na. In this study, we found that the only miRNA that can target Nav1.5 and is upregulated in cardiac hypertrophy/CHF is miR-125a/b. As miR-125a/b is abundantly expressed, its upregulation could well result in repression of SCN5A/Nav1.5.

4. The gap junction channel proteins connexin43, connexin45 and connexin40 are important for cell-to-cell propagation of excitations. Downregulation of connexin43 expression is associated with an increased likelihood of ventricular tachyarrhythmias in heart failure (Kitamura et al., 2002). Other connexins, including connexin45 (Yamada et al., 2003) and connexin40 (Dupont et al., 2001), are upregulated in failing hearts, possibly as a compensation for connexin43 downregulation. Our analysis indicates that the upregulation of miR-125a/b and miR-23a/b should produce repression of connexin43 and connexin45, and the downregulation of miR-1, miR-30a/b/c and miR-150 should do the opposite. These two opposing effects may cancel each other out.

5. Prolongation of ventricular APD is typical of heart failure and serves to improve contraction strength, thereby supporting the weakened heart. However, APD prolongation consequent to decreases in several repolarizing K+ currents (*I*to1, *I*Ks, and *I*K1) in the failing heart often results in the occurrence of early afterdepolarizations (EADs) (Beuckelmann et al., 1993; Tsuji et al., 2000). Our prediction failed to provide any explanation at the miRNA level: none of the upregulated miRNAs is predicted to regulate the genes encoding repolarizing K+ channels. On the contrary, downregulation of miR-1 and miR-133 predicts upregulation of KCNE1/minK and KCNQ1/KvLQT1, respectively.

6. A majority of published studies showed a decrease in *I*K1 in ventricular myocytes of failing hearts (Beuckelmann et al., 1993; Rose et al., 2005). But whether KCNJ2/Kir2.1, the major subunit underlying *I*K1, is downregulated remained controversial in previous studies, and the mechanisms remained obscure. One study noted decreased KCNJ2 mRNA expression but an unaltered Kir2.1 protein level (Rose et al., 2005). In our prediction, the upregulated miRNAs (miR-125, miR-214, miR-24, miR-29, and miR-195) predict a reduction of inward rectifier K+ channel subunits including KCNJ2/Kir2.1, KCNJ12/Kir2.2, KCNJ14/Kir2.4, and KCNK1/TWIK1, whereas the downregulated miRNAs (miR-1 and miR-30a/b/c) predict an increase in KCNJ2/Kir2.1.
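The reasoning used in these notes weighs the derepression expected from downregulated miRNAs against the repression expected from upregulated ones, taking the relative abundance of each miRNA into account. A minimal sketch of that bookkeeping follows. The function name and the scoring rule are assumptions made for illustration, and the abundance weights are hypothetical numbers chosen only to mirror the qualitative statement that miR-1 and miR-30a/b/c are far more abundant than miR-214; they are not measured values.

```python
def predicted_net_effect(changes, abundance):
    """Qualitative direction of change predicted for one target gene.

    changes:   miRNA name -> +1 if upregulated, -1 if downregulated.
    abundance: miRNA name -> relative abundance weight (hypothetical values).

    A miRNA represses its target, so an upregulated miRNA pushes target
    expression down and a downregulated miRNA pushes it up.
    """
    score = sum(-direction * abundance[mirna] for mirna, direction in changes.items())
    if score > 0:
        return "increase"
    if score < 0:
        return "decrease"
    return "no net change"


# Directions for SLC8A1/NCX1 as stated in the text; the weights are illustrative only.
slc8a1_changes = {"miR-1": -1, "miR-30a/b/c": -1, "miR-214": +1}
hypothetical_abundance = {"miR-1": 10.0, "miR-30a/b/c": 5.0, "miR-214": 1.0}

ncx1_prediction = predicted_net_effect(slc8a1_changes, hypothetical_abundance)
```

With these illustrative weights the sketch returns an increase for SLC8A1/NCX1, in line with the argument in the text. Opposing changes of equal weight return no net change, the situation described for the connexins.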


In summary, our analysis of target genes for deregulated miRNAs in hypertrophy/CHF may explain, at least partly, the enhanced cardiac automaticity (relief of HCN2 repression and increased NCX1 expression) and reduced cardiac conduction (repression of Nav1.5). However, the data suggest that miRNAs are hardly involved in the abnormality of cardiac repolarization in cardiac hypertrophy and heart failure, since the genes for the repolarizing K+ channels were not predicted as targets for the upregulated miRNAs. The prediction of NCX1 upregulation as a result of derepression from miRNAs may be of particular importance, as aberrantly enhanced NCX1 activity has also been noticed in atrial fibrillation occurring in CHF.

#### **2.4 Application of our bioinformatics analysis to myocardial infarction (MI)**

MI is manifested as cascades of electrical abnormalities and even lethal arrhythmias as a result of deleterious alterations of gene expression outweighing adaptive changes (Carmeliet, 1999). Ischemic myocardium demonstrates characteristic sequential alterations in electrophysiology, with an initial shortening of APD and QT interval during the early phase (<15 min) of acute ischemia and subsequent lengthening of APD/QT after a prolonged ischemic period and in chronic myocardial ischemia (Carmeliet, 1999; Nattel et al., 2007). Several original studies have explored whether miRNAs could be involved in this remodelling process. We first identified upregulation of *miR-1* in acute myocardial infarction and the ischemic arrhythmias caused by this deregulation of *miR-1* expression (Yang et al., 2007). Subsequently, miRNA expression profiles in the setting of myocardial ischemia/reperfusion injuries were reported by four groups (Dong et al., 2009; Luo et al., 2010; Ren et al., 2009; Roy et al., 2009).

Based on these published data, we excluded miRNAs that were found to be deregulated in one study but not in others, as well as those that were deregulated in rat heart but are not expressed in human heart. In this way, we identified an array of miRNAs that are likely deregulated in the setting of myocardial ischemia. The MI-upregulated miRNAs include *miR-1, miR-23, miR-29, miR-20, miR-30, miR-146b-5p, miR-193, miR-378, miR-181, miR-491-3p, miR-106*, *miR-199b-5p*, and *let-7f*; the downregulated miRNAs include *miR-320*, *miR-185*, *miR-324-3p*, and *miR-214*. It is interesting to note that some of the miRNAs change in opposite directions in ischemic myocardium and in hypertrophic hearts. For example, *miR-1*, *let-7*, *miR-181b*, *miR-29a* and *miR-30a/e* are upregulated in ischemic myocardium, but downregulated in hypertrophy. Similarly, *miR-214*, *miR-320* and *miR-351* are downregulated in ischemic myocardium, but upregulated in hypertrophy. This further reinforces the notion that different pathological conditions are associated with different expression profiles, or miRNA signatures. Our analysis yielded the following notions.

1. Six upregulated miRNAs (*miR-1, miR-29, miR-20, miR-30, miR-193* and *miR-181*) were predicted to target several Kir subunits (KCNJ2, KCNJ12, KCNJ, and KCNK1), but none of the downregulated miRNAs can target these genes (Fig. 4). This is in line with the previous finding that *I*K1 is reduced and the membrane is depolarized in ischemic myocardium (Carmeliet, 1999; Nattel et al., 2007; Yang et al., 2007).

2. The cardiac slow delayed rectifier K+ current (*I*Ks) is carried by co-assembly of an α-subunit KvLQT1 (encoded by KCNQ1) and a β-subunit minK (encoded by KCNE1). Loss-of-function mutation of either KCNQ1 or KCNE1 can cause long QT syndromes, indicating the importance of *I*Ks in cardiac repolarization. In ischemic myocardium, persistent decreases in minK with normalized KvLQT1 protein expression have been observed, which may underlie unusual delayed rectifier currents with very rapid activation (Sanguinetti et al., 1996; Dun & Boyden, 2005), resembling currents produced by the expression of KvLQT1 in the absence of minK. We have experimentally established KCNE1 as a target for *miR-1* repression (Luo et al., 2007), which was also predicted in the present analysis. Moreover, no other miRNAs were predicted to target KCNQ1. This finding is consistent with the observed decrease of minK alone, without changes in KvLQT1, in ischemic myocardium.

3. It has been observed that cells in the surviving peri-infarct zone have discontinuous propagation due to abnormal cell-to-cell coupling (Gardner et al., 1985; Peters, 1995; Spear et al., 1983). This is largely due to decreased expression and redistribution of the gap junction proteins, connexins (Cxs). In this study, seven out of 12 upregulated miRNAs were predicted to target Cxs including GJA1/Cx43, GJC1/Cx45, and GJA5/Cx40, but only one downregulated miRNA, *miR-185*, may regulate GJA5/Cx40. This result points to a role for miRNAs in impairing cardiac conduction in ischemic myocardium. Indeed, our previous study experimentally verified that repression of GJA1/Cx43 slows conduction and induces arrhythmias in acute myocardial infarction (Yang et al., 2007).

4. In ischemic myocardium, fast or peak sodium current (*I*Na) density is reduced, which may also account partly for the conduction slowing and the associated re-entrant arrhythmias (Friedman et al., 1975; Pu & Boyden, 1997; Spear et al., 1983). Our analysis showed that *let-7f* and *miR-378* may target SCN5A/Nav1.5, and upregulation of these miRNAs is anticipated to reduce *I*Na by downregulating SCN5A/Nav1.5 in myocardial infarction. By comparison, none of the downregulated miRNAs may repress SCN5A/Nav1.5 based on our target prediction.

5. Transient outward K+ current (*I*to1) is reduced in myocardial ischemia and, in rats, *I*to1 decreases correlate most closely with downregulation of KCND2-encoded Kv4.2 subunits. *miR-1* is predicted to repress KCND2/Kv4.2, and *miR-29* may target KChIP2, which is known to be critical in the formation of *I*to1.

6. L-type Ca2+ current (*I*Ca,L) is diminished in border-zone cells of dogs. The *miR-30* family has the potential to target CACNA1C/Cav1.2 and CACNB2/Cavβ2, and *miR-124, miR-181, miR-320* and *miR-204* to target CACNB2. Upregulation of *miR-30*, *miR-124* and *miR-181* would therefore decrease CACNA1C/Cav1.2 and CACNB2/Cavβ2 expression, whereas downregulation of *miR-320* and *miR-204* tends to increase the expression of these genes. Considering the relative abundance of these miRNAs, it seems that the decreasing force outweighs the increasing force, with a balance towards a net inhibition of *I*Ca,L.

7. Na+/K+ ATPase is a sarcolemmal ATP-dependent enzyme transporter that moves three intracellular Na+ ions to the extracellular compartment and two extracellular K+ ions into the cell. It thereby maintains the physiological Na+ and K+ concentration gradients needed not only for generating the rapid upstroke of the action potential but also for driving a number of ion-exchange and transport processes crucial for normal cellular function, ion homeostasis and the control of cell volume. It is electrogenic, producing a small outward current *I*P. We noticed that the ischemia-induced upregulation of *miR-29* and *miR-181* expression might inhibit Na+/K+ ATPase activity, as they possibly target the ATP1B1 β-subunit of the enzyme. This may contribute to the electrical and contractile dysfunction in the ischemic/reperfused myocardium due to the ischemia-induced inhibition of the Na+/K+ ATPase and the failure of intracellular Na+ to recover completely on reperfusion (Fuller et al., 2003).
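The filtering step described at the start of this section (drop miRNAs reported as deregulated by only one study, and drop those not expressed in human heart) can be sketched as follows. This is a minimal illustration under stated assumptions: the study labels, their contents, the names `miR-X` and `miR-Y`, and the `expressed_in_human_heart` set are made-up placeholders, and `consistent_mirnas` is a hypothetical helper rather than the procedure actually used.

```python
from collections import defaultdict


def consistent_mirnas(reports, expressed_in_human_heart, min_studies=2):
    """Keep miRNAs reported with a single, consistent direction by at least
    `min_studies` studies and expressed in human heart.

    reports: study name -> {miRNA: "up" or "down"}
    Returns {miRNA: direction}.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for findings in reports.values():
        for mirna, direction in findings.items():
            votes[mirna][direction] += 1

    kept = {}
    for mirna, by_direction in votes.items():
        # Skip miRNAs absent from human heart or reported in both directions.
        if mirna not in expressed_in_human_heart or len(by_direction) != 1:
            continue
        (direction, count), = by_direction.items()
        if count >= min_studies:
            kept[mirna] = direction
    return kept


# Placeholder inputs (assumptions, not data from the cited studies).
reports = {
    "study_A": {"miR-1": "up", "miR-320": "down", "miR-X": "up"},
    "study_B": {"miR-1": "up", "miR-320": "down", "miR-Y": "down"},
    "study_C": {"miR-1": "up", "miR-Y": "down"},
}
expressed_in_human_heart = {"miR-1", "miR-320", "miR-X"}

selected = consistent_mirnas(reports, expressed_in_human_heart)
print(selected)
```

In this toy input, `miR-X` is dropped because only one study reports it, and `miR-Y` is dropped because it is absent from the human expression set.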


On the whole, it appears that the expression signature of miRNAs in the setting of myocardial ischemia and the predicted gene targeting of these miRNAs coincide with the ionic remodelling process under this pathological condition. The miRNAs seem to be involved in all aspects of the abnormalities of cardiac excitability during ischemia, as manifested by the slowing of cardiac conduction due to reduced *I*Na and Cx43, the depolarized membrane potential (which adversely affects cardiac conduction) due to reduced *I*K1, the impaired excitation-contraction coupling and contractile function due to reduced *I*Ca,L and Na+/K+ ATPase, and the delayed cardiac repolarization due to reduced *I*Ks and *I*to1.

#### **3. Apoptosis-related genes as targets for miRNAs**

It has been nearly 40 years since Kerr named the novel death process "apoptosis", from the Greek word meaning "falling of the leaves", an active process that leads to cell death (Kerr et al., 1972). The human body destroys ~60 × 10⁹ cells/day through an apoptotic process in response to various stresses such as physiological, pathogenic, or cytotoxic stimuli (Reed, 2002). Unlike necrosis, apoptosis is a complex endogenous gene-controlled event that requires an exogenous signal and is stimulated or inhibited by a variety of regulatory factors, such as formation of oxygen free radicals, ischemia, hypoxia, reduced intracellular K+ concentration, and generation of nitric oxide. Progressive cell loss due to apoptosis is a pathological hallmark implicated in a wide spectrum of degenerative diseases such as heart disease, atherosclerotic arteries and hypertensive vessels, Alzheimer's disease, etc. (Jaffe et al., 1997; Palojoki et al., 2001; Sabbah et al., 1998). Apoptosis as an early and predominant form of cell death has been detected in human acute myocardial infarcts and it was shown to increase in reperfused myocardium. Apoptosis is also believed to account for the loss of cell mass in the failing heart. Evidence for the role of miRNAs in cardiomyocyte apoptosis has been rapidly accumulating. My group documented the first such evidence: the muscle-specific miRNAs miR-1 and miR-133 produce opposing actions on cardiomyocyte apoptosis, with miR-1 being pro-apoptotic and miR-133 anti-apoptotic (Xu et al., 2007). miR-21 is also an anti-apoptotic miRNA. It has been shown to produce beneficial effects against H2O2-induced injury in cardiac myocytes and against ischemia/reperfusion injury through an anti-apoptotic action mediated by its target Programmed Cell Death 4 (PDCD4).
Based on our computational prediction, many other miRNAs, such as the miR-17~92 cluster and its two paralogs, the miR-106a~363 and miR-106b~25 clusters, also have the potential to regulate cardiomyocyte apoptosis by targeting the related genes in the signalling pathways (unpublished observations).

Following procedures similar to those used to predict ion channel genes as targets for miRNAs (described in Section 2.1), we analyzed the genes known to be crucial for cell survival and death for miRNA regulation.

#### **3.1 Control of cardiomyocyte apoptosis by miRNAs under normal conditions**

Our analyses gave us an overall picture of how cardiomyocyte homeostasis may be maintained by miRNAs and allowed us to divide miRNAs roughly into two groups, pro-apoptotic miRNAs and anti-apoptotic miRNAs, though there is no clear-cut distinction, as each miRNA may simultaneously target both survival and apoptotic genes. This property indicates that cardiomyocyte survival and death are tightly controlled and delicately balanced. Any change in the expression of miRNAs can shift the balance, leading to alterations of cell fate.

1. Among the top 20 most abundant miRNAs in the heart, only miR-99 and miR-100 have no predicted target genes relevant to apoptosis; the others have 1 to 27 targets.

2. The let-7 family, miR-16, miR-20a/b, miR-125a/b, and miR-29a/b were predicted to target some major survival genes including BCL2, BCL2L2, AKT2, AKT3, STAT3, IGF-1 and MCL1. Thus, they are more likely to be pro-apoptotic miRNAs. Indeed, miR-29 has been experimentally proven to be pro-apoptotic (Ye et al., 2010), and our unpublished observations indicate a strong promotion of cardiomyocyte apoptosis by miR-20a/b. Studies conducted in cancer cells support our notion that miR-16 induces apoptosis (Cimmino et al., 2005; Tsang & Kwok, 2010), though its effects on cardiac cells have not yet been determined.

3. The miR-30 family, miR-24, miR-23a/b, miR-26a/b, miR-27a/b, miR-145, miR-92a/b, and miR-199a/b may be anti-apoptotic miRNAs, as they were predicted to target many important cell death genes, such as CASP3 (encoding caspase 3), CASP7, BCL2L11, BAK1, BAX, FOS, etc. Among these miRNAs, miR-199 and miR-24 have been shown to produce cardioprotective effects against apoptosis (Qian et al., 2011). miR-145 is known to mediate inhibition of proliferation and induction of apoptosis of cancer cells (Ostenfeld et al., 2010; Sachdeva & Mo, 2010).

4. Several miRNAs were predicted to target both survival and apoptotic genes; these include the muscle-specific miRNAs miR-1 and miR-133, as well as miR-21 and miR-195. In theory, these miRNAs are either neutral, without affecting cell death, or can produce a pro-apoptotic or an anti-apoptotic effect depending on the particular cellular context, that is, on which target genes of a particular miRNA are expressed. Indeed, miR-1 and miR-133 do not affect cardiomyocyte apoptosis under normal conditions, but when the expression of many survival and death genes is increased in response to oxidative stress, miR-1 promotes cardiomyocyte apoptosis by targeting heat shock protein 60, whereas miR-133 protects against apoptosis by repressing caspase 9 (Xu et al., 2007). miR-21 has been commonly believed to elicit cardioprotective effects in myocardial ischemia and ischemia/reperfusion injuries. But in tumor cells it has been reported to be pro-apoptotic, anti-apoptotic or neutral. For example, knockdown of miR-21 in cultured glioblastoma cells resulted in a significant drop in cell number. This reduction was accompanied by increases in enzyme activity of caspases 3 and 7, as well as terminal deoxyribonucleotidyl transferase-mediated dUTP–digoxigenin nick end-labelling (TUNEL) staining (Chan et al., 2005; Corsten et al., 2007). In MCF-7 human breast cancer cells, miR-21 elicits anti-apoptotic effects (Si et al., 2007; Zhu et al., 2007). However, in neuroblastoma cells, miR-Lat was reported to protect against apoptotic cell death (Gupta et al., 2006). This property of miR-21 in cancer cells is in line with our prediction.

5. The cardiac-specific miRNAs miR-208a/b and miR-499 do not seem to have a significant role in regulating apoptosis, since they were predicted to target only a small number of genes involved in apoptosis signalling: CDKN1A and E2F6, whose expression levels are low in the heart. Moreover, the expression levels of these cardiac-specific miRNAs are also in the low range.
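The grouping logic used in this section (a miRNA that targets only survival genes is treated as pro-apoptotic, one that targets only death genes as anti-apoptotic, and one that targets both as context-dependent) can be sketched as below. The two gene sets are taken from the genes named in the text. The per-miRNA target sets are illustrative stand-ins rather than the predicted target lists from this analysis, and `classify` is a hypothetical helper.

```python
# Gene sets named in the text.
SURVIVAL_GENES = {"BCL2", "BCL2L2", "AKT2", "AKT3", "STAT3", "MCL1"}
DEATH_GENES = {"CASP3", "CASP7", "BCL2L11", "BAK1", "BAX", "FOS"}


def classify(targets):
    """Label a miRNA from the set of genes it is predicted to repress."""
    hits_survival = bool(targets & SURVIVAL_GENES)
    hits_death = bool(targets & DEATH_GENES)
    if hits_survival and hits_death:
        return "context-dependent"
    if hits_survival:
        return "pro-apoptotic"   # repressing survival genes favours cell death
    if hits_death:
        return "anti-apoptotic"  # repressing death genes favours survival
    return "no predicted role"


# Illustrative target sets (assumptions chosen to show each branch).
illustrative_targets = {
    "miR-29": {"BCL2", "MCL1"},
    "miR-24": {"CASP3", "BAX"},
    "miR-1": {"AKT3", "CASP3"},
    "miR-99": set(),
}

labels = {mirna: classify(t) for mirna, t in illustrative_targets.items()}
for mirna, label in labels.items():
    print(f"{mirna}: {label}")
```

A count-based or abundance-weighted score would be a natural refinement, since a binary rule ignores how many targets fall on each side.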


#### **3.2 Control of cardiomyocyte apoptosis by miRNAs in ischemic myocardium**

Myocardial infarction (MI), a typical situation of metabolic stress, is presented as cascades of cellular abnormalities as a result of deleterious alterations of gene expression outweighing adaptive changes. MI can cause severe cardiac injuries and the consequences are contraction failure, electrical abnormalities and even lethal arrhythmias, and eventual death of the cell. Apoptosis is an important mechanism for the cell death occurring in ischemic myocardium. Previous work on miRNAs and apoptosis has been mostly limited to the context of cancer, while studies on apoptosis regulation by miRNAs in non-cancer cells have been sparse. The first evidence for the role of miRNAs in cardiomyocyte apoptosis was obtained in 2007 from my laboratory demonstrating the proapoptotic effect of miR-1 and anti-apoptotic effect of

MicroRNA Targeting in Heart: A Theoretical Analysis 551

major cause of morbidity and mortality in humans (Fox et al., 2007). The loss of blood flow to the left ventricular free wall of the heart after MI results in death of cardiomyocytes and impaired cardiac contractility. Scar formation at the site of the infarct and interstitial fibrosis of adjacent myocardium prevent myocardial repair, diminish coronary reserve and contribute to loss of pump function, and predisposes individuals to ventricular dysfunction and arrhythmias, which, in turn, confer an increased risk of adverse cardiovascular events (Swynghedauw, 1999). Elucidation of the precise mechanisms responsible for the actions of these factors could forge new frontiers in both risk identification and prevention of fibrosis-

A subset of miRNAs is enriched in cardiac fibroblasts compared to cardiomyocytes. A number of studies have demonstrated the involvement of miRNAs in regulating myocardial fibrosis in the settings of myocardial ischemia or mechanical overload. In this conceptual framework, the investigation of miRNAs might offer a new opportunity to advance our knowledge of the pathogenesis of fibrosis. Characterization of individual miRNAs or miRNA expression profiles that are specifically associated with myocardial fibrosis might allow us to develop diagnostic tools and innovative therapies for fibrogenic cardiac diseases. The identification of miRNAs as potential regulators of myocardial fibrosis has clinical implications; the search for a miRNA expression pattern specific to fibrosis might provide a novel diagnostic approach. Yet, the molecular mechanisms that lead to a fibrogenic cardiac

Our analysis revealed that 19 of the 20 most abundant miRNAs in the heart have the potential to repress multiple genes known to be involved in fibrogenesis including various types of collagens (COL), CTGF (connective tissue growth factor), FBN1/2/3 (fibrillin1/2/3), ASPN (asporin), MMP2 (matrix metallopeptidase 2), FN1 (fibronectin 1), and various types of TRP channels (transient receptor potential). miR-126 is the only one among the 20 most abundant miRNAs that was not predicted to regulate any fibrosisrelevant genes. The cardiac-specific miRNAs miR-208a and miR-208b seem to have minimal effects on fibrosis sine they have only two target genes OMG (oligodendrocyte myelin glycoprotein) and TTN (titin). Another cardia-specific miRNA miR-499 is likely an antifibrotic miRNA as it was predicted to target 12 profibrotic factors including collagens, LAMA1 (laminin 1), FBN2, FN1, OMG, SLN (sarcolipin), TTN, etc. It should be noted that all these target genes encode profibrotic proteins. Our data therefore indicate that the heart is evolved with a super-strong epigenetic program to prevent fibrogenesis or to suppress

Experimentally, several miRNAs including miR-29, miR-30, miR-133 and miR-590 were all found to produce anti-fibrotic effects (Duisters et al., 2009; Shan et al., 2009; van Rooij et al., 2008), whereas evidence exists for miR-208 (van Rooij et al., 2007) and miR-21 as pro-fibrotic miRNAs (Roy et al., 2009; Thum et al., 2008; van Rooij et al., 2008;). The former can be explained based on our prediction that miR-29, miR-30, miR-133 and miR-590 all have the potential to target profibrotic genes. The latter, however, seems not quite straightforward. Surprisingly, a murine genetic miR-21 knockout model failed to show an antifibrotic phenotype after cardiac stress suggesting differences in pharmacological and genetic miR-21 knockdown (Patrick et al., 2010). Indeed, the various miR-21 inhibitor chemistries have different effects on cardiac fibrosis (Thum et al., 2011). It is likely that other genes not included in our analysis may be the targets for miR-208 and miR-21 to produce pro-fibrotic actions.

**4.1 Control of cardiac fibrogenesis by miRNAs under normal conditions** 

derived clinical complications in patients with cardiac disease.

phenotype are still being elucidated.

fibrosis under normal conditions.

miR-133 in response to oxidative stress (Xu et al., 2007), with miR-1 causing proapoptotic effects confirmed by other groups (Yu et al., 2008; Tang et al., 2010). Subsequent studies in 2009 and 2010 revealed the involvement of other miRNAs such as miR-21, miR-24, miR-29, miR-199a, and miR-320 in ischemic myocardial injury (Cheng et al., 2010; Qian et al., 2011; Rane et al., 2009; Ren et al., 2009; Ye et al., 2010; Yin et al., 2008).

Extracting of the overlapping results from different laboratories and filtering with the cardiac expression profile verified by real-time RT-PCR in human hearts allowed us to identify an array of miRNAs that are likely deregulated in the setting of myocardial ischemia. The upregulated miRNAs include miR-1, miR-23, miR-29, miR-20, miR-30, miR-146b-5p, miR-193, miR-378, miR-181, miR-491-3p, miR-106, miR-199b-5p, and let-7f; the downregulated miRNAs include miR-320, miR-185, miR-324-3p, and miR-214. We then applied our procedures to these miRNAs and our analyses yielded the following notions.


#### **4. Fibrosis-related genes as targets for miRNAs**

In tissues composed of post-mitotic cells, like heart, new cells cannot be regenerated; instead, fibroblasts proliferate to fill the gaps created due to removal of dead cells. In the normal heart, two thirds of the cell population is composed of nonmuscle cells, the majority of which are fibroblasts (Maisch, 1995; Manabe et al., 2002). Cardiac fibroblasts, along with cardiomyocytes, play an essential role in the progression of cardiac remodelling. Damaging insults evoke multiple signalling pathways that lead to coordinate and sequential gene regulation; the initial events lead to the activation of cardiac fibroblasts. Cardiac fibrosis is the result of both an increase in fibroblast proliferation and extracellular matrix (ECM) deposition. Cardiac myocytes are normally surrounded by a fine network of collagen fibres. Myocardial fibrosis is an established morphological feature of the structural myocardial remodelling that is a characteristic of all forms of cardiac pathology (Berk et al., 2007; Khan & Sheppard, 2006). A growing body of evidence indicates that, along with cardiomyocytes hypertrophy, diffuse interstitial fibrosis is a key pathologic feature of myocardial remodelling in a number of cardiac diseases of different (e.g. ischemic, hypertensive, valvular, genetic, and metabolic) origin. Acute myocardial infarction due to coronary artery occlusion represents a

miR-133 in response to oxidative stress (Xu et al., 2007), with miR-1 causing pro-apoptotic effects confirmed by other groups (Yu et al., 2008; Tang et al., 2010). Subsequent studies in 2009 and 2010 revealed the involvement of other miRNAs such as miR-21, miR-24, miR-29, miR-199a, and miR-320 in ischemic myocardial injury (Cheng et al., 2010; Qian et al., 2011; Rane et al., 2009; Ren et al., 2009; Ye et al., 2010; Yin et al., 2008).

Extracting the overlapping results from different laboratories and filtering them against the cardiac expression profile verified by real-time RT-PCR in human hearts allowed us to identify an array of miRNAs that are likely deregulated in the setting of myocardial ischemia. The upregulated miRNAs include miR-1, miR-23, miR-29, miR-20, miR-30, miR-146b-5p, miR-193, miR-378, miR-181, miR-491-3p, miR-106, miR-199b-5p, and let-7f; the downregulated miRNAs include miR-320, miR-185, miR-324-3p, and miR-214. We then applied our procedures to these miRNAs, and our analyses yielded the following observations.

1. Among the upregulated miRNAs, only miR-99 and miR-100 have no predicted target genes relevant to apoptosis; the others have 1 to 27 targets related to cell survival and death. Notably, a majority of these miRNAs are predicted to be pro-apoptotic (let-7f, miR-1, miR-20, miR-29, miR-106, miR-181, miR-193, miR-378 and miR-491-3p), leaving the other four (miR-23, miR-30, miR-146b-5p and miR-199b-5p) as anti-apoptotic.
2. Among the downregulated miRNAs, except for miR-185, which is expressed at extremely low abundance, miR-214, miR-320 and miR-324 are expected to be neutral, as they were predicted to target both survival (AKT3, STAT3 and MCL1) and apoptotic (CASP3, BAX, CDK6, etc.) genes. Their downregulation therefore may not have a significant impact on cell death in MI. However, it has been reported that overexpression of miR-320 in cultured adult rat cardiomyocytes enhanced apoptotic cell death, whereas its knockdown was cytoprotective against apoptosis during simulated ischemia/reperfusion injury, through targeting HSP20 (Ren et al., 2009), which is not within the list of our present theoretical prediction.
3. Taken together, it appears that the pro-apoptotic force is enhanced more than the anti-apoptotic force, in agreement with the fact that apoptosis is increased in the setting of MI.

**4. Fibrosis-related genes as targets for miRNAs**

In tissues composed of post-mitotic cells, such as the heart, new cells cannot be regenerated; instead, fibroblasts proliferate to fill the gaps left by the removal of dead cells. In the normal heart, two thirds of the cell population is composed of nonmuscle cells, the majority of which are fibroblasts (Maisch, 1995; Manabe et al., 2002). Cardiac fibroblasts, along with cardiomyocytes, play an essential role in the progression of cardiac remodelling. Damaging insults evoke multiple signalling pathways that lead to coordinated and sequential gene regulation; the initial events lead to the activation of cardiac fibroblasts. Cardiac fibrosis is the result of both increased fibroblast proliferation and increased extracellular matrix (ECM) deposition. Cardiac myocytes are normally surrounded by a fine network of collagen fibres. Myocardial fibrosis is an established morphological feature of the structural myocardial remodelling that is characteristic of all forms of cardiac pathology (Berk et al., 2007; Khan & Sheppard, 2006). A growing body of evidence indicates that, along with cardiomyocyte hypertrophy, diffuse interstitial fibrosis is a key pathologic feature of myocardial remodelling in a number of cardiac diseases of different (e.g. ischemic, hypertensive, valvular, genetic, and metabolic) origin.

Acute myocardial infarction due to coronary artery occlusion represents a major cause of morbidity and mortality in humans (Fox et al., 2007). The loss of blood flow to the left ventricular free wall after MI results in death of cardiomyocytes and impaired cardiac contractility. Scar formation at the site of the infarct and interstitial fibrosis of the adjacent myocardium prevent myocardial repair, diminish coronary reserve, contribute to loss of pump function, and predispose individuals to ventricular dysfunction and arrhythmias, which, in turn, confer an increased risk of adverse cardiovascular events (Swynghedauw, 1999). Elucidation of the precise mechanisms responsible for the actions of these factors could forge new frontiers in both risk identification and prevention of fibrosis-derived clinical complications in patients with cardiac disease.

A subset of miRNAs is enriched in cardiac fibroblasts compared to cardiomyocytes. A number of studies have demonstrated the involvement of miRNAs in regulating myocardial fibrosis in the settings of myocardial ischemia or mechanical overload. In this conceptual framework, the investigation of miRNAs might offer a new opportunity to advance our knowledge of the pathogenesis of fibrosis. Characterization of individual miRNAs or miRNA expression profiles that are specifically associated with myocardial fibrosis might allow us to develop diagnostic tools and innovative therapies for fibrogenic cardiac diseases. The identification of miRNAs as potential regulators of myocardial fibrosis has clinical implications; the search for a miRNA expression pattern specific to fibrosis might provide a novel diagnostic approach. Yet, the molecular mechanisms that lead to a fibrogenic cardiac phenotype are still being elucidated.

#### **4.1 Control of cardiac fibrogenesis by miRNAs under normal conditions**

Our analysis revealed that 19 of the 20 most abundant miRNAs in the heart have the potential to repress multiple genes known to be involved in fibrogenesis, including various types of collagens (COL), CTGF (connective tissue growth factor), FBN1/2/3 (fibrillin 1/2/3), ASPN (asporin), MMP2 (matrix metallopeptidase 2), FN1 (fibronectin 1), and various types of TRP (transient receptor potential) channels. miR-126 is the only one of the 20 most abundant miRNAs that was not predicted to regulate any fibrosis-relevant genes. The cardiac-specific miRNAs miR-208a and miR-208b seem to have minimal effects on fibrosis since they have only two target genes, OMG (oligodendrocyte myelin glycoprotein) and TTN (titin). Another cardiac-specific miRNA, miR-499, is likely an anti-fibrotic miRNA, as it was predicted to target 12 profibrotic factors including collagens, LAMA1 (laminin 1), FBN2, FN1, OMG, SLN (sarcolipin), TTN, etc. It should be noted that all these target genes encode profibrotic proteins. Our data therefore indicate that the heart has evolved a strong epigenetic program to prevent fibrogenesis or to suppress fibrosis under normal conditions.
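As a rough illustration of how such a tally can be organized, the sketch below first applies a consensus rule to per-algorithm predictions and then counts fibrosis-related targets per miRNA. It is not the authors' pipeline. The threshold of four out of seven algorithms is the inclusion criterion stated in this chapter's discussion of limitations; the algorithm labels (`algo_1` to `algo_7`), the vote sets, and the function names are hypothetical. The gene names are a subset of those mentioned above for miR-208a and miR-499, and the sub-threshold votes for miR-126 are invented to show a rejected prediction.

```python
from collections import defaultdict

MIN_VOTES = 4  # chapter's stated criterion: at least four of seven algorithms

# Hypothetical input: for each (miRNA, gene) pair, the set of algorithms
# that predicted the interaction. Labels and vote sets are invented.
votes = {
    ("miR-208a", "OMG"): {"algo_1", "algo_2", "algo_3", "algo_5"},
    ("miR-208a", "TTN"): {"algo_1", "algo_2", "algo_4", "algo_6", "algo_7"},
    ("miR-499", "FN1"): {"algo_1", "algo_3", "algo_4", "algo_5"},
    ("miR-499", "FBN2"): {"algo_2", "algo_3", "algo_6", "algo_7"},
    ("miR-499", "TTN"): {"algo_1", "algo_2", "algo_3", "algo_4"},
    ("miR-126", "FN1"): {"algo_2"},            # below threshold, rejected
    ("miR-126", "TTN"): {"algo_1", "algo_5"},  # below threshold, rejected
}


def consensus_targets(votes, mirnas, min_votes=MIN_VOTES):
    """Keep a miRNA:gene pair only if enough algorithms agree on it."""
    targets = {mirna: set() for mirna in mirnas}
    for (mirna, gene), algos in votes.items():
        if len(algos) >= min_votes:
            targets.setdefault(mirna, set()).add(gene)
    return targets


def regulators_by_gene(targets):
    """Invert the table: gene -> set of miRNAs predicted to repress it."""
    by_gene = defaultdict(set)
    for mirna, genes in targets.items():
        for gene in genes:
            by_gene[gene].add(mirna)
    return dict(by_gene)


mirnas = ["miR-126", "miR-208a", "miR-499"]
targets = consensus_targets(votes, mirnas)
no_targets = sorted(m for m, genes in targets.items() if not genes)
by_gene = regulators_by_gene(targets)

print("miRNAs without consensus fibrosis targets:", no_targets)
print("miRNAs converging on TTN:", sorted(by_gene["TTN"]))
```

Inverting the table by gene makes convergent regulation visible, which is the basis for the argument that several abundant miRNAs jointly restrain profibrotic genes.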

Experimentally, several miRNAs, including miR-29, miR-30, miR-133 and miR-590, were found to produce anti-fibrotic effects (Duisters et al., 2009; Shan et al., 2009; van Rooij et al., 2008), whereas evidence exists for miR-208 (van Rooij et al., 2007) and miR-21 as pro-fibrotic miRNAs (Roy et al., 2009; Thum et al., 2008; van Rooij et al., 2008). The former can be explained by our prediction that miR-29, miR-30, miR-133 and miR-590 all have the potential to target profibrotic genes. The latter, however, is less straightforward. Surprisingly, a murine genetic miR-21 knockout model failed to show an anti-fibrotic phenotype after cardiac stress, suggesting differences between pharmacological and genetic miR-21 knockdown (Patrick et al., 2010). Indeed, the various miR-21 inhibitor chemistries have different effects on cardiac fibrosis (Thum et al., 2011). It is likely that other genes not included in our analysis are the targets through which miR-208 and miR-21 produce their pro-fibrotic actions.
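The comparison between experiment and prediction in the paragraph above can be kept as a small lookup. The sketch below is only a bookkeeping aid built from the labels given in that paragraph; the variable and function names are hypothetical, and the rule that repressing profibrotic genes implies an anti-fibrotic effect is the simplifying assumption used in the text.

```python
# Experimental effects as reported in the paragraph above.
experimental_effect = {
    "miR-29": "anti-fibrotic",
    "miR-30": "anti-fibrotic",
    "miR-133": "anti-fibrotic",
    "miR-590": "anti-fibrotic",
    "miR-208": "pro-fibrotic",
    "miR-21": "pro-fibrotic",
}

# miRNAs that the text says its prediction accounts for
# (they are predicted to target profibrotic genes).
explained_by_profibrotic_targets = {"miR-29", "miR-30", "miR-133", "miR-590"}


def is_explained(mirna, effect):
    """An anti-fibrotic effect is explained if profibrotic targets are predicted."""
    return effect == "anti-fibrotic" and mirna in explained_by_profibrotic_targets


unexplained = sorted(
    mirna
    for mirna, effect in experimental_effect.items()
    if not is_explained(mirna, effect)
)
print("Effects not explained by the prediction:", unexplained)
```

The two miRNAs left over are the ones for which the text suggests that targets outside the analysed gene set may be responsible.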

#### **4.2 Control of atrial fibrogenesis by miRNAs during atrial fibrillation**

Atrial fibrillation (AF) is the most commonly encountered clinical arrhythmia; it causes substantial health problems by increasing the risk of stroke and exacerbating heart failure. It is characterized by a process termed atrial structural remodelling, with increased atrial fibrosis. Indeed, atrial fibrosis has been strongly associated with heart diseases and arrhythmias, including congestive heart failure (CHF) and AF (Pellman et al., 2010; Tan & Zimetbaum, 2010).

To determine whether miRNAs are involved in atrial structural remodelling, we first conducted expression profiling to identify deregulated miRNAs in the atrial tissues of a canine model of tachypacing-induced chronic AF, using miRNA microarray analysis to compare miRNA expression between control and AF dogs. Four miRNAs (miR-223, miR-328, miR-664 and miR-517) were increased more than 2-fold, and six (miR-101, miR-133, miR-145, miR-320, miR-373 and miR-499) were decreased by at least 50%. Real-time quantitative RT-PCR (qRT-PCR) analysis confirmed the significant upregulation of miR-223, miR-328 and miR-664 (miR-517 was undetectable) and the significant downregulation of miR-101, miR-320, and miR-499. Intriguingly, none of these deregulated miRNAs is among the 20 most abundant miRNAs, although miR-223 and miR-328 are among the cardiac-enriched miRNAs. These findings suggest that altered miRNA expression in this AF model tends to favour fibrogenesis; however, miRNAs are definitely not the major determinant of the atrial structural remodelling associated with fibrosis.

In a recent study reported by Chen's group (Xiao et al., 2011), miRNA expression was found to undergo extensive alterations in atrial tissues from AF patients with mitral stenosis. Intriguingly, of the 20 most abundant miRNAs in the heart, only let-7b/i and miR-30d were significantly upregulated, whereas 9 of the 20 (miR-29, miR-133, miR-24, miR-26, miR-126, miR-125, miR-99, miR-20, and miR-23) were downregulated. Based on our computational prediction, these changes are expected to reduce the anti-fibrotic force and thereby promote atrial fibrogenesis.

#### **5. Conclusion**

The theoretical analyses presented here, in conjunction with experimental demonstration of miRNA expression profiles under various conditions, allowed us to establish a matrix of miRNAs that are expressed in cardiac cells and have the potential to regulate the genes encoding cardiac ion channels and transporters, proteins responsible for cell survival and death, and proteins involved in fibrogenesis in the heart. These miRNAs likely play an important role in controlling cardiac excitability, cardiomyocyte homeostasis and cardiac fibrosis. In other words, the genes determining these processes may normally be under the post-transcriptional regulation of a group of miRNAs. Indeed, some of the predicted targets have already been demonstrated experimentally. We were also able to link a particular remodeling process in hypertrophy/heart failure, myocardial ischemia, or atrial fibrillation to the corresponding deregulated miRNAs under that pathological condition; the changes in miRNAs appear to be anti-correlated with the changes in many of the genes responsible for cardiac electrophysiology, cardiomyocyte apoptosis and cardiac fibrosis under these conditions. The present study should help pinpoint the individual miRNAs that are most likely to take part in the electrical and structural remodelling processes through targeting particular genes.

It should be noted, however, that the present computational study is in no way meant to replace experimental approaches for understanding the role of miRNAs in regulating gene expression; rather, it merely presents a prediction of the odds of miRNA:mRNA interactions under normal conditions and in the context of electrical/ionic remodeling under selected conditions of the heart. This theoretical analysis, like all computational studies, eventually needs to be verified by bench-top work and should not be considered original results. Nonetheless, given the sparse experimental data published to date and the anticipated difficulty of acquiring complete experimental data with currently available techniques, the analytical procedures described here can serve as first-hand information, providing a framework and guideline for future experimental studies.

The second limitation of the study is the possibility of underestimating the number of miRNAs that could regulate ion channels, apoptosis and fibrosis, owing to the stringent inclusion criterion that targets be positively predicted by at least four out of seven algorithms; in the past, we were able to experimentally verify nearly all the target genes predicted by a single algorithm, miRanda, in our pre-experiment analyses. However, the fact that our prediction includes all 20 most abundant miRNAs and other highly expressed miRNAs in the myocardium suggests that this limitation might not have a significant negative impact on the accuracy of our analysis; moreover, including more miRNAs under more permissive criteria would not guarantee their physiological function if they are scarcely expressed in the heart. It should also be noted that the miRNA expression profiles were obtained from myocardium, which also includes fibroblasts, so caution is needed when interpreting the expression data.

Another important point is that, although our prediction of miRNA targeting coincides with the changes in expression of relevant genes under the pathological conditions, this does not imply that miRNAs are necessarily the most important or the only determinant of the electrical remodeling processes. At most, our data indicate a potential contribution of miRNAs to such conditions; other molecules, such as transcription factors, must also be involved in regulating the expression of ion channel genes under these conditions.

Finally, it is also difficult to predict the net outcome when two miRNAs target the same gene but change in expression in opposite directions. Yet, with a deeper and broader understanding of miRNA targeting and action, these limitations should eventually be resolved.

#### **6. Acknowledgment**

The work presented was supported by the Canadian Institutes of Health Research (CIHR).

#### **7. References**

Berk, B.C., Fujiwara, K., & Lehoux, S. (2007). ECM remodeling in hypertensive heart disease. *J Clin Invest* 117, 568–575.

Beuckelmann, D.J., Nabauer, M., & Erdmann, E. (1993). Alterations of K+ currents in isolated human ventricular myocytes from patients with terminal heart failure. *Circ Res* 73, 379–385.

Carè, A., Catalucci, D., Felicetti, F., Bonci, D., Addario, A., Gallo, P., Bang, M.L., Segnalini, P., Gu, Y., Dalton, N.D., Elia, L., Latronico, M.V., Høydal, M., Autore, C., Russo, M.A., Dorn, G.W. 2nd, Ellingsen, O., Ruiz-Lozano, P., Peterson, K.L., Croce, C.M., Peschle, C., & Condorelli, G. (2007). MicroRNA-133 controls cardiac hypertrophy. *Nat Med* 13, 613–618.

Carmeliet, E. (1999). Cardiac ionic currents and acute ischemia: from channels to arrhythmias. *Physiol Rev* 79, 917–1017.

Chan, J.A., Krichevsky, A.M., & Kosik, K.S. (2005). MicroRNA-21 is an antiapoptotic factor in human glioblastoma cells. *Cancer Res* 65, 6029–6033.

Cheng, Y., Ji, R., Yue, J., Yang, J., Liu, X., Chen, H., Dean, D.B., & Zhang, C. (2007). MicroRNAs are aberrantly expressed in hypertrophic heart. Do they play a role in cardiac hypertrophy? *Am J Pathol* 170, 1831–1840.

Cheng, Y., Zhu, P., Yang, J., Liu, X., Dong, S., Wang, X., Chun, B., Zhuang, J., & Zhang, C. (2010). Ischemic preconditioning-regulated miR-21 protects the heart from ischemia/reperfusion injury via anti-apoptosis through its target PDCD4. *Cardiovasc Res* 87(3), 431–439.

Cimmino, A., Calin, G.A., Fabbri, M., Iorio, M.V., Ferracin, M., Shimizu, M., Wojcik, S.E., Aqeilan, R.I., Zupo, S., Dono, M., Rassenti, L., Alder, H., Volinia, S., Liu, C.G., Kipps, T.J., Negrini, M., & Croce, C.M. (2005). miR-15 and miR-16 induce apoptosis by targeting BCL2. *Proc Natl Acad Sci USA* 102(39), 13944–13949.

Corsten, M.F., Miranda, R., Kasmieh, R., Krichevsky, A.M., Weissleder, R., & Shah, K. (2007). MicroRNA-21 knockdown disrupts glioma growth in vivo and displays synergistic cytotoxicity with neural precursor cell delivered S-TRAIL in human gliomas. *Cancer Res* 67, 8994–9000.

Dong, S., Cheng, Y., Yang, J., Li, J., Liu, X., Wang, X., Wang, D., Krall, T.J., Delphin, E.S., & Zhang, C. (2009). MicroRNA expression signature and the role of microRNA-21 in the early phase of acute myocardial infarction. *J Biol Chem* 284, 29514–29525.

Duisters, R.F., Tijsen, A.J., Schroen, B., Leenders, J.J., Lentink, V., van der Made, I., Herias, V., van Leeuwen, R.E., Schellings, M.W., Barenbrug, P., Maessen, J.G., Heymans, S., Pinto, Y.M., & Creemers, E.E. (2009). miR-133 and miR-30 regulate connective tissue growth factor: implications for a role of microRNAs in myocardial matrix remodeling. *Circ Res* 104, 170–178.

Dun, W., & Boyden, P.A. (2005). Diverse phenotypes of outward currents in cells that have survived in the 5-day-infarcted heart. *Am J Physiol* 289, H667–H673.

Dupont, E., Matsushita, T., Kaba, R.A., Vozzi, C., Coppen, S.R., Khan, N., Kaprielian, R., Yacoub, M.H., & Severs, N.J. (2001). Altered connexin expression in human congestive heart failure. *J Mol Cell Cardiol* 33, 359–371.

Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C., & Marks, D.S. (2003). MicroRNA targets in Drosophila. *Genome Biology* 5, R1.

Flesch, M., Schwinger, R.H., Schiffer, F., Frank, K., Südkamp, M., Kuhn-Regnier, F., Arnold, G., & Böhm, M. (1996). Evidence for functional relevance of an enhanced expression of the Na+-Ca2+ exchanger in failing human myocardium. *Circulation* 94, 992–1002.

Fox, C.S., Coady, S., Sorlie, P.D., D'Agostino, R.B. Sr, Pencina, M.J., Vasan, R.S., Meigs, J.B., Levy, D., & Savage, P.J. (2007). Increasing cardiovascular disease burden due to diabetes mellitus: The Framingham Heart Study. *Circulation* 115, 1544–1550.

Friedman, P.L., Fenoglio, J.J., & Wit, A.L. (1975). Time course for reversal of electrophysiological and ultrastructural abnormalities in subendocardial Purkinje fibers surviving extensive myocardial infarction in dogs. *Circ Res* 36, 127–144.

Friedman, R.C., Farh, K.K-H., Burge, C.B., & Bartel, D.P. (2009). Most mammalian mRNAs are conserved targets of microRNAs. *Genome Res* 19, 92–105.

Fuller, W., Parmar, V., Eaton, P., Bell, J.R., & Shattock, M.J. (2003). Cardiac ischemia causes inhibition of the Na/K ATPase by a labile cytosolic compound whose production is linked to oxidant stress. *Cardiovasc Res* 57, 1044–1051.

Gardner, P.I., Ursell, P.C., Fenoglio, J.J. Jr., & Wit, A.L. (1985). Electrophysiologic and anatomic basis for fractionated electrograms recorded from healed myocardial infarcts. *Circulation* 72, 596–611.

Grimson, A., Farh, K.K-H., Johnston, W.K., Garrett-Engele, P., Lim, L.P., & Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. *Molecular Cell* 27, 91–105.

Gupta, A., Gartner, J.J., Sethupathy, P., Hatzigeorgiou, A.G., & Fraser, N.W. (2006). Anti-apoptotic function of a microRNA encoded by the HSV-1 latency-associated transcript. *Nature* 442, 82–85.

Jaffe, R., Flugelman, M.Y., Halon, D.A., & Lewis, B.S. (1997). Ventricular remodelling: from bedside to molecule. *Adv Exp Med Biol* 430, 257–266.

Kerr, J.F., Wyllie, A.H., & Currie, A.R. (1972). Apoptosis: A basic biological phenomenon with wide-ranging implications in tissue kinetics. *Br J Cancer* 26, 239–257.

Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., & Segal, E. (2007). The role of site accessibility in microRNA target recognition. *Nature Genet* 39, 1278–1284.

Khan, R., & Sheppard, R. (2006). Fibrosis in heart disease: understanding the role of transforming growth factor-beta in cardiomyopathy, valvular disease and arrhythmia. *Immunology* 118, 10–24.

Kiriakidou, M., Nelson, P.T., Kouranov, A., Fitziev, P., Bouyioukos, C., Mourelatos, Z., & Hatzigeorgiou, A. (2004). A combined computational-experimental approach predicts human microRNA targets. *Genes Dev* 18, 1165–1178.

Kitamura, H., Ohnishi, Y., Yoshida, A., Okajima, K., Azumi, H., Ishida, A., Galeano, E.J., Kubo, S., Hayashi, Y., Itoh, H., & Yokoyama, M. (2002). Heterogeneous loss of connexin43 protein in nonischemic dilated cardiomyopathy with ventricular tachycardia. *Cardiovasc Electrophysiol* 13, 865–870.

Krek, A., Grün, D., Poy, M.N., Wolf, R., Rosenberg, L., Epstein, E.J., MacMenamin, P., da Piedade, I., Gunsalus, K.C., Stoffel, M., & Rajewsky, N. (2005). Combinatorial microRNA target predictions. *Nat Genet* 37, 495–500.

Lewis, B.P., Burge, C.B., & Bartel, D.P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. *Cell* 120, 15–20.

Lewis, B.P., Shih, I.H., Jones-Rhoades, M.W., Bartel, D.P., & Burge, C.B. (2003). Prediction of mammalian microRNA targets. *Cell* 115, 787–798.

Liang, Y., Ridzon, D., Wong, L., & Chen, C. (2007). Characterization of microRNA expression profiles in normal human tissues. *BMC Genomics* 8, 166.

Luo, X., Lin, H., Lu, Y., Li, B., Xiao, J., Yang, B., & Wang, Z. (2007). Transcriptional activation by stimulating protein 1 and post-transcriptional repression by muscle-specific microRNAs of IKs-encoding genes and potential implications in regional heterogeneity of their expressions. *J Cell Physiol* 212, 358–367.

Luo, X., Lin, H., Pan, Z., Xiao, J., Zhang, Y., Lu, Y., Yang, B., & Wang, Z. (2008). Downregulation of miRNA-1/miRNA-133 contributes to re-expression of pacemaker channel genes HCN2 and HCN4 in hypertrophic heart. *J Biol Chem* 283, 20045–20052.

Luo, X., Zhang, H., Xiao, J., & Wang, Z. (2010). Regulation of human cardiac ion channel genes by microRNAs: Theoretical perspective and pathophysiological implications. *Cell Physiol Biochem* 25, 571–586.

Maisch, B. (1995). Extracellular matrix and cardiac interstitium: restriction is not a restricted phenomenon. *Herz* 20, 75–80.

Manabe, I., Shindo, T., & Nagai, R. (2002). Gene expression in fibroblasts and fibrosis involvement in cardiac hypertrophy. *Circ Res* 91, 1103–1113.

Nattel, S., Maguy, A., Le Bouter, S., & Yeh, Y-H. (2007). Arrhythmogenic ion-channel remodeling in the heart: heart failure, myocardial infarction, and atrial fibrillation. *Physiol Rev* 87, 425–456.

Ostenfeld, M.S., Bramsen, J.B., Lamy, P., Villadsen, S.B., Fristrup, N., Sørensen, K.D., Ulhøi, B., Borre, M., Kjems, J., Dyrskjøt, L., & Orntoft, T.F. (2010). miR-145 induces caspase-dependent and -independent cell death in urothelial cancer cell lines with targeting of an expression signature present in Ta bladder tumors. *Oncogene* 29(7), 1073–1084.

Palojoki, E., Saraste, A., Eriksson, A., Pullkki, K., Kallajoki, M., Voipio Pulkki, L.M., & Tikkanen, I. (2001). Cardiomyocyte apoptosis and ventricular remodeling after myocardial infarction in rats. *Am J Physiol Heart Circ Physiol* 280, H2726–H2731.

Patrick, D.M., Montgomery, R.L., Qi, X., Obad, S., Kauppinen, S., Hill, J.A., van Rooij, E., & Olson, E.N. (2010). Stress-dependent cardiac remodeling occurs in the absence of microRNA-21 in mice. *J Clin Invest* 120, 3912–3916.

Pellman, J., Lyon, R.C., & Sheikh, F. (2010). Extracellular matrix remodeling in atrial fibrosis: mechanisms and implications in atrial fibrillation. *J Mol Cell Cardiol* 48(3), 461–467.

Peters, N.S. (1995). Myocardial gap junction organization in ischemia and infarction. *Microsc Res Tech* 31, 375–386.

Pogwizd, S.M., & Bers, D.M. (2002). Na/Ca exchange in heart failure: contractile dysfunction and arrhythmogenesis. *Ann NY Acad Sci* 976, 454–465.

Pu, J., & Boyden, P.A. (1997). Alterations of Na+ currents in myocytes from epicardial border zone of the infarcted heart. A possible ionic mechanism for reduced excitability and postrepolarization refractoriness. *Circ Res* 81, 110–119.

Qian, L., Van Laake, L.W., Huang, Y., Liu, S., Wendland, M.F., & Srivastava, D. (2011). miR- […] pp549–560.

Rane, S., He, M., Sayed, D., Vashistha, H., Malhotra, A., Sadoshima, J., Vatner, D.E., Vatner, S.F., & Abdellatif, M. (2009). Downregulation of miR-199a derepresses hypoxia-inducible factor-1alpha and Sirtuin 1 and recapitulates hypoxia preconditioning in cardiac myocytes. *Circ Res* 104, 879–886.

Reed, J.C. (2002). Apoptosis-based therapies. *Nat Rev Drug Discov* 1, 111–121.

Rehmsmeier, M., Steffen, P., Hochsmann, M., & Giegerich, R. (2004). Fast and effective prediction of microRNA/target duplexes. *RNA* 10, 1507–1517.

Ren, X.P., Wu, J., Wang, X., Sartor, M.A., Qian, J., Jones, K., Nicolaou, P., Pritchard, T.J., & Fan, G.C. (2009). MicroRNA-320 is involved in the regulation of cardiac ischemia/reperfusion injury by targeting heat-shock protein 20. *Circulation* 119, 2357–2366.

Rose, J., Armoundas, A.A., Tian, Y., DiSilvestre, D., Burysek, M., Halperin, V., O'Rourke, B., Kass, D.A., Marbán, E., & Tomaselli, G.F. (2005). Molecular correlates of altered expression of potassium currents in failing rabbit myocardium. *Am J Physiol* 288, H2077–H2087.

Roy, S., Khanna, S., Hussain, S.R., Biswas, S., Azad, A., Rink, C., Gnyawali, S., Shilo, S., Nuovo, G.J., & Sen, C.K. (2009). MicroRNA expression in response to murine myocardial infarction: miR-21 regulates fibroblast metalloprotease-2 via phosphatase and tensin homologue. *Cardiovasc Res* 82, 21–29.

Sabbah, H.N., Sharov, V.G., & Goldstein, S. (1998). Programmed cell death in the progression of heart failure. *Ann Med* 30, S33–S38.

Sachdeva, M., & Mo, Y.Y. (2010). miR-145-mediated suppression of cell growth, invasion and metastasis. *Am J Transl Res* 2(2), 170–180.

Sanguinetti, M.C., Curran, M.E., Zou, A., Shen, J., Spector, P.S., Atkinson, D.L., & Keating, M.T. (1996). Coassembly of KvLQT1 and minK (IsK) proteins to form cardiac *I*Ks potassium channel. *Nature* 384, 80–83.

Sayed, D., Hong, C., Chen, I.Y., Lypowy, J., & Abdellatif, M. (2007). MicroRNAs play an essential role in the development of cardiac hypertrophy. *Circ Res* 100, 416–424.

Shan, H., Zhang, Y., Lu, Y., Zhang, Y., Pan, Z., Cai, B., Wang, N., Li, X., Feng, T., Hong, Y., & Yang, B. (2009). Downregulation of miR-133 and miR-590 contributes to nicotine-induced atrial remodelling in canines. *Cardiovasc Res* 83, 465–472.

Si, M.L., Zhu, S., Wu, H., Lu, Z., Wu, F., & Mo, Y.Y. (2007). miR-21-mediated tumor growth. *Oncogene* 26, 2799–2803.

Spear, J.F., Michelson, E.L., & Moore, E.N. (1983). Reduced space constant in slowly conducting regions of chronically infarcted canine myocardium. *Circ Res* 53, 176–185.

Swynghedauw, B. (1999). Molecular mechanisms of myocardial remodeling. *Physiol Rev* 79, 215–262.

Tan, A.Y., & Zimetbaum, P. (2010). Atrial Fibrillation and Atrial Fibrosis. *J Cardiovasc Pharmacol* 2010 Dec 4. [Epub ahead of print]

Tang, Y., Zheng, J., Sun, Y., Wu, Z., Liu, Z., & Huang, G. (2009). MicroRNA-1 regulates cardiomyocyte apoptosis by targeting Bcl-2. *Int Heart J* 50, 377–387.

Tatsuguchi, M., Seok, H.Y., Callis, T.E., Thomson, J.M., Chen, J.F., Newman, M., Rojas, M., Hammond, S.M., & Wang, D.Z. (2007). Expression of microRNAs is dynamically regulated during cardiomyocyte hypertrophy. *J Mol Cell Cardiol* 42, 1137–1141.

Thum, T., Galuppo, P., Wolf, C., Fiedler, J., Kneitz, S., van Laake, L.W., Doevendans, P.A., Mummery, C.L., Borlak, J., Haverich, A., Gross, C., Engelhardt, S., Ertl, G., & Bauersachs, J. (2007). MicroRNAs in the human heart: a clue to fetal gene reprogramming in heart failure. *Circulation* 116, 258–267.

Thum, T., Gross, C., Fiedler, J., Fischer, T., Kissler, S., Bussen, M., Galuppo, P., Just, S., Rottbauer, W., Frantz, S., Castoldi, M., Soutschek, J., Koteliansky, V., Rosenwald, A., Basson, M.A., Licht, J.D., Pena, J.T., Rouhanifard, S.H., Muckenthaler, M.U., Tuschl, T., Martin, G.R., Bauersachs, J., & Engelhardt, S. (2008). MicroRNA-21 contributes to myocardial disease by stimulating MAP kinase signalling in fibroblasts. *Nature* 456, 980–984.

Thum, T., Chau, N., Bhat, B., Gupta, S.K., Linsley, P.S., Bauersachs, J., & Engelhardt, S. (2011). Comparison of different miR-21 inhibitor chemistries in a cardiac disease model. *J Clin Invest* 121, 461–462.

Tsang, W.P., & Kwok, T.T. (2010). Epigallocatechin gallate up-regulation of miR-16 and induction of apoptosis in human cancer cells. *J Nutr Biochem* 21(2), 140–146.

Tsuji, Y., Opthof, T., Kamiya, K., Yasui, K., Liu, W., Lu, Z., & Kodama, I. (2000). Pacing-induced heart failure causes a reduction of delayed rectifier potassium currents along with decreases in calcium and transient outward currents in rabbit ventricle. *Cardiovasc Res* 48, 300–309.

van Rooij, E., Sutherland, L.B., Liu, N., Williams, A.H., McAnally, J., Gerard, R.D., Richardson, J.A., & Olson, E.N. (2006). A signature pattern of stress-responsive microRNAs that can evoke cardiac hypertrophy and heart failure. *Proc Natl Acad Sci USA* 103, 18255–18260.

van Rooij, E., Sutherland, L.B., Qi, X., Richardson, J.A., Hill, J., & Olson, E.N. (2007). Control of stress-dependent cardiac growth and gene expression by a microRNA. *Science* 316, 575–579.

van Rooij, E., Sutherland, L.B., Thatcher, J.E., DiMaio, J.M., Naseem, R.H., Marshall, W.S., Hill, J.A., & Olson, E.N. (2008). Dysregulation of microRNAs following myocardial

24 inhibits apoptosis and represses Bim in mouse cardiomyocytes. *J Exp Med* 208(3),


**25** 



### **Genome-Wide Identification of Estrogen Receptor Alpha Regulated miRNAs Using Transcription Factor Binding Data**

Jianzhen Xu1, Xi Zhou2 and Chi-Wai Wong3

*1College of Bioengineering, Henan University of Technology, Zhengzhou, 2Guangzhou Institute of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 3NeuMed Pharmaceuticals Limited, Hong Kong, China* 

#### **1. Introduction**

558 Bioinformatics – Trends and Methodologies


MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs that can repress protein translation or cause degradation of target mRNAs (Bartel 2004). Currently, 15,172 entries, including 1,048 human miRNAs, are recorded in miRBase, a major miRNA database (Release 16: Sept 2010) (Kozomara and Griffiths-Jones 2011). MiRNAs reside in protein-coding, intronic and intergenic regions throughout the genome. They are mainly transcribed into long primary miRNAs (pri-miRNAs) by RNA polymerase II (Lee et al. 2004). Since mammalian miRNA genes are often clustered along the genome, a pri-miRNA can contain a single miRNA gene or multiple clustered miRNA genes. In the nucleus, pri-miRNAs, which are both capped and polyadenylated, are processed by the RNase III enzyme Drosha into hairpins of about 70 nucleotides called pre-miRNAs (Lee et al. 2002). The transporter protein exportin-5 then exports pre-miRNAs to the cytoplasm, where they are cleaved by another RNase III enzyme, Dicer, to generate mature miRNA duplexes. One strand of the miRNA duplex preferentially enters the miRNA-induced silencing complex (miRISC) and guides the complex to recognize its target genes. Previous studies indicated that miRNAs inhibit their targets mainly via imperfect base pairing between target sequences in the 3' untranslated region (3'UTR) and the first 2–8 bases of the mature miRNA sequence, referred to as the "seed" region. MiRNAs play essential regulatory roles in diverse biological processes. For example, we recently found that miRNA-153, whose expression is significantly repressed in glioblastoma (GBM), can inhibit cell proliferation and induce apoptosis by targeting B-cell lymphoma 2 (Bcl-2), myeloid cell leukemia sequence 1 (Mcl-1) and insulin receptor substrate-2 (Irs-2) in glioblastoma cell lines (Xu, Liao, and Wong 2010; Xu et al. 2011).

In the past few years, computational approaches have played an important role in miRNA studies; for example, dozens of prediction tools for miRNA gene finding and miRNA target prediction have been developed. These tools have greatly facilitated experimental discovery. However, knowledge about how these essential regulators are themselves regulated is still at an early stage (Schanen and Li 2010; Li et al. 2010). Transcriptional regulation mediated by specific transcription factors (TFs) has been intensively studied for only a small number of miRNAs (Lee et al. 2004; Houbaviy et al. 2005). Importantly, certain "oncogenic miRNAs" and "tumour suppressor miRNAs" are inappropriately expressed in cancers. However, our understanding of the TFs or chromatin modifications that govern the expression levels of these essential miRNAs remains limited.

At the transcriptional level, gene expression is governed by interactions between TFs and cis-elements such as promoters and enhancers. Chromatin immunoprecipitation (ChIP) experiments detect specific protein-DNA interactions in a given cell type and are regarded as a major tool for investigating interactions between TFs and their binding sites. By pairing ChIP with DNA microarrays or high-throughput sequencing (ChIP-chip and ChIP-seq), genome-wide maps of TF binding sites can now be readily produced. Many groups have used ChIP-chip and ChIP-seq assays to study the direct targets of TFs globally, providing significant insights into gene regulation networks (Farnham 2009). Together with mRNA-based expression microarrays, vast amounts of data are publicly available for bioinformatic analysis. Indeed, network-level analyses of gene expression (systems biology) are gaining popularity as a way to uncover the underlying physiological regulation and to interpret the biological meaning behind these networks.

In principle, one can use the genome-wide binding map of a specific TF (or a chromatin-modifying factor) to search for its putative target miRNAs, i.e. to locate putative binding sites inside miRNA regulatory regions (such as promoters or enhancers) according to genomic coordinates. In the following sections, we present a procedure that uses published ChIP-chip data to predict candidate miRNAs regulated by a specific TF. Specifically, based on one genome-wide estrogen receptor (ER) binding map, we found 59 miRNA regulatory regions that contain at least one ER binding site. Several putative ER-regulated oncogenic and tumour suppressor miRNAs were further confirmed in a breast cancer cell model.

#### **2. Methods**

#### **2.1 Prediction of ER-regulated miRNAs**

Genomic sequences and functional annotations were accessed and analyzed using the UCSC Genome Browser (Rhead et al. 2010) and the Galaxy platform (Goecks, Nekrutenko, and Taylor 2010). The UCSC Genome Browser is a web tool for conveniently displaying and accessing genome sequences together with rich annotation tracks. Galaxy is an interactive system that combines multiple existing genome resources via a simple web portal; users can manipulate remote resources and perform flexible operations such as intersections, unions, and subtractions. Currently, 718 miRNAs are annotated in the human genome (hg18). Their regulatory regions (50 kb upstream of the pre-miRNAs) were collected from the UCSC Genome Browser. The original ChIP-chip data were produced by Carroll et al. (2006). In total, 3,665 estrogen receptor binding regions, which were considered high-confidence, were used in this study. Since the original coordinates were annotated in hg17, the web-based liftOver utility with default settings was used to convert them to the hg18 version of the human genome (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Overlapping regions were identified in Galaxy.
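The overlap step was performed in Galaxy, but its logic can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy coordinates, the 0-based half-open interval convention, and the strand-aware definition of "upstream" (lower coordinates on the plus strand, higher coordinates on the minus strand). It is not the Galaxy implementation.

```python
from collections import defaultdict

FLANK = 50_000  # size of the upstream window described in the text


def upstream_region(start, end, strand, flank=FLANK):
    """Return (start, end) of the window upstream of a pre-miRNA.

    Coordinates are assumed to be 0-based and half-open.
    """
    if strand == "+":
        return max(0, start - flank), start
    # On the minus strand, "upstream" lies at higher coordinates.
    return end, end + flank


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def find_bound_regions(mirnas, binding_sites):
    """Map each miRNA name to the binding sites inside its upstream window.

    mirnas: iterable of (name, chrom, start, end, strand)
    binding_sites: iterable of (chrom, start, end)
    """
    by_chrom = defaultdict(list)
    for chrom, s, e in binding_sites:
        by_chrom[chrom].append((s, e))

    hits = {}
    for name, chrom, start, end, strand in mirnas:
        r_start, r_end = upstream_region(start, end, strand)
        matched = [
            (s, e) for s, e in by_chrom.get(chrom, [])
            if overlaps(r_start, r_end, s, e)
        ]
        if matched:
            hits[name] = matched
    return hits


# Hypothetical toy data, not taken from the study.
toy_mirnas = [
    ("mir-A", "chr1", 100_000, 100_080, "+"),
    ("mir-B", "chr1", 300_000, 300_080, "-"),
    ("mir-C", "chr2", 500_000, 500_080, "+"),
]
toy_sites = [
    ("chr1", 60_000, 60_500),
    ("chr1", 200_000, 200_500),
    ("chr1", 349_900, 350_200),
    ("chr2", 100, 200),
]
print(find_bound_regions(toy_mirnas, toy_sites))
```

A linear scan per chromosome is adequate at this scale (hundreds of miRNAs against a few thousand binding regions); an interval tree or a sorted sweep would only matter for much larger inputs.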

#### **2.2 Motif analysis**

The ER binding regions that overlapped with miRNA regulatory sequences were downloaded from the UCSC Genome Browser and further analyzed with TOUCAN 2, a widely used regulatory sequence analysis suite (Aerts et al. 2005). TOUCAN 2 screens the input sequences against a precompiled library of motifs to find statistically over-represented motifs. TRANSFAC is a database of eukaryotic transcription factors and the DNA sequence elements that regulate transcription (Matys et al. 2006). Position weight matrices (PWMs) were obtained from the TRANSFAC 7.0 database. The Eukaryotic Promoter Database (EPD) is a collection of annotated eukaryotic RNA polymerase II promoter sequences around experimentally determined transcription start sites (TSSs) (Schmid et al. 2006). The human promoter sequences from EPD (-499 to 100 relative to the TSS) were used as background sequences. A zero-order Markov model with a prior of 0.1 was chosen to compute both the background and the actual sequence frequencies. The p-value and the significance value indicate the probability that the observed over-representation of a motif arises by random selection for a single TF or for multiple TFs, respectively.
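To make the ingredients of this analysis concrete, the following Python sketch scores a sequence with a PWM against a zero-order (single-nucleotide) background. It is a generic log-odds scan, not TOUCAN 2's algorithm. Treating the prior of 0.1 as a pseudocount added to every count is an assumption, and the function names and the toy count matrix are hypothetical.

```python
import math

BASES = "ACGT"


def background_freqs(seqs, prior=0.1):
    """Zero-order background: single-base frequencies with a pseudocount."""
    counts = {b: prior for b in BASES}
    for seq in seqs:
        for ch in seq.upper():
            if ch in counts:
                counts[ch] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def log_odds_matrix(count_matrix, background, prior=0.1):
    """Convert per-position base counts into log2 odds against the background.

    count_matrix: list of dicts mapping each base to its count at that position.
    """
    pwm = []
    for col in count_matrix:
        total = sum(col[b] + prior for b in BASES)
        pwm.append({
            b: math.log2(((col[b] + prior) / total) / background[b])
            for b in BASES
        })
    return pwm


def best_hit(seq, pwm):
    """Return (position, score) of the highest-scoring window, or None."""
    seq = seq.upper()
    width = len(pwm)
    best = None
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[j][ch] for j, ch in enumerate(window))
        if best is None or score > best[1]:
            best = (i, score)
    return best


# Hypothetical 4-column count matrix favouring the word TGAC.
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_background = background_freqs(["ACGTACGT"])
toy_pwm = log_odds_matrix(toy_counts, toy_background)
print(best_hit("AAATGACAA", toy_pwm))
```

Over-representation testing then compares how often such hits occur in the input regions with how often they occur in the background promoter set; that statistical step is not reproduced here.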

#### **2.3 Cell culture, cell counting and qRT-PCR**

The MCF-7 cell line was obtained from the American Type Culture Collection (Manassas, VA). Cells were grown in phenol red-free D-MEM supplemented with 0.5% charcoal-stripped FBS for 3 days. The estrogen-deprived MCF-7 cells were then treated with 10 nM 17β-estradiol (E2, Sigma-Aldrich Co.) or with DMSO as a control. At the indicated time points, cells were rinsed with PBS and counted manually under a microscope. Total RNA was extracted with Trizol reagent (Invitrogen). Reverse transcription of mature miRNAs and quantitative real-time PCR analysis were performed as previously reported (Xu, Liao, and Wong 2010). Primers specific for the indicated miRNAs are available upon request.

#### **3. Results**


#### **3.1 Identifying putative miRNAs regulated by ERs**

Estrogen receptor alpha (ERα) and beta (ERβ) are members of the nuclear receptor superfamily of ligand-regulated transcription factors. Estrogen (17β-estradiol, E2) is a potent ligand for both ERs. ERs either interact directly with cis-regulatory elements of target genes by binding to estrogen-response elements (EREs) or are tethered indirectly through transcription factors such as AP1 and SP1 (Ali and Coombes 2002). Through transcriptional control of a large number of target genes, ERs regulate a wide variety of cellular processes, including development and differentiation (Deroo and Korach 2006). In particular, ERα is thought to be involved in the progression of breast cancer. Depending on ERα expression status, breast cancer is classified into ERα+ and ERα- subtypes, and different anti-hormone treatments are prescribed in conjunction with anti-cancer drugs to manage these subtypes. Therefore, understanding how ER affects the expression of oncogenic and tumour suppressor miRNAs may assist the development of better anti-cancer therapies.

Carroll et al. previously used ChIP-chip technology to analyze ER binding regions genome-wide. They found that the majority of ER binding regions are located outside the classical promoter-proximal regions, suggesting distal regulation by ER (Carroll et al. 2006). To avoid the bias that would arise from focusing only on promoter-proximal regions, we regarded the 50 kb upstream of all known pre-miRNAs as possible regulatory regions; a wider candidate region should also be beneficial at this exploratory step. In total, 59 miRNA regulatory regions were found to overlap with 65 of the ER binding regions reported by Carroll et al. (Table 1). As shown in Figure 1, there are three representative patterns of ER binding regions relative to specific miRNAs. For example, the promoter of hsa-miR-342 contains both promoter-proximal and distal ER binding regions, whereas the promoter of hsa-miR-21 is characterized by two proximal ER binding regions. In contrast, there is only one distal ER binding region within the upstream regulatory region of the miR-143~145 cluster.

| chromosome | chrStart | chrEnd | miRNA regulatory regions | strand |
|---|---|---|---|---|
| chr1 | 149734895 | 149784895 | hsa-mir-554\_up\_50000\_chr1\_149734896\_f | + |
| chr1 | 160528959 | 160578959 | hsa-mir-556\_up\_50000\_chr1\_160528960\_f | + |
| chr1 | 154656845 | 154706845 | hsa-mir-9-1\_up\_50000\_chr1\_154656846\_r | - |
| chr1 | 153381591 | 153431591 | hsa-mir-92b\_up\_50000\_chr1\_153381592\_f | + |
| chr10 | 14468580 | 14518580 | hsa-mir-1265\_up\_50000\_chr10\_14468581\_f | + |
| chr10 | 134911115 | 134961115 | hsa-mir-202\_up\_50000\_chr10\_134911116\_r | - |
| chr10 | 98578511 | 98628511 | hsa-mir-607\_up\_50000\_chr10\_98578512\_r | - |
| chr10 | 29931281 | 29981281 | hsa-mir-938\_up\_50000\_chr10\_29931282\_r | - |
| chr11 | 74723878 | 74773878 | hsa-mir-326\_up\_50000\_chr11\_74723879\_r | - |
| chr11 | 2112015 | 2162015 | hsa-mir-483\_up\_50000\_chr11\_2112016\_r | - |
| chr11 | 64918504 | 64968504 | hsa-mir-612\_up\_50000\_chr11\_64918505\_f | + |
| chr12 | 96431720 | 96481720 | hsa-mir-135a-2\_up\_50000\_chr12\_96431721\_f | + |
| chr12 | 52621788 | 52671788 | hsa-mir-196a-2\_up\_50000\_chr12\_52621789\_f | + |
| chr12 | 63252555 | 63302555 | hsa-mir-548c\_up\_50000\_chr12\_63252556\_f | + |
| chr12 | 12910029 | 12960029 | hsa-mir-614\_up\_50000\_chr12\_12910030\_f | + |
| chr12 | 52664000 | 52714000 | hsa-mir-615\_up\_50000\_chr12\_52664001\_f | + |
| chr14 | 99595744 | 99645744 | hsa-mir-342\_up\_50000\_chr14\_99595745\_f | + |
| chr15 | 60853208 | 60903208 | hsa-mir-190\_up\_50000\_chr15\_60853209\_f | + |
| chr15 | 61950271 | 62000271 | hsa-mir-422a\_up\_50000\_chr15\_61950272\_r | - |
| chr15 | 78921469 | 78971469 | hsa-mir-549\_up\_50000\_chr15\_78921470\_r | - |
| chr15 | 68158861 | 68208861 | hsa-mir-629\_up\_50000\_chr15\_68158862\_r | - |
| chr15 | 73433079 | 73483079 | hsa-mir-631\_up\_50000\_chr15\_73433080\_r | - |
| chr16 | 84332807 | 84382807 | hsa-mir-1910\_up\_50000\_chr16\_84332808\_r | - |
| chr16 | 2211748 | 2261748 | hsa-mir-940\_up\_50000\_chr16\_2211749\_f | + |
| chr17 | 43469612 | 43519612 | hsa-mir-152\_up\_50000\_chr17\_43469613\_r | - |
| chr17 | 55223408 | 55273408 | hsa-mir-21\_up\_50000\_chr17\_55223409\_f | + |
| chr17 | 26876542 | 26926542 | hsa-mir-365-2\_up\_50000\_chr17\_26876543\_f | + |
| chr19 | 10473797 | 10523797 | hsa-mir-1238\_up\_50000\_chr19\_10473798\_f | + |
| chr19 | 58817033 | 58867033 | hsa-mir-1323\_up\_50000\_chr19\_58817034\_f | + |
| chr19 | 10789172 | 10839172 | hsa-mir-199a-1\_up\_50000\_chr19\_10789173\_r | - |
| chr19 | 58811744 | 58861744 | hsa-mir-512-1\_up\_50000\_chr19\_58811745\_f | + |
| chr19 | 58814222 | 58864222 | hsa-mir-512-2\_up\_50000\_chr19\_58814223\_f | + |
| chr2 | 232236267 | 232286267 | hsa-mir-1244\_up\_50000\_chr2\_232236268\_f | + |
| chr20 | 48664729 | 48714729 | hsa-mir-1302-5\_up\_50000\_chr20\_48664730\_r | - |
| chr20 | 48585729 | 48635729 | hsa-mir-645\_up\_50000\_chr20\_48585730\_f | + |
| chr20 | 61971237 | 62021237 | hsa-mir-941-1\_up\_50000\_chr20\_61971238\_f | + |
| chr20 | 61971544 | 62021544 | hsa-mir-941-2\_up\_50000\_chr20\_61971545\_f | + |
| chr20 | 61971656 | 62021656 | hsa-mir-941-3\_up\_50000\_chr20\_61971657\_f | + |
| chr22 | 36570324 | 36620324 | hsa-mir-658\_up\_50000\_chr22\_36570325\_r | - |
| chr22 | 36573727 | 36623727 | hsa-mir-659\_up\_50000\_chr22\_36573728\_r | - |
| chr3 | 187937154 | 187987154 | hsa-mir-1248\_up\_50000\_chr3\_187937155\_f | + |
| chr3 | 129513697 | 129563697 | hsa-mir-1280\_up\_50000\_chr3\_129513698\_f | + |
| chr3 | 50135762 | 50185762 | hsa-mir-566\_up\_50000\_chr3\_50135763\_f | + |
| chr3 | 113264337 | 113314337 | hsa-mir-567\_up\_50000\_chr3\_113264338\_f | + |
| chr4 | 8058008 | 8108008 | hsa-mir-95\_up\_50000\_chr4\_8058009\_r | - |
| chr5 | 167920556 | 167970556 | hsa-mir-103-1\_up\_50000\_chr5\_167920557\_r | - |
| chr5 | 41461490 | 41511490 | hsa-mir-1274a\_up\_50000\_chr5\_41461491\_f | + |
| chr5 | 132791297 | 132841297 | hsa-mir-1289-2\_up\_50000\_chr5\_132791298\_r | - |
| chr5 | 153656858 | 153706858 | hsa-mir-1294\_up\_50000\_chr5\_153656859\_f | + |
| chr5 | 148738673 | 148788673 | hsa-mir-143\_up\_50000\_chr5\_148738674\_f | + |
| chr5 | 148740401 | 148790401 | hsa-mir-145\_up\_50000\_chr5\_148740402\_f | + |
| chr6 | 166842911 | 166892911 | hsa-mir-1913\_up\_50000\_chr6\_166842912\_r | - |
| chr6 | 135551990 | 135601990 | hsa-mir-548a-2\_up\_50000\_chr6\_135551991\_f | + |
| chr6 | 107288692 | 107338692 | hsa-mir-587\_up\_50000\_chr6\_107288693\_f | + |
| chr7 | 101833307 | 101883307 | hsa-mir-548o\_up\_50000\_chr7\_101833308\_r | - |
| chr8 | 128827389 | 128877389 | hsa-mir-1204\_up\_50000\_chr8\_128827390\_f | + |
| chr8 | 128992060 | 129042060 | hsa-mir-1205\_up\_50000\_chr8\_128992061\_f | + |
| chr8 | 129181543 | 129231543 | hsa-mir-1208\_up\_50000\_chr8\_129181544\_f | + |
| chr8 | 1702803 | 1752803 | hsa-mir-596\_up\_50000\_chr8\_1702804\_f | + |
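The region labels in the table follow a regular pattern (miRNA name, `up`, flank size, chromosome, a start coordinate, and an `f` or `r` suffix), so rows can be sanity-checked programmatically. The Python sketch below is one hedged way to do so. The regular expression, the function names, and the reading of `f`/`r` as plus/minus strand are assumptions inferred from the table itself, and the labels are written with plain underscores rather than the escaped form used in the text.

```python
import re

# Assumed label layout: <mirna>_up_<flank>_<chrom>_<1-based start>_<f|r>
LABEL_RE = re.compile(
    r"^(?P<mirna>hsa-mir-[0-9A-Za-z-]+)"
    r"_up_(?P<flank>\d+)"
    r"_(?P<chrom>chr[0-9A-Za-z]+)"
    r"_(?P<start1>\d+)"
    r"_(?P<orient>[fr])$"
)


def parse_region_label(label):
    """Split a region label into its fields; raise ValueError if malformed."""
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognised region label: {label!r}")
    return {
        "mirna": m.group("mirna"),
        "flank": int(m.group("flank")),
        "chrom": m.group("chrom"),
        "start1": int(m.group("start1")),
        # Assumption: 'f' denotes the plus strand and 'r' the minus strand.
        "strand": "+" if m.group("orient") == "f" else "-",
    }


def row_is_consistent(chrom, chr_start, chr_end, label):
    """Check that a table row agrees with the fields encoded in its label."""
    fields = parse_region_label(label)
    return (
        fields["chrom"] == chrom
        and chr_end - chr_start == fields["flank"]
        # The label start appears to be chrStart + 1 (1-based vs 0-based).
        and fields["start1"] == chr_start + 1
    )


print(parse_region_label("hsa-mir-143_up_50000_chr5_148738674_f"))
print(row_is_consistent("chr5", 148738673, 148788673,
                        "hsa-mir-143_up_50000_chr5_148738674_f"))
```

Running such a check over all 59 rows is a cheap guard against transcription errors when the table is re-keyed from the printed page.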



Table 1. miRNA regulatory regions that overlap ER binding regions. Each miRNA regulatory region is annotated with its chromosome, start and end positions, and the strand on which it resides.

Fig. 1. ER binding sites relative to specific miRNAs. The blue boxes represent ER binding regions and the black blocks represent the 50 kb upstream of each miRNA. The pre-miRNAs miR-342, miR-21 and miR-143~145 are shown in red. Note that there are two ER binding regions within each of the miR-342 and miR-21 regulatory regions, whereas miR-143 and miR-145 share the same ER binding region.

Table 1 (continued).

| chromosome | chrStart | chrEnd | miRNA regulatory regions | strand |
|---|---|---|---|---|
| chr5 | 148738673 | 148788673 | hsa-mir-143\_up\_50000\_chr5\_148738674\_f | + |
| chr5 | 148740401 | 148790401 | hsa-mir-145\_up\_50000\_chr5\_148740402\_f | + |
| chr6 | 166842911 | 166892911 | hsa-mir-1913\_up\_50000\_chr6\_166842912\_r | - |
| chr6 | 135551990 | 135601990 | hsa-mir-548a-2\_up\_50000\_chr6\_135551991\_f | + |
| chr6 | 107288692 | 107338692 | hsa-mir-587\_up\_50000\_chr6\_107288693\_f | + |
| chr7 | 101833307 | 101883307 | hsa-mir-548o\_up\_50000\_chr7\_101833308\_r | - |
| chr8 | 128827389 | 128877389 | hsa-mir-1204\_up\_50000\_chr8\_128827390\_f | + |
| chr8 | 128992060 | 129042060 | hsa-mir-1205\_up\_50000\_chr8\_128992061\_f | + |
| chr8 | 129181543 | 129231543 | hsa-mir-1208\_up\_50000\_chr8\_129181544\_f | + |
| chr8 | 1702803 | 1752803 | hsa-mir-596\_up\_50000\_chr8\_1702804\_f | + |
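The overlap test behind Table 1 can be illustrated with a minimal Python sketch. It assumes closed genomic intervals and a simple per-chromosome comparison; the `Region` type, the function names and the two ER binding regions are hypothetical and serve only as an illustration, while the two miRNA windows are copied from Table 1. It is not the procedure actually used to build the table.

```python
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Region, b: Region) -> bool:
    """True when two closed intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def regions_bound_by_er(mirna_regions, er_regions):
    """Return the miRNA regulatory regions that overlap any ER binding region."""
    # Group ER regions by chromosome so each miRNA window is compared
    # only against candidates on its own chromosome.
    by_chrom = defaultdict(list)
    for er in er_regions:
        by_chrom[er.chrom].append(er)
    return [m for m in mirna_regions
            if any(overlaps(m, er) for er in by_chrom[m.chrom])]


# Two upstream 50 kb windows copied from Table 1.
mirna_regions = [
    Region("chr14", 99595744, 99645744, "hsa-mir-342"),
    Region("chr17", 55223408, 55273408, "hsa-mir-21"),
]
# Hypothetical ER binding regions, invented for this illustration only.
er_regions = [
    Region("chr14", 99600000, 99600800, "ER_example_1"),
    Region("chr17", 10000000, 10000800, "ER_example_2"),
]

hits = regions_bound_by_er(mirna_regions, er_regions)
print([m.name for m in hits])  # prints ['hsa-mir-342']
```

For genome-wide inputs, a sorted sweep or an interval tree would scale better than this pairwise comparison.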


We then used TOUCAN 2 to test whether any TF binding sites are over-represented in these miRNA-related ER-binding regions. The top 10 significant binding motifs are listed in Table 2. Not surprisingly, the consensus ERE (AGGTCANNNTGACC) was the most common TF binding motif in the miRNA regulatory regions bound by ER. In addition, we observed enrichment of activator protein 1 (AP-1) and forkhead (FKH) motifs among the miRNA-related ER-binding regions. The AP-1 family consists of proteins belonging to the JUN, FOS and ATF subfamilies. These subunits can heterodimerize and bind to the DNA of their target genes. The AP-1 complex modulates a variety of cellular processes in response to environmental stimuli. In particular, the AP-1 complex is an important regulator of tumour development, since its target genes are involved in oncogenic transformation, tumour suppression, invasive growth and angiogenesis (Wagner 2001; Jochum, Passegue, and Wagner 2001). FKH proteins are a superfamily of transcription factors that regulate the expression of genes involved in cell growth, proliferation and differentiation. Many FKH proteins are important for embryonic development, glucose homeostasis, tumourigenesis and even vocal learning (Hannenhalli and Kaestner 2009). In a previous analysis of mRNA targets, these two binding motifs were also shown to be enriched in ER binding regions, suggesting a role in the transcription of ER-regulated mRNAs (Carroll et al. 2006). Our findings further imply that AP-1 and forkhead family members cooperate as transcription factors to regulate ER-responsive miRNAs in a combinatorial fashion.
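As a simplified illustration of what locating the consensus ERE in a sequence involves, the sketch below scans both strands for AGGTCANNNTGACC with a regular expression. This is an assumption-laden stand-in: TOUCAN 2 scores sequences against TRANSFAC matrices rather than matching an exact consensus, and the function names and the toy sequence are hypothetical.

```python
import re

# Consensus ERE quoted in the text; "N" stands for any base.
ERE_CONSENSUS = "AGGTCANNNTGACC"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string made of A, C, G, T and N."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead allows overlapping hits."""
    body = consensus.replace("N", "[ACGT]")
    return re.compile(f"(?=({body}))")


def find_sites(sequence, consensus=ERE_CONSENSUS):
    """Return sorted (0-based start, strand, plus-strand match) tuples."""
    sequence = sequence.upper()
    hits = []
    # A site on the minus strand appears on the plus strand as the
    # reverse complement of the consensus.
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        for match in consensus_to_regex(motif).finditer(sequence):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)


# Toy sequence invented for this example.
for hit in find_sites("TTTAGGTCAGCATGACCTTT"):
    print(hit)
```

Because the 14-base consensus is one base short of the full palindrome, the same site can be reported on both strands with start positions that differ by one, as happens with the toy sequence here.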

In our results, the p53 motif is the fourth most significantly enriched binding site in ER binding regions. p53 is an essential tumour suppressor: mutations or aberrant expression of the p53 gene are frequently observed in a variety of cancer cell lines and clinical tumour samples (Nigro et al. 1989). Liu et al. also found that ERα can bind directly to p53 and repress its target genes (Liu et al. 2006). This finding has profound translational implications because the same group of investigators recently demonstrated that (1) ionizing radiation disrupts the ERα–p53 interaction in breast tumours, which functionally restores p53 in breast tumours subjected to radiation therapy and reveals a novel mechanism underlying the anti-tumour effect of radiation therapy (Liu et al. 2009); and (2) the presence of wild-type p53 is an important determinant of responsiveness to anti-estrogen therapy, since anti-estrogens can reactivate p53 by disrupting the ERα–p53 interaction, after which p53 activates many tumour suppressor genes (Konduri et al. 2010). By analogy with mRNA target regulation, we therefore hypothesize that the ERα–p53 interaction may also be involved in modulating the transcription of "oncogenic miRNAs" and "tumour suppressor miRNAs", although the exact mechanism needs further analysis. Besides acting as co-regulators of ERα, the enriched TFs may also interact with each other. For example, cross-talk between the glucocorticoid receptor (GR) and AP-1 is well established (Herrlich 2001). In our results, both GR and AP-1 motifs are enriched in the binding regions; whether such interactions are involved in miRNA target regulation warrants further investigation.

In the original analysis of binding sites in mRNA promoters, Carroll et al. found a strong correlation among ERα, Forkhead, Oct, AP-1 and C/EBP (Carroll et al. 2006). In our analysis of miRNA promoter regions, we did not find significant enrichment for Oct and C/EBP. Interestingly, however, several novel TF binding sites (v-Maf, Meis-1, p53, GR-α, RORα1, Hand1) are over-represented in miRNA-related ER-binding regions. This observation may reflect both similar (in the case of the common TFs, i.e. ERα, FKH and AP-1) and distinct modes of ERα modulation in miRNA and mRNA gene regulation.
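The relationship between the P-values and sig values reported in Table 2 can be checked numerically. The sketch below assumes that the sig value is the negative base-10 logarithm of the P-value multiplied by the number of motifs tested, a definition common in motif over-representation statistics but not stated in this chapter. The number of tests, 261, is likewise an assumption: it is the value that reproduces the reported figures, not a number given in the text.

```python
import math

# (TRANSFAC motif, reported P-value, reported sig value), copied from Table 2.
TABLE2 = [
    ("M00191", 1.76e-13, 10.338),
    ("M00035", 1.52e-7, 4.401),
    ("M00419", 1.22e-6, 3.498),
    ("M00272", 2.32e-5, 2.217),
    ("M00192", 4.13e-5, 1.967),
    ("M00199", 5.54e-5, 1.84),
    ("M00156", 7.24e-5, 1.723),
    ("M00291", 8.48e-5, 1.655),
    ("M00222", 1.86e-4, 1.313),
    ("M00269", 2.14e-4, 1.254),
]

# Assumed number of motifs tested; inferred from the table, not from the text.
ASSUMED_TESTS = 261


def sig_value(p_value, n_tests=ASSUMED_TESTS):
    """Assumed definition: sig = -log10(P-value * number of tests)."""
    return -math.log10(p_value * n_tests)


for motif, p_value, reported in TABLE2:
    print(f"{motif}: computed {sig_value(p_value):.3f}, reported {reported}")
```

Under this reading, a positive sig value corresponds to fewer than one expected false positive across all motifs tested, which matches the caption's remark that positive sig values generally indicate significance.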



| TRANSFAC Motif | N | P-value | Sig value | TFs |
|---|---|---|---|---|
| M00191 | 33 | 1.76E-13 | 10.338 | ER-α |
| M00035 | 21 | 1.52E-7 | 4.401 | v-Maf |
| M00419 | 28 | 1.22E-6 | 3.498 | Meis-1 |
| M00272 | 21 | 2.32E-5 | 2.217 | P53 |
| M00192 | 34 | 4.13E-5 | 1.967 | GR-α |
| M00199 | 28 | 5.54E-5 | 1.84 | AP-1 |
| M00156 | 16 | 7.24E-5 | 1.723 | RORα1 |
| M00291 | 16 | 8.48E-5 | 1.655 | FOXC1 (Forkhead box protein C1) |
| M00222 | 22 | 1.86E-4 | 1.313 | E47 (Hand1) |
| M00269 | 24 | 2.14E-4 | 1.254 | Xenopus fork head domain factor 3 |

Table 2. Top 10 enriched motifs in the miRNA-related ER-binding sites. N: number of times the TF site appears in the input sequences; note that a TF binding site may appear more than once in one sequence. P-value: probability of finding even more occurrences than N in the input sequences. Sig value: a significance coefficient used to select the most over-represented patterns among the distinct motifs. When only one TF site is analysed, a P-value smaller than 0.05 can be taken to indicate over-representation. When multiple TF sites are analysed, the sig value is used to select the significant results; generally, a positive sig value indicates significance.

#### **3.2 Confirmation of ER-regulated miRNA in a breast cancer cell model**

MCF-7 is a well-established ERα+ cell line that models hormone-dependent breast cancer; namely, E2 increases MCF-7 cell proliferation. In our hands, the cell number increased significantly after four days of treatment with 10 nM E2 compared with the DMSO control, and a late-phase increase in cell number was observed starting on day 9 (Figure 2a). Among the predicted E2-regulated miRNAs, we randomly selected eight and used qRT-PCR to measure the time-dependent changes in their expression levels during cell proliferation. Compared with the DMSO control, miR-342, miR-21, miR-422a, miR-124 and miR-181c were generally up-regulated by E2 treatment, whereas miR-143, miR-145 and miR-483 were down-regulated (Figure 2b and 2c), suggesting that they are under the influence of positive and negative EREs, respectively. Intriguingly, the down-regulated miRNAs exhibit wave-like patterns of expression, i.e., significant suppression on day 4 and varying degrees of restoration on day 7, followed by another round of suppression and partial rebound. Apart from miR-124, which displays a wave-like pattern of induction, the E2-induced miRNAs show a gradual pattern of induction. The determinants and regulatory networks that dictate these patterns of expression await comprehensive investigation.

Of the up-regulated miRNAs, miR-342 was induced to the greatest extent by E2. MiR-342 is encoded in an intron of the gene EVL and is commonly suppressed in human colorectal cancer (Grady et al. 2008). Over-expression of miR-342 in the colorectal cancer cell line HT-29 induced apoptosis, pointing towards a pro-apoptotic tumour suppressor function (Grady et al. 2008). In breast tumours, on the other hand, the picture is more complicated: miR-342 expression is highest in ER-positive and HER2/neu-positive luminal B tumours but lowest in ER, PR and HER2/neu triple-negative tumours (Lowery et al. 2009). Adding to the uncertainty regarding its role, miR-342 is down-regulated in tamoxifen-resistant MCF-7 cells (Miller et al. 2008). Consistent with these findings, Cittelly et al. compared miRNA expression profiles between MCF-7/pcDNA (tamoxifen-sensitive) and MCF-7/HER2Δ16 (tamoxifen-resistant) cells when both cell lines were treated for 24 hr with 100 pM 17-β-estradiol (E2) and 1 μM 4-hydroxytamoxifen (TAM). They found that miR-342 was the most dramatically down-regulated miRNA in the tamoxifen-resistant MCF-7/HER2Δ16 cells. They further showed that other tamoxifen-resistant cell lines, such as TAMR1 and LCC2, exhibited dramatically suppressed levels of miR-342, whereas another tamoxifen-sensitive cell line, MCF-7/HER2, expressed high levels of miR-342, indicating that loss of miR-342 is a common feature of tamoxifen resistance (Cittelly et al. 2010).

Fig. 2. ER binding sites relative to specific miRNAs. (a) Effect of E2 (10 nM) on MCF-7 cell number over the course of 12 days. (b-c) E2 induces expression of miR-342, miR-21, miR-422a, miR-124 and miR-181c, whereas it decreases expression of miR-143, miR-145 and miR-483 in MCF-7 cells. Cells were treated with E2 (10 nM) for the indicated times and miRNAs were subjected to qRT-PCR analysis.

The expression level of miR-21 has previously been found to change significantly in various cancers; in particular, it is higher in ERα+ than in ERα– breast tumours (Mattie et al. 2006; Volinia et al. 2006). However, inconsistent results have been reported regarding the effect of E2 on miR-21 expression in MCF-7 cells. Wickramasinghe et al. reported that E2 inhibited miR-21 expression after 6 hr (Wickramasinghe et al. 2009). In contrast, another group found that miR-21 was induced after a 4 hr E2 treatment (Bhat-Nakshatri et al. 2009). In our investigation, we found that miR-21 was initially repressed after 4 days on E2 but was up-regulated by E2 upon long-term treatment. MiR-21 is thought to be an oncogenic miRNA (oncomiR), and several of its confirmed endogenous targets, such as PDCD-4 and PTEN, are important tumour suppressors (Asangani et al. 2008; Folini et al. 2010; Meng et al. 2007). Consistent with its proposed role as an oncomiR, our results showed that miR-21 expression increased progressively from day 7 to day 12, in parallel with the late-phase increase in cell number.

Fig. 3. ER, AP-1 and p53 binding motifs relative to miR-342, miR-21 and miR-143~145. The red, blue and pink boxes represent ER, AP-1 and p53 binding sites, respectively. ER\_2920 and ER\_2921 are located in the miR-342 upstream region; ER\_3188 and ER\_3189 are located in the miR-21 regulatory region; ER\_1259 is located in the miR-21 upstream region (please refer to Figure 1).

MiR-143 and miR-145 are clustered miRNAs whose expression levels are co-ordinately down-regulated in multiple forms of cancer (Akao et al. 2007; Michael et al. 2003; Sevignani et al. 2007; Wang et al. 2008). They can function as important tumour suppressors by targeting multiple key genes in apoptosis, proliferation and metastasis signalling pathways (Chen et al. 2009; Chiyomaru et al. 2010; Sevignani et al. 2007; Zaman et al. 2010). However, whether miR-143 and miR-145 are regulated by E2 in MCF-7 breast cancer cells was unclear. In this study, we found that both were repressed by E2 during long-term treatment. Importantly, we also observed that ectopic expression of miR-145 repressed MCF-7 cell proliferation (data not shown). These observations are consistent with previous studies in other cancers, which indicate that miR-145 is repressed in cancers compared with normal controls.

Analyzing the miR-342, miR-21 and miR-143~145 regulatory regions, we found AP-1 binding motifs in all of them, supporting the role of AP-1 as a basal activator (Figure 3) (Wagner 2001). In addition, there are both ER and p53 motifs in the miR-342 upstream region; miR-342 may therefore be a dual target of these two TFs, and its expression perhaps depends on both the integrity of the estrogen signalling pathway and the status of p53. Estrogen-response elements were detected in both the miR-342 and miR-143 promoter regions, indicating direct estrogen receptor binding. However, no ERE is present in the miR-21 upstream regulatory region, and it is possible that transcriptional activation of miR-21 is mediated via estrogen receptor tethered to AP-1 motifs.
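The reasoning in this paragraph can be written down as a small decision rule. The sketch below is only an illustration of that logic: the motif sets list just the motifs explicitly mentioned in the text for each region, and the function name and category labels are hypothetical. It proposes a candidate mechanism and is no substitute for experimental validation.

```python
# Motifs named in the text for each regulatory region; motifs the text
# does not mention for a region are simply left out.
MOTIFS = {
    "miR-342": {"ERE", "AP-1", "p53"},
    "miR-21": {"AP-1"},
    "miR-143~145": {"ERE", "AP-1"},
}


def candidate_er_mechanism(motifs):
    """Suggest how ER might act on a region, given the motifs found in it."""
    if "ERE" in motifs:
        # An ERE allows ER to bind the DNA directly.
        return "direct ERE binding"
    if "AP-1" in motifs:
        # Without an ERE, ER may be recruited through AP-1 bound to its site.
        return "tethering via AP-1"
    return "undetermined"


for mirna, motifs in MOTIFS.items():
    note = " (possible co-regulation by p53)" if "p53" in motifs else ""
    print(f"{mirna}: {candidate_er_mechanism(motifs)}{note}")
```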

Apart from miR-342, miR-21, miR-143 and miR-145, little is known about the roles of the other miRNAs in breast cancer development. Since our analysis has already implicated several oncogenic and tumour suppressor miRNAs as being regulated by ER, we believe that this strategy can provide promising miRNA candidates for additional functional exploration.

#### **4. Discussion**


Understanding gene regulation is crucial to elucidating the mechanisms of development, differentiation and signalling response. Over the past three decades, advances in technologies such as genomic sequencing and expression profiling by microarray have paved the way for more thorough investigations into gene regulatory networks. These advances have also necessitated the development of bioinformatics: analytical tools and methods are continually being invented for processing the vast amount of information generated and for mining the corresponding datasets, so that new discoveries are made and novel concepts are developed for hypothesis building and testing. Nonetheless, the accumulation of datasets sometimes outpaces the development of bioinformatics, and a certain amount of valuable information is left unmined. In this chapter, we present a case of using established bioinformatics tools to learn more about gene regulatory networks based on published transcription factor binding datasets.

We used previously published ER ChIP-chip data to find a set of putative ER-regulated miRNAs. This concept and method can be extended in several ways. Firstly, several ChIP-based techniques, such as ChIP-PET (paired-end tag) and ChIP-DSL (DNA selection and ligation), have been developed to map TF binding sites. The genome-wide TF binding sites generated by these variations of the ChIP-chip technique could also be used to map miRNA promoters. Secondly, our methods can be extended to other nuclear hormone receptors (NHRs) and TFs, provided that the corresponding genomic coordinates of TF binding are available. Importantly, the specificity of TF binding sites could be investigated by comparing different but related TF binding data. For example, by comparing ER and estrogen-related receptor (ERR) binding data in the breast cancer cell line MCF-7, Deblois et al. recently showed that ERR and ER display strict binding site specificity, while a small number of binding sites are shared by both transcription factors (Deblois et al. 2009). Another prominent feature of this versatile procedure lies in its ease of application and low cost. In recent years, ChIP-based techniques have become popular assays for studying the direct targets of TFs genome-wide. For example, many NHR binding maps have been published (Deblois and Giguere 2008). Surprisingly, few miRNAs regulated by a specific NHR have been mined from these valuable datasets. The miRNAs directly targeted by a specific NHR or TF can be readily discovered through our procedure if the genome-wide binding sites for this TF have already been produced by others. Thus, it avoids redundant experiments and greatly facilitates rapid discovery.

miRNA microarray analysis is commonly used to identify changes in miRNA expression upon a specific treatment. However, the microarray platform has some inherent limitations. For instance, microarray data usually mix primary, secondary, and even tertiary gene expression changes, making it difficult to dissect which TFs are responsible for these different levels of regulation. Our procedure directly links the candidate TFs with putative target miRNAs by analyzing ChIP-chip and ChIP-seq binding data. Uniquely, our analysis also allows investigation of the relationships between mRNAs and miRNAs co-ordinately regulated by a specific TF in a given cell type upon a particular treatment, providing an entirely new set of information not revealed by microarray analysis alone. However, it should be noted that not all regulatory regions are included in the original design of the ChIP-chip platform. Thus, our analysis can only provide a partial picture that depends on the completeness of the ChIP-chip design. As more comprehensive technologies such as ChIP-seq come into use, genomic coverage will improve significantly. In addition, TF binding sites may be located outside the 50 kb upstream regulatory region defined in our analysis. Therefore, it is best to complement ChIP data analysis with microarray studies to obtain comprehensive information on TF and miRNA regulatory networks.

#### **5. Conclusion**

Understanding the relationships between transcription factors and their target mRNAs is greatly facilitated by genome-wide analysis based on the pairing of chromatin immunoprecipitation with DNA microarray. However, few miRNAs regulated by transcription factors have been mined from these data. Our bioinformatics procedure efficiently utilizes genome-wide binding data to screen the upstream regulatory regions of all human miRNAs and hunt for miRNA targets modulated by a specific transcription factor. As an example, we predicted 59 putative estrogen-responsive miRNAs based on a published genome-wide ER binding dataset. Several ER-regulated miRNAs were further confirmed in a breast cancer cell model. Among these, miR-342, miR-21, miR-422a, miR-124, and miR-181c were generally found to be up-regulated by estrogen treatment, whereas miR-143, miR-145, and miR-483 were down-regulated. This example demonstrated the power and efficiency of this novel analysis method. Furthermore, it indicated that the miRNA targets of a specific TF can equally be detected from ChIP-chip based binding data, which are usually produced for identifying mRNA targets. Integrating our method with routine analysis procedures will provide a full picture of the gene regulation network by simultaneously elucidating the miRNA and mRNA targets of a specific TF.

#### **6. Acknowledgment**

We are indebted to Ms. Xuemei Liao for assistance with graphic preparation. The research is supported by the Science Foundation of the Education Department of Henan Province (Grant No. 2011A180009) and a start-up grant from Henan University of Technology (#2009BS040).

#### **7. References**

Aerts, S., P. Van Loo, G. Thijs, H. Mayer, R. de Martin, Y. Moreau, and B. De Moor. 2005. TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res 33 (Web Server issue):W393-6.

Akao, Y., Y. Nakagawa, Y. Kitade, T. Kinoshita, and T. Naoe. 2007. Downregulation of microRNAs-143 and -145 in B-cell malignancies. Cancer Sci 98 (12):1914-20.

Ali, S., and R. C. Coombes. 2002. Endocrine-responsive breast cancer and strategies for combating resistance. Nat Rev Cancer 2 (2):101-12.

Asangani, I. A., S. A. Rasheed, D. A. Nikolova, J. H. Leupold, N. H. Colburn, S. Post, and H. Allgayer. 2008. MicroRNA-21 (miR-21) post-transcriptionally downregulates tumor suppressor Pdcd4 and stimulates invasion, intravasation and metastasis in colorectal cancer. Oncogene 27 (15):2128-36.

Bartel, D. P. 2004. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116 (2):281-97.

Bhat-Nakshatri, P., G. Wang, N. R. Collins, M. J. Thomson, T. R. Geistlinger, J. S. Carroll, M. Brown, S. Hammond, E. F. Srour, Y. Liu, and H. Nakshatri. 2009. Estradiol-regulated microRNAs control estradiol response in breast cancer cells. Nucleic Acids Res 37 (14):4850-61.

Carroll, J. S., C. A. Meyer, J. Song, W. Li, T. R. Geistlinger, J. Eeckhoute, A. S. Brodsky, E. K. Keeton, K. C. Fertuck, G. F. Hall, Q. Wang, S. Bekiranov, V. Sementchenko, E. A. Fox, P. A. Silver, T. R. Gingeras, X. S. Liu, and M. Brown. 2006. Genome-wide analysis of estrogen receptor binding sites. Nat Genet 38 (11):1289-97.

Chen, X., X. Guo, H. Zhang, Y. Xiang, J. Chen, Y. Yin, X. Cai, K. Wang, G. Wang, Y. Ba, L. Zhu, J. Wang, R. Yang, Y. Zhang, Z. Ren, K. Zen, J. Zhang, and C. Y. Zhang. 2009. Role of miR-143 targeting KRAS in colorectal tumorigenesis. Oncogene 28 (10):1385-92.

Chiyomaru, T., H. Enokida, S. Tatarano, K. Kawahara, Y. Uchida, K. Nishiyama, L. Fujimura, N. Kikkawa, N. Seki, and M. Nakagawa. 2010. miR-145 and miR-133a function as tumour suppressors and directly regulate FSCN1 expression in bladder cancer. Br J Cancer 102 (5):883-91.

Cittelly, D. M., P. M. Das, N. S. Spoelstra, S. M. Edgerton, J. K. Richer, A. D. Thor, and F. E. Jones. 2010. Downregulation of miR-342 is associated with tamoxifen resistant breast tumors. Mol Cancer 9:317.

Deblois, G., and V. Giguere. 2008. Nuclear receptor location analyses in mammalian genomes: from gene regulation to regulatory networks. Mol Endocrinol 22 (9):1999-2011.

Deblois, G., J. A. Hall, M. C. Perry, J. Laganiere, M. Ghahremani, M. Park, M. Hallett, and V. Giguere. 2009. Genome-wide identification of direct target genes implicates estrogen-related receptor alpha as a determinant of breast cancer heterogeneity. Cancer Res 69 (15):6149-57.

Deroo, B. J., and K. S. Korach. 2006. Estrogen receptors and human disease. J Clin Invest 116 (3):561-70.

Farnham, P. J. 2009. Insights from genomic profiling of transcription factors. Nat Rev Genet 10 (9):605-16.

Folini, M., P. Gandellini, N. Longoni, V. Profumo, M. Callari, M. Pennati, M. Colecchia, R. Supino, S. Veneroni, R. Salvioni, R. Valdagni, M. G. Daidone, and N. Zaffaroni. 2010. miR-21: an oncomir on strike in prostate cancer. Mol Cancer 9:12.

Goecks, J., A. Nekrutenko, and J. Taylor. 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11 (8):R86.

Grady, W. M., R. K. Parkin, P. S. Mitchell, J. H. Lee, Y. H. Kim, K. D. Tsuchiya, M. K. Washington, C. Paraskeva, J. K. Willson, A. M. Kaz, E. M. Kroh, A. Allen, B. R. Fritz, S. D. Markowitz, and M. Tewari. 2008. Epigenetic silencing of the intronic microRNA hsa-miR-342 and its host gene EVL in colorectal cancer. Oncogene 27 (27):3880-8.

Hannenhalli, S., and K. H. Kaestner. 2009. The evolution of Fox genes and their role in development and disease. Nat Rev Genet 10 (4):233-40.

Herrlich, P. 2001. Cross-talk between glucocorticoid receptor and AP-1. Oncogene 20 (19):2465-75.

Houbaviy, H. B., L. Dennis, R. Jaenisch, and P. A. Sharp. 2005. Characterization of a highly variable eutherian microRNA gene. RNA 11 (8):1245-57.

Jochum, W., E. Passegue, and E. F. Wagner. 2001. AP-1 in mouse development and tumorigenesis. Oncogene 20 (19):2401-12.

Konduri, S. D., R. Medisetty, W. Liu, B. A. Kaipparettu, P. Srivastava, H. Brauch, P. Fritz, W. M. Swetzig, A. E. Gardner, S. A. Khan, and G. M. Das. 2010. Mechanisms of estrogen receptor antagonism toward p53 and its implications in breast cancer therapeutic response and stem cell regulation. Proc Natl Acad Sci U S A 107 (34):15081-6.

Kozomara, A., and S. Griffiths-Jones. 2011. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39 (Database issue):D152-7.

Lee, Y., K. Jeon, J. T. Lee, S. Kim, and V. N. Kim. 2002. MicroRNA maturation: stepwise processing and subcellular localization. EMBO J 21 (17):4663-70.

Lee, Y., M. Kim, J. Han, K. H. Yeom, S. Lee, S. H. Baek, and V. N. Kim. 2004. MicroRNA genes are transcribed by RNA polymerase II. EMBO J 23 (20):4051-60.

Li, L., J. Xu, D. Yang, X. Tan, and H. Wang. 2010. Computational approaches for microRNA studies: a review. Mamm Genome 21 (1-2):1-12.

Liu, W., M. M. Ip, M. B. Podgorsak, and G. M. Das. 2009. Disruption of estrogen receptor alpha-p53 interaction in breast tumors: a novel mechanism underlying the anti-tumor effect of radiation therapy. Breast Cancer Res Treat 115 (1):43-50.

Liu, W., S. D. Konduri, S. Bansal, B. K. Nayak, S. A. Rajasekaran, S. M. Karuppayil, A. K. Rajasekaran, and G. M. Das. 2006. Estrogen receptor-alpha binds p53 tumor suppressor protein directly and represses its function. J Biol Chem 281 (15):9837-40.

Lowery, A. J., N. Miller, A. Devaney, R. E. McNeill, P. A. Davoren, C. Lemetre, V. Benes, S. Schmidt, J. Blake, G. Ball, and M. J. Kerin. 2009. MicroRNA signatures predict oestrogen receptor, progesterone receptor and HER2/neu receptor status in breast cancer. Breast Cancer Res 11 (3):R27.

Mattie, M. D., C. C. Benz, J. Bowers, K. Sensinger, L. Wong, G. K. Scott, V. Fedele, D. Ginzinger, R. Getts, and C. Haqq. 2006. Optimized high-throughput microRNA expression profiling provides novel biomarker assessment of clinical prostate and breast cancer biopsies. Mol Cancer 5:24.

Matys, V., O. V. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. E. Kel, and E. Wingender. 2006. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34 (Database issue):D108-10.

Meng, F., R. Henson, H. Wehbe-Janek, K. Ghoshal, S. T. Jacob, and T. Patel. 2007. MicroRNA-21 regulates expression of the PTEN tumor suppressor gene in human hepatocellular cancer. Gastroenterology 133 (2):647-58.

Michael, M. Z., S. M. O'Connor, N. G. van Holst Pellekaan, G. P. Young, and R. J. James. 2003. Reduced accumulation of specific microRNAs in colorectal neoplasia. Mol Cancer Res 1 (12):882-91.

Miller, T. E., K. Ghoshal, B. Ramaswamy, S. Roy, J. Datta, C. L. Shapiro, S. Jacob, and S. Majumder. 2008. MicroRNA-221/222 confers tamoxifen resistance in breast cancer by targeting p27Kip1. J Biol Chem 283 (44):29897-903.

Nigro, J. M., S. J. Baker, A. C. Preisinger, J. M. Jessup, R. Hostetter, K. Cleary, S. H. Bigner, N. Davidson, S. Baylin, P. Devilee, et al. 1989. Mutations in the p53 gene occur in diverse human tumour types. Nature 342 (6250):705-8.

Rhead, B., D. Karolchik, R. M. Kuhn, A. S. Hinrichs, A. S. Zweig, P. A. Fujita, M. Diekhans, K. E. Smith, K. R. Rosenbloom, B. J. Raney, A. Pohl, M. Pheasant, L. R. Meyer, K. Learned, F. Hsu, J. Hillman-Jackson, R. A. Harte, B. Giardine, T. R. Dreszer, H. Clawson, G. P. Barber, D. Haussler, and W. J. Kent. 2010. The UCSC Genome Browser database: update 2010. Nucleic Acids Res 38 (Database issue):D613-9.

Schanen, B. C., and X. Li. 2010. Transcriptional regulation of mammalian miRNA genes. Genomics 97 (1):1-6.

Schmid, C. D., R. Perier, V. Praz, and P. Bucher. 2006. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res 34 (Database issue):D82-5.

Sevignani, C., G. A. Calin, S. C. Nnadi, M. Shimizu, R. V. Davuluri, T. Hyslop, P. Demant, C. M. Croce, and L. D. Siracusa. 2007. MicroRNA genes are frequently located near mouse cancer susceptibility loci. Proc Natl Acad Sci U S A 104 (19):8017-22.

Volinia, S., G. A. Calin, C. G. Liu, S. Ambs, A. Cimmino, F. Petrocca, R. Visone, M. Iorio, C. Roldo, M. Ferracin, R. L. Prueitt, N. Yanaihara, G. Lanza, A. Scarpa, A. Vecchione, M. Negrini, C. C. Harris, and C. M. Croce. 2006. A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci U S A 103 (7):2257-61.

Wagner, E. F. 2001. AP-1--Introductory remarks. Oncogene 20 (19):2334-5.

Wang, X., S. Tang, S. Y. Le, R. Lu, J. S. Rader, C. Meyers, and Z. M. Zheng. 2008. Aberrant expression of oncogenic and tumor-suppressive microRNAs in cervical cancer is required for cancer cell growth. PLoS One 3 (7):e2557.

Wickramasinghe, N. S., T. T. Manavalan, S. M. Dougherty, K. A. Riggs, Y. Li, and C. M. Klinge. 2009. Estradiol downregulates miR-21 expression and increases miR-21 target gene expression in MCF-7 breast cancer cells. Nucleic Acids Res 37 (8):2584-95.

Xu, J., X. Liao, N. Lu, W. Liu, and C. W. Wong. 2011. Chromatin-modifying drugs induce miRNA-153 expression to suppress Irs-2 in glioblastoma cell lines. Int J Cancer.

Xu, J., X. Liao, and C. Wong. 2010. Downregulations of B-cell lymphoma 2 and myeloid cell leukemia sequence 1 by microRNA 153 induce apoptosis in a glioblastoma cell line DBTRG-05MG. Int J Cancer 126 (4):1029-35.

Zaman, M. S., Y. Chen, G. Deng, V. Shahryari, S. O. Suh, S. Saini, S. Majid, J. Liu, G. Khatri, Y. Tanaka, and R. Dahiya. 2010. The functional significance of microRNA-145 in prostate cancer. Br J Cancer 103 (2):256-64.

## **Part 8**

## **Gene Expression and Systems Biology**


**26** 


### **Quantification of Gene Expression Based on Microarray Experiment**

Samane F. Farsani and Mahmood A. Mahdavi *Department of Chemical Engineering, Ferdowsi University of Mashhad, Azadi Square, Pardis Campus, Mashhad, Iran* 

#### **1. Introduction**

Gene expression is a process common to all forms of life, including eukaryotes, prokaryotes and viruses, that generates the macromolecules required for life. The study of gene expression provides a systemic understanding of cell function for addressing specific biological questions. The process comprises replication, transcription, RNA splicing, translation and post-translational modification of a protein. First, DNA serves as a template to replicate itself, and the production of RNA (transcription), a copy of the DNA, is mediated by RNA polymerase. In prokaryotes, transcription creates messenger RNA (mRNA) that needs no additional processing before translation, whereas in eukaryotes this stage produces a primary RNA transcript that needs further processing before it becomes a mature mRNA. This step, referred to as RNA splicing, involves the removal of certain sequences called intervening sequences, or introns. Hence, the final mRNA contains the remaining sequences, called exons, which are spliced together (Knapp et al., 1978). In the next stage, called translation, the mRNA separates from the DNA strand and serves as a template for protein production, a process assisted by ribosomes. Proteins are modified after translation in a variety of processes, i.e. they are altered at the structural level to achieve their final 3D conformation. These modifications are essential for all aspects of biology and can occur spontaneously or be mediated by enzymes. Common post-translational modifications include phosphorylation, glycosylation, and dimerization or tetramerization (Doyle & Mamula, 2001). Therefore, the transfer of genetic information from DNA to RNA and to proteins, ending with the expression of genes in all cells, makes up the central dogma of molecular biology (Figure 1) (Crick, 1970).
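The flow from DNA to protein described above can be made concrete with a toy Python sketch. It is only an illustration under simplifying assumptions: the input is the coding strand, exon boundaries are supplied by hand rather than detected, and the codon table is deliberately partial. The sequence, the exon coordinates and the function names are hypothetical.

```python
# Deliberately partial codon table: only the codons used in the toy example.
CODON_TABLE = {"AUG": "M", "GCU": "A", "AAA": "K", "UGG": "W", "UAA": "*"}


def transcribe(coding_dna):
    """Transcription: copy the coding strand into RNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Splicing: keep the exon intervals (half-open) and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translation: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene: exon 1 (ATGGCT), a 10-nt intron (GTAAGTTTAG), exon 2 (AAATGGTAA).
gene = "ATGGCT" + "GTAAGTTTAG" + "AAATGGTAA"
exons = [(0, 6), (16, 25)]  # hand-specified, hypothetical coordinates

pre_mrna = transcribe(gene)
mature_mrna = splice(pre_mrna, exons)
protein = translate(mature_mrna)
print(mature_mrna, protein)  # AUGGCUAAAUGGUAA MAKW
```

In a prokaryotic setting the splicing step would simply be skipped, since the transcript is translated without further processing.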

Genomic information is delivered to the cell through three biochemical datasets: the complete set of mRNA species that give rise to proteins (transcriptomics), the complete collection of proteins (proteomics), and the complete series of metabolites produced in the cell (metabolomics) (Figure 2) (Karakach et al., 2010; van der Werf et al., 2005).

Fig. 1. Central dogma of molecular biology.

Transcriptomics provides a complete profile of the RNAs present within cells, tissues and biological fluids at a specific time. mRNA levels vary over time, among diverse cell types and within cells under different conditions, while DNA remains more or less unchanged over the life cycle. Thus, gene expression based on mRNA mediates cellular function and specifies which genes are turned on or off in different cell states. As the transcriptome represents a small percentage of the genome and is much more complex, the information carried in the transcriptome has no substantial direct relation to the information from the genome (Frith et al., 2005; Tsiridis & Giannoudis, 2006). Proteomics helps to comprehensively characterize the quantity, structure and activity of the entire complement of expressed proteins (the proteome) on a large scale within a cell or tissue at a particular time. In addition, this approach enables the study of protein-protein interactions and a detailed understanding of the complex responses of a living system to stimuli (Beranova-Giorgianni, 2003; Hirsch et al., 2004). However, the genome is relatively static, while the dynamic proteome changes constantly in response to environmental signals. This is due to many factors, including different amino acid sequences, alternative splicing of mRNAs and post-translational protein modifications, which often give rise to more than one protein per gene. Proteomics therefore produces large, high-dimensional datasets that require powerful tools to handle and analyze the data effectively (Hegde et al., 2003; Tsiridis & Giannoudis, 2006).

Fig. 2. Biochemical levels of information in gene expression study.

Metabolomics is the study of the entire set of metabolites, i.e. low-molecular-weight organic compounds, in the cell (the metabolome), and assists the inference of biological function (Schaub et al., 2009; van der Werf et al., 2005). It involves the large-scale analysis of changes in metabolites in response to environmental or cellular changes. Metabolomics aims at quantifying every single metabolite and goes one step further than metabolic profiling, which only provides an inventory of the metabolites present in the cell (van der Werf et al., 2005). The transcriptome, proteome and metabolome can change considerably depending on environmental conditions and directly represent the status of cellular physiology. Hence, these sources are highly beneficial in understanding biological performance. Although omics technologies have been advancing over the years, they still have some drawbacks. Proteomics and metabolomics offer holistic and complementary insights into cells, because transcriptomics cannot always reflect the corresponding protein or metabolite profiles. They are, however, limited by a lack of standardized methodologies and by poor reproducibility (Pinet, 2009). This is partly due to the heterogeneous characteristics of the compounds identified. In proteomic analysis, the wide range of proteins makes it difficult to design standard protocols for the identification of compounds. Likewise, metabolomics suffers from the diverse chemical properties of different metabolites (Karakach et al., 2010). In addition, scientists are not well trained to cope with the large volumes of data, and the availability of commercial metabolites is limited. Despite these limitations, going from one biochemical level to the next, information is acquired or lost through regulatory events, such as post-transcriptional and post-translational modifications, that occur between these levels. Metabolomics, however, is valuable as it is the closest to the function of a cell, i.e. the phenotype (Tsiridis & Giannoudis, 2006; van der Werf et al., 2005; Zhang et al., 2010).

Compared to proteomics and metabolomics, transcriptomics is a more robust, large-scale, moderate-cost technology for simultaneously measuring thousands of mRNA levels. However, most transcriptomic analysis platforms are not routinely set up to systematically detect changes in spliced species, even though nearly 50% of human genes may undergo alternative splicing (Hegde et al., 2003). In some cases mRNA levels are a reasonable proxy for protein abundance, allowing one to make a rational inference about the level of protein expression from the level of mRNA expression. Caution is necessary, however, where protein expression is controlled post-transcriptionally by other factors. Since mRNA molecules are relatively more homogeneous than metabolites and proteins, and capture methods based on complementary DNA have been developed, the field of transcriptomics has been more closely associated with gene expression studies using microarray technology (Karakach et al., 2010).

In conclusion, the omics sciences play an important role in understanding cells from different perspectives and in gaining knowledge about the cellular pathways, mechanisms and functions that eventually make up an expression cycle. The transcriptome is more crucial in expression measurements, while the proteome and the metabolome together assist in determining the functionality of expressed genes (van der Werf et al., 2005). Thus, effective integration of omics datasets provides a broader view of systematic changes in expression levels. However, this integration remains one of the challenges of systems biology and functional genomics.

#### **2. Methods for quantifying mRNA level**

The composition of, and differences between, transcriptomes are determined through mRNA level measurements. Several methods can quantify mRNA levels, including northern blotting, reverse transcriptase polymerase chain reaction (RT-PCR) and DNA microarrays. These techniques are briefly discussed below.

Northern blotting is a standard method for studying the expression profile of specific genes at the mRNA level. It can detect alternatively spliced transcripts and determine transcript size. In northern blot analysis, mRNAs (the targets) are extracted from the sample and separated by size using gel electrophoresis. The targets are then transferred from the agarose gel to a solid support, where they hybridize with radio-labeled probes. A probe is a sequence complementary to all or part of the mRNA of interest. If the probe is complementary to an mRNA, it will bind at the location of that mRNA (Trayhuru, 1996). The degree of radioactivity gives an indication of the expression level of the gene of interest. The method is semi-quantitative because the amount of radioactivity depends to some extent on the amount of probe, which in turn depends on the amount of mRNA in the sample (Perdew et al., 2006; Trayhuru, 1996). Northern blotting is an appropriate assay especially for laboratories that lack specialized equipment and expertise in molecular biology (Trayhuru, 1996). A common pitfall is sample degradation by RNases, which can be overcome by proper sterilization of glassware and reagents and by the use of RNase inhibitors. The chemicals used can also pose a risk to the researcher.

Polymerase chain reaction (PCR) is an enzymatic assay that produces large amounts of a specific DNA sequence, even from a small and complex mixture. Reverse transcriptase (RT)-PCR is a rapid and flexible approach for mRNA detection and quantification. In this method, the mRNA must first be converted to a double-stranded molecule by the enzyme reverse transcriptase (Perdew et al., 2006). Because small variations in amplification efficiency between samples can result in significant differences in product yield, quantification of mRNA by RT-PCR is difficult. Modified methods have therefore been developed, such as quantitative competitive (QC)-PCR, relative RT-PCR and real-time RT-PCR.

QC-PCR measures the absolute level of a particular mRNA sequence in a biological sample. It relies on dilutions of a synthetic RNA called a competitor, which competes with the target cDNA during co-amplification. Because the competitor differs in size from the target, the two PCR products can be separated by gel electrophoresis. Although this method gives accurate results, designing and constructing a competitor for each gene is technically complicated, and validating the results is labor intensive (Breljak et al., 2005). Relative, or semi-quantitative, RT-PCR measures the mRNA level by co-amplifying an internal control with the gene of interest. Results are reported as ratios of the gene-specific signal to the internal control signal. Although this method requires only common laboratory equipment, it suffers from a poor dynamic range of quantification and is time consuming and labor intensive (Lipshutz et al., 1999).

Real-time PCR, a newer approach, combines the best features of relative and competitive PCR. It is much faster, higher in throughput and less labor intensive than the other quantitative PCR methods, and it combines amplification and detection in one step. Unlike other quantitative PCR methods, real-time PCR requires neither post-PCR processing such as electrophoresis nor measures to prevent carryover contamination by PCR products. The approach relies on dual-labeled fluorogenic probes: the amount of fluorescence emitted is directly proportional to the amount of product produced in each PCR cycle (Breljak et al., 2005; Heid et al., 1996). Despite the outstanding advances in real-time RT-PCR, competitive and semi-quantitative RT-PCR may still be used for relative mRNA quantification, especially for small numbers of samples (Breljak et al., 2005).

Overall, RT-PCR is highly sensitive and rapid, with a large dynamic range of quantification. However, it requires specialized, expensive equipment and reagents, which may be restrictive for some researchers (Perdew et al., 2006; Trayhuru, 1996). Because undesirable primer–primer interactions may occur, RT-PCR is limited in the number of genes that can be analyzed at one time. Sources of variation such as template concentration and amplification efficiency also make quantification by RT-PCR difficult (Trayhuru, 1996).
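
The ratio reporting used in relative RT-PCR can be illustrated with a minimal sketch. The band intensities, the sample names and the function name `relative_expression` are hypothetical and serve only to show the arithmetic; they do not come from any cited study.

```python
def relative_expression(gene_signal: float, control_signal: float) -> float:
    """Return the gene-specific signal normalized to the internal control."""
    if control_signal <= 0:
        raise ValueError("internal control signal must be positive")
    return gene_signal / control_signal


# Hypothetical band intensities (arbitrary units) for two samples.
samples = {
    "untreated": {"gene": 1200.0, "control": 2400.0},
    "treated": {"gene": 3000.0, "control": 2000.0},
}

# One ratio per sample: gene-specific signal / internal control signal.
ratios = {
    name: relative_expression(s["gene"], s["control"])
    for name, s in samples.items()
}

# Fold change of the treated sample relative to the untreated sample.
fold_change = ratios["treated"] / ratios["untreated"]
print(ratios, fold_change)
```

Normalizing to the co-amplified control is what makes the two samples comparable even when their overall amplification yields differ.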

The microarray experiment is a more recent technique that determines the expression levels of thousands of genes simultaneously. It can be regarded as massively parallel northern blotting. A DNA microarray gives a holistic picture of gene expression within the cell or sample under different environmental conditions at a specific time (Tarca et al., 2006). In practice, this high-throughput method uses an inert surface carrying a defined number of spots. Each spot contains a single species of nucleic acid (the probe) representing a gene of interest. Hybridization between the labeled biological sample (the target) and the probes creates a signal that represents the expression level of a gene in that sample. Microarrays have become important because they are easy to use and do not require large-scale DNA sequencing. However, microarray studies are still limited by the lack of universally accepted standards for data collection, analysis and validation (Bilban et al., 2002; Russo et al., 2003). Microarray results are usually consistent with those from northern blotting and PCR, although those approaches can measure low levels of gene expression that microarrays cannot. The main advantage of microarrays is that they visualize thousands of genes at a time, whereas the other methods usually quantify one or a small number of genes (Bilban et al., 2002; Trayhuru, 1996).
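
The probe-target hybridization described above rests on complementary base pairing, which the following sketch idealizes as an exact string match. The probe names and sequences are invented for illustration, and real hybridization tolerates some mismatches that this toy model ignores.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def hybridizes(probe: str, target: str) -> bool:
    """Idealized test: the target contains the exact reverse complement of the probe."""
    return reverse_complement(probe) in target.upper()


# Hypothetical probes, one per spot on the array.
probes = {"gene_A": "ATGGCGTAC", "gene_B": "TTTCCCGGG"}

# Hypothetical labeled target from the biological sample.
target = "CCGTACGCCATGG"

# A spot gives a signal only if its probe finds a complementary stretch.
signals = {name: hybridizes(seq, target) for name, seq in probes.items()}
print(signals)
```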

Some features of the methods described above are summarized in Table 1. Considering the advantages and limitations of each technique, although all of the methods can measure mRNA levels, they differ in their particular attributes.

Table 1. Features of conventional techniques for quantifying mRNA level.

Therefore, a method is selected according to the characteristics required by the experimental design. It should be noted that although traditional techniques of gene expression analysis provide valuable biological insights into living cells, they are limited in scale, economy and sensitivity. Compared with the other commonly used techniques, microarray-based quantification therefore stands out for its high throughput and cost effectiveness.

#### **3. Microarray technology**

Microarray technology has become one of the most commonly used high-throughput techniques for addressing a wide variety of biological questions. It enables the simultaneous analysis of thousands of parameters in a single experiment. This miniaturized binding technology is typically divided into DNA, protein, tissue, cellular and chemical compound microarrays (Templin et al., 2002). Protein and tissue arrays are described below, with special emphasis on DNA arrays.

Protein microarrays help characterize thousands of proteins in a parallel format. Proteome chips give researchers a way to address the true level of gene function by studying pair-wise interactions such as protein-protein, protein-DNA, protein-lipid, protein-drug, protein-receptor and antigen-antibody (Hall et al., 2007). In this technique, probes such as aptamers, engineered antibody fragments, affibodies, full-length proteins or protein domains are spotted on a microscope slide. The array is then probed with a target solution, and binding is detected using analytical approaches. The antibody microarray is the most powerful type of protein microarray. Figure 3 shows the steps of an antibody microarray experiment in detail (Angenendt, 2005).

Tissue microarray (TMA) technology was developed to evaluate differences in molecular targets (at the DNA, RNA or protein level) in several thousand tissue samples at the same time (Kononen et al., 1998; Singh & Sau, 2010). A TMA is constructed from paraffin-embedded material, frozen tissue, paraffin-embedded cell lines or cell blocks (Parsons & Grabsch, 2009). In brief, a TMA is made of tissue core samples taken with a precision punching instrument from donor paraffin blocks. These cores are arrayed into an empty recipient block, the TMA block (Figure 4). The TMA block is then sectioned with a microtome. The sections are placed on microscope slides and analyzed by any standard histological procedure. Approximately 200–300 sections of 5 μm can be cut from one TMA block and used in independent tests (Parsons & Grabsch, 2009).

Fig. 3. Schematic diagram of an antibody microarray technology.

The DNA microarray is the most popular type of microarray technology; it relies on nucleic acid–nucleic acid interactions. It allows the amount of mRNA transcripts to be measured for thousands of genes in different combinations of samples, derived from normal and diseased or treated and non-treated tissues, time courses of treated cells, and stages of cell differentiation or development (Karakach et al., 2010). DNA microarrays have proved extremely valuable in expression profiling, sequence identification and the location of transcription factor binding sites (Hall et al., 2007).

### **4. The DNA microarray experiment**

DNA microarrays are currently manufactured using two main techniques: in situ synthesis and deposition of pre-synthesized probes (spotted arrays). Various platforms, or types, of DNA microarrays are commercially available. Figure 5 summarizes some of these platforms according to their fabrication methods. The two most commonly used microarrays are the Affymetrix oligonucleotide chips (Lockhart et al., 1996) and spotted cDNA arrays (Schena et al., 1995). The experimental steps and construction process of these arrays are discussed in this section.

Fig. 4. **a.** Tissue arrayer instrument, **b.** Extraction of the donor core, **c.** Insertion into recipient block (Gulmann & O'Grady, 2003).

### **4.1 Affymetrix Gene Chips**

#### **4.1.1 Fabrication of Affymetrix Gene Chip**

The in situ synthesis of oligonucleotides (Affymetrix Gene Chip) is achieved using a photolithographic method (Fodor et al., 1991). This approach adds adenine (A), cytosine (C), guanine (G) and thymine (T) nucleotides step by step through a set of designed masks. In the fabrication process, a solid substrate, usually a quartz wafer, is washed to provide uniform hydroxylation of the surface and is then placed in a silane bath. Silane molecules react directly with the hydroxyl groups of the quartz, forming the starting points for the synthesis of new oligonucleotide strands. In the following steps, synthetic linkers are attached to the silanes and coated with a light-sensitive protecting group (Figure 6). The first mask is placed over the surface, which is then exposed to the light source. Masks selectively direct light toward specific areas of the substrate, and the linker molecules are activated at the unprotected positions. Next, the first of a series of nucleotides, each linked to the light-sensitive agent, is incubated on the surface, so that the nucleotides are chemically coupled to the activated sites. The photolabile agents block further nucleotide binding until light subsequently activates them through a new mask. This chemical cycle is repeated until several hundred thousand oligonucleotides (probes) with the desired lengths and sequences have been synthesized at their respective sites on the surface of the chip (Lipshutz et al., 1999).
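
The mask-directed cycle described above can be mimicked with a small simulation. This is a conceptual sketch only: the function name `synthesize`, the three-site array and the mask patterns are hypothetical, and the chemistry (linkers, protecting groups, coupling efficiency) is reduced to appending a letter at each exposed site.

```python
from typing import List, Set, Tuple


def synthesize(n_sites: int, steps: List[Tuple[str, Set[int]]]) -> List[str]:
    """Grow one oligonucleotide per site through a series of masked steps.

    Each step is (nucleotide, exposed_sites): light passing through the mask
    deprotects only the exposed sites, and the incubated nucleotide then
    couples at those sites alone. All other sites stay protected.
    """
    probes = [""] * n_sites
    for nucleotide, exposed_sites in steps:
        for site in exposed_sites:
            probes[site] += nucleotide
    return probes


# Hypothetical masks for a three-site array: one mask per nucleotide step.
steps = [
    ("A", {0, 2}),
    ("C", {1, 2}),
    ("G", {0, 1}),
    ("T", {0, 2}),
]

probes = synthesize(3, steps)
print(probes)  # → ['AGT', 'CG', 'ACT']
```

Because each mask exposes a different subset of sites, the same sequence of chemical steps yields a different oligonucleotide at each site.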

Fig. 5. Different microarray platforms and their fabrication methods.

Fig. 6. **a.** Schematic overview on photolithographic fabrication of Gene Chip. **b.** Drawing of the lamp, mask and chip (Lipshutz et al., 1999).

#### **4.1.2 Experiment of Affymetrix Gene Chip**

Gene Chip arrays use various approaches based on uniqueness and composition design rules to select the 25-nucleotide (25-mer) probes (Lipshutz et al., 1999). They use the perfect match/mismatch probe strategy (Figure 7). Each gene sequence, or expressed sequence tag (EST), is typically represented by 12–20 different probe pairs. The collection of probes for each gene is referred to as a probeset. Each pair consists of a perfect match (PM) oligonucleotide and a mismatch (MM) oligonucleotide. The PM probe is perfectly complementary to the gene sequence of interest (Barrett & Kawasaki, 2003; Lipshutz et al., 1999), while the MM probe has a one-base mismatch at the central position (the 13th base). The MM probe serves as an internal control to estimate the signal from any non-specific hybridization or contaminating fluorescence in the measurement (Lipshutz et al., 1999; Tarca et al., 2006). The probesets are built on the array by in situ synthesis, after which the microarray is ready for the experiment.
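
The PM/MM pairing can be sketched as follows. Two points are assumptions rather than statements from the text: the mismatch is produced here by substituting the complementary base at position 13, and the averaged PM minus MM difference is only a simple illustration of using MM as a background estimate, not the vendor's signal algorithm. The sequence and intensities are invented.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm: str) -> str:
    """Build an MM probe from a 25-mer PM probe by changing the 13th base.

    Assumption: the central base is swapped for its complement.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-mer probe")
    middle = 12  # zero-based index of the 13th base
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


# Hypothetical 25-mer perfect match probe.
pm_probe = "ACGTACGTACGTACGTACGTACGTA"
mm_probe = mismatch_probe(pm_probe)

# Hypothetical intensities for three probe pairs of one probeset.
pm_intensities = [820.0, 640.0, 910.0]
mm_intensities = [200.0, 150.0, 310.0]

# Background-corrected signal: mean of the PM - MM differences.
differences = [pm - mm for pm, mm in zip(pm_intensities, mm_intensities)]
average_difference = sum(differences) / len(differences)
print(mm_probe, average_difference)
```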

The basic steps in this single-dye experiment are as follows. Total RNA (or mRNA) is extracted from the biological sample, called the target. The total RNA is then reverse transcribed to generate double-stranded cDNA. Biotin-labeled cRNA is produced from the cDNA by *in vitro* transcription, fragmented into smaller segments and hybridized to the array. After a series of washes to remove non-hybridized material, the array is incubated with appropriate fluorescent dyes that attach to the biotin on the cRNAs. The array is placed in a scanner and the emission of the fluorescent staining agent is quantified (Schadt et al., 2001). The measured fluorescence intensity provides an estimate of the mRNA level of each gene of interest on the chip.

Table 2. A comparison between cDNA and oligonucleotide arrays.

Fig. 7. Design of Affymetrix Gene Chip technology.

#### **4.2 Spotted cDNA array**

In spotted array technology, the probe sequences are synthesized separately from the array. The probes correspond to specific genes, expressed sequence tags (i.e. stable cDNA fragments), or cDNAs from libraries of interest (Bilban et al., 2002). If the quantity of available probe is limiting, PCR amplification is performed to produce sufficient material. The PCR products are then analyzed by gel electrophoresis, quantified and finally spotted onto the microarray surface by a robotic printer. Probes are immobilized at fixed locations on the slides electrostatically, by cross-linking with heat or ultraviolet irradiation, or via amines or other active groups on modified slides (Barrett & Kawasaki, 2003). The location of each spot on the array can therefore help researchers identify a desired gene sequence.

Since the cDNA probes are double stranded, the array is heated (or alkali treated) to separate the strands so that each probe can hybridize to its complementary target. The two-color approach uses two samples, a test sample and a reference sample. To prepare the targets, cDNAs are synthesized by reverse transcription of the mRNAs in the samples. Targets can be labeled by a variety of methods; the most common uses a red and a green fluorescent dye, Cy5 and Cy3, respectively. The labeled targets are combined and deposited on the array. If a gene is present in one or both samples, its target binds to the complementary probe through the base pairing of nucleic acids. After the array has been washed to remove non-hybridized targets, a laser scanner quantifies the emission from the Cy3 and Cy5 dyes (Figure 8). Assuming the reference sample is labeled with Cy3 and the test sample with Cy5, a green spot indicates that the corresponding gene is more strongly expressed in the reference sample than in the test sample, while a red spot shows the opposite. A yellow spot reveals a gene that is expressed at similar levels in both samples, while a black spot shows that the gene is not expressed in either sample. The fluorescent intensity of a spot provides an estimate of the mRNA abundance for the given condition and cell type. Details of each experimental step have been reviewed elsewhere (Bilban et al., 2002; Karakach et al., 2010).

Fig. 8. **a.** A spotted cDNA microarray experiment consists of the following steps: preparation of target genes, labeling of the targets, hybridization, and scanning. **b.** Scanned image of a cDNA microarray (Karakach et al., 2010).

The characteristics of the two microarray platforms discussed are summarized in Table 2, which gives a comparative view of cDNA and Affymetrix oligonucleotide microarrays. The choice of platform depends on the biological question, which determines the aims of the experiment. Because of limited space, the remainder of the chapter focuses mainly on the spotted cDNA microarray, although some of the discussion can be generalized to other platforms such as the Affymetrix oligonucleotide array.
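As a rough illustration of how the spot colours described in this section relate to the two channel intensities, the following Python sketch classifies a spot from its Cy5 and Cy3 readings. The function name `spot_colour`, the detection floor of 200 intensity units, and the two-fold cut-off are illustrative assumptions rather than values from the cited sources. The sketch also assumes that the test sample is labeled with Cy5 and the reference sample with Cy3.

```python
import math


def spot_colour(cy5, cy3, floor=200.0, fold=2.0):
    """Classify a spot from its red (Cy5) and green (Cy3) intensities.

    Assumptions (hypothetical, for illustration only):
    - test sample labeled with Cy5, reference sample with Cy3
    - a channel below `floor` counts as "no signal"
    - a `fold`-fold difference separates red/green from yellow
    """
    if cy5 < floor and cy3 < floor:
        return "black"  # gene not detected in either sample

    # Clamp to 1 so the ratio is defined when one channel reads zero.
    log_ratio = math.log2(max(cy5, 1.0) / max(cy3, 1.0))
    limit = math.log2(fold)

    if log_ratio >= limit:
        return "red"    # higher in the test sample
    if log_ratio <= -limit:
        return "green"  # higher in the reference sample
    return "yellow"     # similar level in both samples
```

In practice the cut-offs would be chosen from the data after background correction and normalization, which are discussed later in the chapter.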





#### **5. Image processing**

In a microarray experiment, as mentioned earlier, hybridized slides are placed in a scanner to produce fluorescent images arranged as a matrix of spots. The next step is to process these images in order to quantify the level of gene expression from the intensity of each spot and to obtain background estimates and quality measures (Istepanian, 2003; Yang et al., 2002a). The accuracy of this phase has a considerable effect on downstream analyses such as clustering, classification, or the identification of differentially expressed genes (Yang et al., 2001). Generally, laser scanning confocal microscopy is used to acquire the fluorescent signals emitted by the fluorescently labelled targets on the array. Scanners detect and record the signals using photomultiplier tubes (PMT) or charge-coupled device (CCD) cameras (Figure 9). The signals are stored as two 16-bit TIFF (tagged image file format) images for further analysis (Karakach et al., 2010), one for each fluorescent dye, typically Cy3 and Cy5. Most software packages create a composite image by overlaying the two single-channel images in order to visualize the expression status of the genes.

Fig. 9. Photons emitted from a fluorescent sample upon excitation enter a PMT, resulting in the release of electrons. Analogue signals from the PMT are converted into digital signals by an analog-to-digital (A/D) converter (more details in Schena, 2003).

Image processing can be divided into the following steps: gridding, segmentation, quantification, and spot quality assessment (Istepanian, 2003; Yang et al., 2001). In recent years, a number of commercial and free software packages have been developed, each of which implements these steps in its own way. The steps are discussed in more detail in the following sections.

#### **5.1 Image gridding**

The basic layout of a microarray image is determined by the robotic printing device (arrayer) and is therefore known in advance (Figure 10). The arrayer carries a series of pins (print tips). The pins pick up reagents and deposit them on the array. Hence, the spots on the array are organized into several print-tip groups (also referred to as sub-arrays or grids), each composed of the spots printed by one pin (Karakach et al., 2010). Gridding (addressing) is the process of locating the spots on the image. It is carried out using a simple model based on the layout of the scanned image. To enhance reliability, manual intervention is combined with automatic procedures (semi-automatic gridding) (Yang et al., 2001): first the user manually specifies the positions of spots on the image, and then a suitable grid pattern is derived automatically from the indicated positions (Gjerstad et al., 2009; Yang et al., 2002a). However, manual intervention can make the process very time consuming and may introduce user bias and a loss of consistency.
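The semi-automatic idea can be sketched in a few lines of Python. The example below is a simplified illustration, not the procedure of any particular package. It assumes a regular, unrotated grid in which the user has marked the centres of the top-left and bottom-right spots of one print-tip group. The function name `fit_grid` and its parameters are hypothetical.

```python
def fit_grid(top_left, bottom_right, n_rows, n_cols):
    """Interpolate spot centres for one print-tip group.

    top_left, bottom_right: (x, y) pixel coordinates of two corner spots
    marked by the user (assumption: the grid is regular and not rotated).
    Returns a list of rows, each a list of (x, y) spot centres.
    """
    (x0, y0), (x1, y1) = top_left, bottom_right
    # Spacing between neighbouring spot centres along each axis.
    dx = (x1 - x0) / (n_cols - 1) if n_cols > 1 else 0.0
    dy = (y1 - y0) / (n_rows - 1) if n_rows > 1 else 0.0
    return [[(x0 + c * dx, y0 + r * dy) for c in range(n_cols)]
            for r in range(n_rows)]


# Example: a 4 x 6 group whose corner spots were marked at (10, 20) and (60, 50).
grid = fit_grid((10, 20), (60, 50), n_rows=4, n_cols=6)
```

Real gridding software must additionally cope with rotation, uneven spacing, and missing spots, which is why manual adjustment is still commonly offered.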

#### **5.2 Image segmentation**

Grid spots are partitioned into foreground (within the printed spot) and background regions through a process referred to as segmentation. Foreground pixels represent the true signal, while pixels in the background area correspond to signal that is not due to hybridization of target molecules (noise or artifacts) (Yang et al., 2001; Yang et al., 2002a). The most common segmentation methods are classified according to whether they place restrictions on the spot geometry. Fixed circle and adaptive circle segmentation assume circular spots, while the histogram and adaptive shape approaches place no restriction on spot shape when estimating the spot masks. Each segmentation method generates a spot mask, which consists of the set of foreground pixels for each spot (Karakach et al., 2010; Yang et al., 2001). The simplest method is fixed circle segmentation, which assigns a circle of constant diameter to all spots. It treats the pixels within the circle as true signal and the pixels outside the circle as background. Adaptive circle segmentation estimates the circle diameter separately for each spot (Yang et al., 2002a; Yang et al., 2001). Since this approach requires the user to adjust spot sizes, it can be time-consuming for an array with thousands of spots. Furthermore, the transition between foreground and background is hard to distinguish when the signal strength is low (Yang et al., 2002a). Although most spots are expected to be circular, in practice non-commercial arrayers rarely print perfectly circular spots, which results in poor estimates of the fluorescent intensities of hybridized targets. Thus, approaches known as "adaptive shape segmentation" methods have been developed, which try to find the best shape for each spot (Yang et al., 2001; Yang et al., 2002a). These methods are commonly based on the watershed transform (Beucher & Meyer, 1993) and the seeded region growing (SRG) algorithm (Adams & Bischof, 1994), which successfully detect segmented spots of different sizes and shapes (Karakach et al., 2010).

Fig. 10. Common structure of a cDNA microarray slide.
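Fixed circle segmentation, the simplest of the methods described above, can be illustrated with a short Python sketch. It works on a small square patch around one spot, represented as a list of pixel rows. The function names, the patch representation, and the choice of centre and radius are assumptions made for the example.

```python
def fixed_circle_mask(patch_size, centre, radius):
    """Return a boolean mask for a square patch: True marks foreground
    pixels, i.e. pixels whose centre lies within `radius` of `centre`."""
    cx, cy = centre
    return [[(x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
             for x in range(patch_size)]
            for y in range(patch_size)]


def split_pixels(patch, mask):
    """Split the pixel values of a patch into foreground and background
    lists according to the spot mask."""
    foreground, background = [], []
    for pixel_row, mask_row in zip(patch, mask):
        for value, inside in zip(pixel_row, mask_row):
            (foreground if inside else background).append(value)
    return foreground, background


# Example: 5 x 5 patch, circle of radius 1 centred on pixel (2, 2).
mask = fixed_circle_mask(5, centre=(2, 2), radius=1)
patch = [[900 if inside else 100 for inside in row] for row in mask]
fg, bg = split_pixels(patch, mask)
```

An adaptive circle method would differ only in estimating `radius` (and possibly `centre`) separately for each spot instead of using the same values throughout.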

The most widely used method for segmenting spots without restricting them to a particular shape is the histogram-based technique (Yang et al., 2001). It defines a target mask that is larger than any spot. The foreground and background intensities of each spot are then estimated, in various ways, from the histogram of the pixel values within this mask (Yang et al., 2002a). This technique quantifies the intensities directly and needs no separate spot quantification stage. The segmentation methods discussed here are implemented in most software packages for primary processing of microarray images (Table 3) (Yang et al., 2001).
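One possible way of deriving foreground and background estimates from the histogram of the pixels under the target mask is sketched below. The source only says that this is done "in various ways". The percentile bands used here (5th to 20th for background, 80th to 95th for foreground) and the function name `histogram_estimates` are assumptions chosen for illustration.

```python
def histogram_estimates(pixels, bg_band=(5, 20), fg_band=(80, 95)):
    """Estimate (foreground, background) for one spot as the mean of the
    pixel values falling inside two percentile bands of the sorted pixels.
    The band limits are illustrative defaults, not a standard."""
    if not pixels:
        raise ValueError("the target mask contains no pixels")
    ordered = sorted(pixels)
    n = len(ordered)

    def band_mean(low, high):
        start = n * low // 100
        stop = max(start + 1, n * high // 100)  # keep at least one pixel
        band = ordered[start:stop]
        return sum(band) / len(band)

    return band_mean(*fg_band), band_mean(*bg_band)


# Example with 100 pixel values 0..99 inside the target mask.
foreground, background = histogram_estimates(list(range(100)))
```

Because the estimates come straight from the ordered pixel values, no explicit spot boundary has to be drawn, which matches the remark above that no separate quantification stage is needed.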





#### **5.3 Image quantification**

After each spot has been located and its pixels classified, the red and green foreground intensities and the red and green background intensities must be computed for each spot on the array (Yang et al., 2002a).

#### **5.3.1 Foreground quantification**

The aim of spot quantification is to estimate a single quantitative measure that combines the pixel intensity values of a spot (Yang et al., 2001). Several statistics can be used for this purpose. The simple sum of pixel intensities is not a good statistic because it depends on the size of the spot, so values obtained from spots of different sizes cannot be compared directly. Most microarray imaging software estimates the foreground intensity as the mean or median of the pixel values within the segmented spot mask (Yang et al., 2002a). The median is more robust to outlier pixels and is therefore preferable to the mean. In addition, the interquartile range (IQR), i.e. the difference between the 75th and 25th percentiles, of the foreground pixels may be computed for each channel as an estimate of pixel variation.
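The difference between these statistics is easy to see on a small example. The sketch below uses Python's standard `statistics` module; the function name `summarize_foreground` and the pixel values are invented for illustration. One saturated outlier pixel is enough to pull the mean far from the typical pixel value, while the median stays put.

```python
import statistics


def summarize_foreground(pixels):
    """Return the sum, mean, median and IQR of the foreground pixel values."""
    q1, _, q3 = statistics.quantiles(pixels, n=4)  # quartile cut points
    return {
        "sum": sum(pixels),                 # grows with spot size
        "mean": statistics.mean(pixels),    # sensitive to outliers
        "median": statistics.median(pixels),  # robust to outliers
        "iqr": q3 - q1,                     # spread of the central pixels
    }


# Seven foreground pixels, one of which is an outlier (e.g. a dust speck).
spot = [98, 99, 100, 100, 101, 102, 5000]
summary = summarize_foreground(spot)
```

Here the mean is 800 while the median is 100, which is why the median is usually preferred as the foreground estimate.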

#### **5.3.2 Background quantification**

Background estimation is generally considered necessary in order to perform background correction (Yang et al., 2002a). Background estimation methods fall into four categories: local, morphological, constant, and no adjustment (Yang et al., 2001). In the first category, background intensities are computed from small regions around the spot mask. Different software packages use a variety of shapes for these regions, such as squares, diamonds (referred to as the valley), and circles of different diameters. The background is usually estimated as the median of the pixel values within these regions; however, the mean, standard deviation, and interquartile range of the pixels can also be calculated (Yang et al., 2001). In the second category there are two types of morphological filter. The first is a non-linear filter called morphological opening, which is obtained by applying a local minimum filter (erosion) followed by a local maximum filter (dilation) with the same window to each image (for more details, see Soli, 1999). The second is a closing followed by an opening, which also removes small dark regions and is intended to give a better estimate (Wang, 2007; Yang et al., 2001). Constant background is a global method that uses the mean or median intensity of the whole image background as a constant background for all spots. The fourth option is to apply no background correction (Yang et al., 2001).
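Morphological opening as described above (a local minimum filter followed by a local maximum filter with the same window) can be written in plain Python for a small image stored as a list of rows. This is a minimal sketch with hypothetical function names. It assumes a square window of side `2 * half + 1` that is simply truncated at the image border.

```python
def _window_filter(image, half, pick):
    """Apply `pick` (min or max) over a square window around each pixel.
    The window is truncated at the image border."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            out_row.append(pick(window))
        result.append(out_row)
    return result


def morphological_opening(image, half=2):
    """Erosion (local minimum) followed by dilation (local maximum)."""
    eroded = _window_filter(image, half, min)
    return _window_filter(eroded, half, max)


# 9 x 9 image: background level 10 with a bright 3 x 3 spot in the middle.
image = [[500 if 3 <= r <= 5 and 3 <= c <= 5 else 10 for c in range(9)]
         for r in range(9)]
background = morphological_opening(image, half=2)  # 5 x 5 window
```

With a 5 x 5 window, which is larger than the 3 x 3 spot, the opening removes the spot and leaves the background level of 10 at every pixel. With a 3 x 3 window (`half=1`) the spot would survive, so in this sketch the window has to be chosen larger than the spots for the result to serve as a background estimate.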

#### **5.4 Spot quality assessment**

The quality assessment step helps to diagnose quality problems or mistakes that occurred during microarray fabrication and the experiment. If this step reports no serious irregularities, the subsequent preprocessing steps can be performed.

| Segmentation method | Software implementing the method |
|---|---|
| Fixed circle | ScanAlyze, GenePix, QuantArray |
| Adaptive circle | QuantArray, GenePix, Dapple, Agilent Feature Extraction |
| Adaptive shape | Spot |
| Histogram | ImaGene, QuantArray, DeArray |

Table 3. Different segmentation methods in different image processing software.


After the foreground and background intensities have been calculated, quality measures are estimated to assess spot quality and reliability. These include the variability of pixel values within each spot, spot area, a circularity measure, the signal intensity relative to background (signal-to-noise ratio), and a flag (Yang et al., 2001; Yang et al., 2002a). The quality measures can be interpreted as follows. In most arrays the spots should be of the same size, so very large or very small spots may indicate problems (Wang, 2007). Eliminating or marking poor-quality and low-intensity spots is called flagging. The flag is zero if the spot is good and takes other values if the spot has problems. Different image processing software packages use different flag values for different problems, but the typical flagged spots are:

1. Bad spot: The pixel standard deviation is considerably higher than the pixel mean.
2. Dark spot: The signal of the spot is very weak.
3. Negative spot: The signal of the spot is less than the background value.
4. Manually flagged spot: The user has flagged the spot using the image processing software (Stekel, 2003).

After the four steps of image processing have been performed, the software writes the resulting quantitative parameters to an output file, as shown in Table 4. These parameters include location information, foreground and background quantifications, and quality measures for a microarray experiment.
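The flagging rules listed in this section, together with the circularity measure defined in the caption of Table 4, can be expressed as a small Python sketch. Since flag values differ between software packages, the codes 0 to 4, the thresholds `dark_cutoff` and `sd_factor`, and the function names are all assumptions made for this example.

```python
import math


def circularity(area, perimeter):
    """Circularity = 4 * pi * area / perimeter**2 (equals 1 for a circle)."""
    return 4 * math.pi * area / perimeter ** 2


def flag_spot(fg_mean, fg_sd, bg_median,
              dark_cutoff=100.0, sd_factor=1.0, manual=False):
    """Return a hypothetical flag code for one spot.

    0 = good, 1 = bad, 2 = dark, 3 = negative, 4 = manually flagged.
    """
    if manual:
        return 4                      # flagged by the user
    if fg_mean < bg_median:
        return 3                      # signal below background
    if fg_mean - bg_median < dark_cutoff:
        return 2                      # signal very weak
    if fg_sd > sd_factor * fg_mean:
        return 1                      # pixel SD exceeds pixel mean
    return 0
```

For example, `flag_spot(5000, 400, 300)` returns 0 for a bright, homogeneous spot, while `flag_spot(200, 50, 300)` returns 3 because the foreground is below the background.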

#### **6. Preprocessing of cDNA microarray data**

Before differentially expressed genes (DEGs) are identified, the data collected in the image processing step need to be preprocessed. This important step in microarray data analysis removes non-biological variation, makes the data more meaningful, transforms the data onto an appropriate scale for analysis, and improves the quality of subsequent analyses. Preprocessing approaches include background correction, logarithmic transformation, and normalization of the microarray data. Spot quality assessment (Section 5.4), although part of image processing, could also be considered a preprocessing step.

#### **6.1 Background correction**

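Background correction is a necessary step in the preprocessing of cDNA microarray data, since the quantified fluorescence intensity of a spot contains background noise that does not reflect true hybridization of the target to the probe. Background noise arises from several sources, such as non-specific hybridization of labeled target to the array surface, autofluorescence from the array surface or the detection instrument, and spatial heterogeneity across the arrays (Ritchie et al., 2007; Tarca et al., 2006). For the purpose of background correction, it is conventionally assumed that the background signal is additive to the foreground signal (Ritchie et al., 2007). The standard approach is therefore to subtract an estimate of the local background intensity from the foreground intensity. Although this approach is widely implemented in software packages, it can cause problems. If the background intensity is larger than the foreground intensity, it generates negative corrected intensities and hence missing log ratios. Even when no values are missing, it yields highly variable log ratios for low-intensity spots (Kooperberg et al., 2002), and it may cause difficulties in the identification of differentially expressed genes (Yang et al., 2001). To overcome these limitations, alternative approaches have been proposed, such as subtractive correction using an estimate of the global instead of the local background, and morphological opening filters, which provide less variable log ratios (Ritchie et al., 2007; Yang et al., 2001).

Some methods use statistical models, rather than subtraction, to adjust for the background. A simpler background correction method was proposed to avoid negative corrected values: it adjusts the foreground intensity by subtracting the background when the difference between foreground and background is larger than a threshold value, and replaces subtraction with a smooth monotonic function when the difference is smaller than the threshold. Kooperberg et al. (2002) proposed an empirical Bayes model to correct for background noise. A notable feature of this method is that it uses only the mean, median, and standard deviation statistics for each spot, which are provided by the scanning software. Other methods are based on variance-stabilizing transformations, which incorporate additive components that prevent negative intensities. These models use an arcsinh function instead of the logarithmic transformation of the data, and they perform background correction and normalization simultaneously on all the arrays together (Kooperberg et al., 2002; Ritchie et al., 2007). Notably, omitting background correction altogether has also been recommended, since local background methods sometimes show greater variability for low-intensity spots than no background adjustment (Yang et al., 2001).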
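The two problems of simple local background subtraction mentioned in this section, missing log ratios and unstable log ratios for weak spots, can be reproduced with a few lines of Python. The function name `subtracted_log_ratio` and the intensity values are made up for illustration. Returning `None` for a missing value is a convention of this sketch only.

```python
import math


def subtracted_log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log2 ratio after subtracting the local background from each channel.
    Returns None when a corrected intensity is not positive, because the
    logarithm is then undefined (a "missing" log ratio)."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


# Bright spot: corrected intensities 1000 and 500 give a log ratio of 1.
bright = subtracted_log_ratio(1300, 300, 800, 300)

# Background above foreground in the red channel: log ratio is missing.
missing = subtracted_log_ratio(280, 300, 800, 300)

# Weak spot: raw intensities 310 and 305 are almost equal, yet the
# corrected values 10 and 5 also give a log ratio of 1.
weak = subtracted_log_ratio(310, 300, 305, 300)
```

For the weak spot, a change of a few intensity units in either background estimate would alter the log ratio substantially, which is the instability that the alternative methods above aim to reduce.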

#### **6.2 Logarithm transformation**

Before normalization, a logarithmic transformation is often applied to microarray data. This transformation reduces some of the variation and makes the multiplicative noise in the data additive. Taking logarithms also transforms the expression ratios into an approximately symmetric, normal-like distribution centred on zero, which means that up- and down-regulated genes are treated in the same way (Quackenbush, 2002). However, log-transformed ratios limit subsequent analyses and the amount of information gained from the data (Zhao et al., 2007). Ratios provide no information about absolute expression levels. In addition, ratios depend strongly on the choice of reference sample, which is uncharacterized and not accurately reproducible. This makes it difficult to compare data sets that use different reference samples (Zhao et al., 2007).
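A short numerical example shows why the log scale treats up- and down-regulation symmetrically. The function name `log2_ratio` and the intensities below are invented for illustration. A two-fold increase and a two-fold decrease give raw ratios of 2 and 0.5, which are not symmetric around 1, but log2 ratios of +1 and -1, which are symmetric around 0.

```python
import math


def log2_ratio(red, green):
    """Log2 of the red/green intensity ratio for one spot."""
    return math.log2(red / green)


up = log2_ratio(2000, 1000)         # two-fold up:   raw ratio 2.0
down = log2_ratio(1000, 2000)       # two-fold down: raw ratio 0.5
unchanged = log2_ratio(1000, 1000)  # no change:     raw ratio 1.0
```

The three values are 1.0, -1.0, and 0.0, respectively.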

#### **6.3 Normalization**

Many sources of variation act between the start of the experimental process and the generation of raw data in a microarray experiment. They fall into two classes: biological variation and procedural variation. Biological variation is the consequence of environmental changes or biological differences in the genes studied on the array. This variation is of interest and represents true changes in expression. Procedural variation can be attributed to many sources, such as microarray fabrication, mRNA preparation, reverse transcription, labelling, amplification, pin geometry, fluctuations in target volume, target fixation, hybridization parameters, overshining, and image analysis. Each source of variation is described in detail elsewhere (Schuchhardt et al., 2000; Yang et al., 2002b). Procedural variation can be removed (or minimized) using statistical approaches, so that biological variation is detected more accurately. The processes and transformations used to adjust the data in this way are referred to as normalization. Normalization is therefore a crucial step in microarray data preprocessing, since data interpretation and the identification of DEGs depend on the choice of normalization method (Yang et al., 2002b).

Different biases arise from variation in the microarray data. The most common is dye bias, i.e. an imbalance between the two channels caused by differences in the physical properties and detection efficiencies of the fluorescent dyes. Other biases, such as print-tip bias and spatial bias, may arise from variation between spatial positions on the array due to differences between the print tips on the arrayer (Smyth & Speed, 2003). To remove these biases, numerous normalization approaches have been proposed. The algorithms can be applied either globally to an entire data set or locally to a subset of the data. For spotted cDNA microarrays, local normalization is often applied to each print-tip group (Quackenbush, 2002). Normalization methods can be divided into two main categories: within-array normalization and between-array normalization. Within-array normalization is performed to adjust for procedural variation within each single microarray. Some of the more common approaches are as follows.

#### **6.3.1 Global normalization**

Global normalization is the simplest and most common within-array normalization method. It assumes that the red and green intensities are related by a constant factor k, namely R = kG. The



Table 4. Partial output file from the Spot software for the green (Cy3) channel. Location information: spot index, grid row, grid column, spot row, and spot column. Area: the number of foreground pixels for each spot. Gmean: the average of the foreground pixel values. Gmedian: the median of the foreground pixel values. GIQR: the interquartile range of the foreground pixels. bgGmean: the average of the background pixel values. bgGmed: the median of the background pixel values. bgGSD: the standard deviation of the background pixel values. valleyG: the background intensity estimate from the local background valley method. morphG: the background estimate using morphological opening. morphG.erode: the green background estimate using morphological erosion. morphG.close.open: the green background estimate using morphological closing-opening. Logratio: the log-ratio for each spot, calculated as $\log_2(\mathit{Rmean} - \mathit{bgmedR}) - \log_2(\mathit{Gmean} - \mathit{bgmedG})$. Circularity: the shape of the spot, defined as $4\pi \times \mathit{Area} / \mathit{perimeter}^2$. Badspot: equals 1 if the spot has a problem and 0 otherwise.

morphological opening filters which provide less variable log ratios (Ritchie et al., 2007; Yang et al., 2001). Some methods utilize statistical models, other than subtraction, to adjust the background estimate. A simpler background correction method was proposed to avoid negative corrected values. This model adjusts the foreground intensities by subtracting the background when the difference between the foreground and background is larger than a threshold value. However, when the difference is less than the threshold, subtraction is replaced by a smooth monotonic function. Kooperberg et al., 2002 proposed an empirical bayes model to correct background noise. A remarkable feature of this method is only the use of the mean, median and standard deviation statistics for each spot that are provided through the scanning software. In other methods, the models based on variance stabilizing transformations were proposed for incorporating additive components which prevent negative intensities. The Models use an arcsinh function instead of the logarithm transformation of the data. Also background correction and normalization are simultaneously performed on all the arrays together (Kooperberg et al., 2002; Ritchie et al., 2007). It is notable that no background correction has been recommended. Sometimes, local background methods show greater variability around the low intensity spots rather than no background adjustment (Yang et al., 2001).

#### **6.2 Logarithm transformation**

592 Bioinformatics – Trends and Methodologies

Index grid.r grid.c spot.r spot.c Area Gmean Gmedian GIQR bgGmean 1 1 1 1 1 95 22028.26 23219 0.564843 372.6964 2 1 1 1 2 85 25613.2 20827 0.672128 928.8974 3 1 1 1 3 77 22652.39 17498 0.939413 1371.86 4 1 1 1 4 21 8929.286 5270 1.975485 250.5417 5 1 1 1 5 21 8746.476 7396 2.518724 262.0417 6 1 1 1 6 112 37010.08 41539 0.943238 499.1722

307 0.252131 306 182 153 289 -0.17171 40 0.746128 0 299 0.390198 280 171 153 278 -0.16341 36 0.824183 0 339 0.820078 275 153 136 278 -0.15408 34 0.837033 0 244 0.270411 258 153 132 271 0.80675 16 1.030835 0 235 0.275412 244 153 132 216 -0.10662 16 1.030835 0 304 0.740031 244 139 120 224 -0.44679 44 0.72698 0 381 1.05041 243 138 120 224 -0.21073 34 0.739198 0 Table 4. Partial output file from Spot software for green (Cy3) channel: Location information

( spot index, grid row, grid column, spot row and spot column), Area: the number of foreground pixels for each spot, Gmean: the average of foreground pixel values, Gmedian: the median of foreground pixel values, GIQR: the interquartile range of foreground pixel, bgGmean: the average of background pixel values, bgGmed: the median of background pixel values, bgGSD: the standard deviation of background pixel values , valleyG: the background intensity estimate from the local background valley method, morphG: background estimate using morphological opening, morphG.erode: green background estimate using morphological erosion, morphG.close.open: green background estimate using morphological closing-opening, Logratio: the log-ratio for each spot is calculated

<sup>−</sup> , Circularity: Shape of spot defined as 2

morphological opening filters which provide less variable log ratios (Ritchie et al., 2007; Yang et al., 2001). Some methods utilize statistical models, other than subtraction, to adjust the background estimate. A simpler background correction method was proposed to avoid negative corrected values. This model adjusts the foreground intensities by subtracting the background when the difference between the foreground and background is larger than a threshold value. However, when the difference is less than the threshold, subtraction is replaced by a smooth monotonic function. Kooperberg et al., 2002 proposed an empirical bayes model to correct background noise. A remarkable feature of this method is only the use of the mean, median and standard deviation statistics for each spot that are provided through the scanning software. In other methods, the models based on variance stabilizing transformations were proposed for incorporating additive components which prevent negative intensities. The Models use an arcsinh function instead of the logarithm transformation of the data. Also background correction and normalization are simultaneously performed on all the arrays together (Kooperberg et al., 2002; Ritchie et al., 2007). It is notable that no background correction has been recommended. Sometimes, local

open

Logratio Perimeter Circularity Badspot

4 *Area primeter* × × π

, Badspot: If the

bgGmed bgGSD Valley morphG morphG.erode morphG.close.

as <sup>2</sup> 2

log ( ) log ( ) *Rmean bgmedR Gmean bgmedG* −

spot has problem, it equals to 1, otherwise 0.

Before normalization, a logarithmic transformation is often applied to microarray data. This transformation reduces some of the variation and makes the multiplicative noise in the data additive. Taking logarithms also yields ratios that are distributed symmetrically, and approximately normally, around zero, so that up- and down-regulated genes are treated in an identical way (Quackenbush, 2002). However, log-transformed ratios limit subsequent analyses and the amount of information gained from the data (Zhao et al., 2007). Ratios do not provide information about absolute expression levels. Moreover, ratios depend strongly on the choice of the reference sample, which is typically uncharacterized and not accurately reproducible. This makes it difficult to compare data sets that use different reference samples (Zhao et al., 2007).
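The symmetry argument can be illustrated with a minimal sketch. The function name `log_ratio` and the intensity values are hypothetical and chosen only for illustration; they are not taken from any data set discussed here.

```python
import math


def log_ratio(red, green):
    """Return M = log2(red / green) for background-corrected intensities."""
    return math.log2(red / green)


# As raw ratios, a twofold increase (2.0) and a twofold decrease (0.5)
# are asymmetric around the "no change" value of 1.0.
# After the log2 transform they are symmetric around zero.
up = log_ratio(2000.0, 1000.0)         # raw ratio 2.0
down = log_ratio(1000.0, 2000.0)       # raw ratio 0.5
unchanged = log_ratio(1500.0, 1500.0)  # raw ratio 1.0

print(up, down, unchanged)
```

With these toy values the up- and down-regulated spots receive log ratios of equal magnitude and opposite sign, which is the property that lets later steps treat both directions identically.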

#### **6.3 Normalization**

Many sources of variation arise between the beginning of the experimental process and the generation of raw data in a microarray experiment. They fall into two groups: biological variations and procedural variations. Biological variations are the consequence of environmental changes or biological differences in the studied genes on the array. These are the variations of interest and represent the true changes in expression. Procedural variations can be attributed to many sources such as microarray fabrication, mRNA preparation, reverse transcription, labelling, amplification, pin geometry, fluctuations in target volumes, target fixation, hybridization parameters, overshining, and image analysis. A detailed description of each source of variation is presented elsewhere (Schuchhardt et al., 2000; Yang et al., 2002b). Procedural variations can be removed (or minimized) using statistical approaches, so that biological variations are detected more accurately. The processes and transformations used to adjust the data for this purpose are referred to as normalization. Normalization is therefore a crucial step in microarray data preprocessing, since data interpretation and the identification of DEGs depend on the choice of normalization method (Yang et al., 2002b).

Different biases arise from variations in the microarray data. The most common is dye bias, i.e. an imbalance between the two channels due to differences in the physical properties and detection efficiencies of the two fluorescent dyes. Other biases, such as print-tip bias and spatial bias, may arise from variation between spatial positions on the array and from differences between the print tips on the arrayer (Smyth & Speed, 2003). Numerous normalization approaches have been proposed to remove these biases. The algorithms can be applied either globally to an entire data set or locally to a subset of the data. For cDNA spotted microarrays, local normalization is often applied to each print tip (Quackenbush, 2002). Normalization methods can be divided into two main categories: within-array normalization and between-array normalization. Within-array normalization is performed to adjust for procedural variations within each single microarray. Some of the more common approaches are as follows.

#### **6.3.1 Global normalization**

Global normalization is the simplest and most common within-array normalization method. It assumes that the red and green intensities are related by a constant factor k, namely R = kG. The log-ratios are corrected by subtracting a constant $c = \log_2(k)$ to obtain normalized values. If $R_i$ and $G_i$ are the background-corrected red and green intensities of spot i, then:

$$
\left[\log_2\left(\frac{R_i}{G_i}\right)\right]_{\text{normalized}} = \log_2\left(\frac{R_i}{G_i}\right) - c = \log_2\left(\frac{R_i}{G_i}\right) - \log_2(k) \tag{1}
$$

The global constant c is usually estimated from the mean or median of the log ratios over a subset of genes assumed not to be differentially expressed, although a variety of strategies have been proposed for estimating it (Quackenbush, 2002; Smyth & Speed, 2003). The global method cannot correct intensity-dependent dye bias or spatial bias.
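Equation (1) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `global_normalize` is hypothetical, the constant c is estimated as the median log ratio, and every spot is treated as belonging to the non-differentially-expressed subset, which a real analysis would need to justify.

```python
import math
import statistics


def global_normalize(red, green):
    """Subtract the median log2 ratio (the constant c) from every spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(m)  # estimate of c = log2(k)
    return [mi - c for mi in m], c


# Toy data (assumed): red is uniformly twice as bright as green (k = 2),
# except for the last spot, which is genuinely up-regulated.
green = [100.0, 200.0, 400.0, 800.0, 1600.0]
red = [200.0, 400.0, 800.0, 1600.0, 12800.0]

normalized, c = global_normalize(red, green)
print(c)           # estimated constant
print(normalized)  # log ratios after removing the global dye imbalance
```

The median is used here rather than the mean because it is less affected by the single strongly regulated spot.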

#### **6.3.2 Intensity-dependent linear normalization**

In most cases, the dye bias depends on spot intensity, either linearly or nonlinearly. The most common way to visualize the behavior of the two channels is the MA plot, which plots log intensity ratios (M) against log intensity averages (A), where M and A are usually defined for each gene as

$$M_i = \log_2\left(\frac{R_i}{G_i}\right) \quad \text{and} \quad A_i = \frac{1}{2} \times \log_2\left(R_i \times G_i\right)$$

Linear normalization assumes that the relation between M and A is linear, following the model $M = \beta_0 + \beta_1 A$, where $(\beta_0, \beta_1)$ can be estimated by least squares.
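As a rough sketch of linear normalization, the code below computes M and A and removes a straight-line trend fitted by ordinary least squares. The function names `ma_values` and `linear_normalize` and the toy numbers are assumptions made for illustration.

```python
import math


def ma_values(red, green):
    """Return the lists of M (log ratio) and A (average log intensity)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def linear_normalize(m, a):
    """Fit M = b0 + b1 * A by least squares and return the residuals."""
    n = len(m)
    mean_a = sum(a) / n
    mean_m = sum(m) / n
    sxx = sum((x - mean_a) ** 2 for x in a)
    sxy = sum((x - mean_a) * (y - mean_m) for x, y in zip(a, m))
    b1 = sxy / sxx
    b0 = mean_m - b1 * mean_a
    residuals = [y - (b0 + b1 * x) for x, y in zip(a, m)]
    return residuals, (b0, b1)


# One spot with R = 1024 and G = 256 gives M = 2 and A = 9.
m_one, a_one = ma_values([1024.0], [256.0])

# Toy MA data (assumed) with a purely linear dye bias M = 0.5 + 0.1 * A.
a = [8.0, 10.0, 12.0, 14.0]
m = [0.5 + 0.1 * x for x in a]
residuals, (b0, b1) = linear_normalize(m, a)
print(b0, b1, residuals)
```

Because the toy bias is exactly linear, the normalized values (the residuals) are zero up to rounding error; real data would leave the biological signal in the residuals.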

#### **6.3.3 Intensity dependent nonlinear normalization**

The most efficient and widely used nonlinear normalization approach was proposed by Yang et al. (2002b). It models the relation between M and A as a general function of A, i.e. M = c(A), instead of a linear relation. The function c(A) is estimated by applying a loess (locally weighted scatter plot smoother) function to the MA plot. The smoother performs local linear fits in overlapping windows of the data and then combines the regressions to produce a smooth curve. The method comes in three variants, depending on how the data are treated: global loess, print-tip loess, and two-dimensional loess. Global loess (Gloess) normalization uses the loess function to perform a local A-dependent correction:

$$\left[\log_2\left(\frac{R_i}{G_i}\right)\right]_{\text{normalized}} = \log_2\left(\frac{R_i}{G_i}\right) - c(A) = \log_2\left(\frac{R_i}{G_i}\right) - \log_2(k(A)) \tag{2}$$

Where c(A) is the loess fit to the MA plot for all printed genes (Smyth & Speed, 2003). Print tip loess (PTloess) is performed within each of the print tip groups separately as follows:

$$\left[\log_2\left(\frac{R_i}{G_i}\right)\right]_{\text{normalized}} = \log_2\left(\frac{R_i}{G_i}\right) - c_p(A) = \log_2\left(\frac{R_i}{G_i}\right) - \log_2(k_p(A)) \tag{3}$$

Where $c_p(A)$ is the loess fit as a function of A for the $p$-th print tip. Fitting a separate loess curve for each print-tip group, and correcting each intensity by its corresponding curve, removes not only the dye bias but also the print-tip bias. The two-dimensional loess (twoDloess) method fits a smooth two-dimensional surface to the data as a function of the overall row position r and column position c of the spot on the array. The intensity-based trend is assumed to be global rather than varying across the array as in print-tip loess normalization.

$$
\left[\log_2\left(\frac{R_i}{G_i}\right)\right]_{\text{normalized}} = \log_2\left(\frac{R_i}{G_i}\right) - \text{loess}(r, c) \tag{4}
$$

Where loess(r, c) is a loess fit calculated from the positions of the spots. The three techniques remove different biases arising from the experiment: Gloess removes the dye bias that depends on spot intensity, PTloess removes the spatial bias introduced by the print tips, and twoDloess removes the spatial bias over the whole slide.
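Equations (2) and (3) can be sketched with a deliberately simplified smoother. The code below is not the loess implementation used by any particular package: it fits a tricube-weighted local line at each point, without the robustness iterations that production loess routines usually add. The names `loess_fit`, `global_loess` and `print_tip_loess`, the `span` parameter and the toy data are all assumptions for illustration.

```python
def loess_fit(x, y, span=0.5):
    """Evaluate a tricube-weighted local linear fit at every x value."""
    n = len(x)
    k = max(2, min(n, round(span * n)))  # points per local window
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point (x0 itself included).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def global_loess(m, a, span=0.5):
    """Equation (2): subtract one loess curve c(A) fitted to all spots."""
    return [mi - ci for mi, ci in zip(m, loess_fit(a, m, span))]


def print_tip_loess(m, a, tips, span=0.5):
    """Equation (3): subtract a separate curve c_p(A) per print-tip group."""
    normalized = [0.0] * len(m)
    for tip in set(tips):
        idx = [i for i, t in enumerate(tips) if t == tip]
        fit = loess_fit([a[i] for i in idx], [m[i] for i in idx], span)
        for i, ci in zip(idx, fit):
            normalized[i] = m[i] - ci
    return normalized


# Toy data (assumed): two print tips with different intensity-dependent bias.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
tips = [1, 1, 1, 1, 2, 2, 2, 2]
m = [0.1 * ai if t == 1 else 1.0 - 0.05 * ai for ai, t in zip(a, tips)]

by_tip = print_tip_loess(m, a, tips, span=1.0)
print(by_tip)
```

In this toy case each print tip has its own bias curve, so the per-tip fit removes the bias completely, whereas a single global curve would not. Real arrays have hundreds of spots per print tip, and a smaller `span` would normally be used.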

The normalization methods above are applied to a single microarray. Differences between arrays may arise from differences in print quality or in ambient conditions when the plates are processed (Smyth & Speed, 2003). To facilitate the comparison and integration of different microarrays, the variability caused by using multiple microarrays must also be removed. This can be done with the following approaches.

#### **6.3.4 Scale normalization**


This method is a simple scaling of the data on multiple arrays so that each array has an identical median absolute deviation (MAD). It aims to remove scale differences in the data and assumes that the log ratios on array j follow a normal distribution with mean zero and variance $a_j^2\sigma^2$, where $\sigma^2$ is the variance of the true log ratios and $a_j$ is the scale factor for array j, with n denoting the total number of arrays (Yang et al., 2002b). The scale factor is estimated as

$$a_j = \frac{\mathrm{MAD}_j}{\sqrt[n]{\prod_{k=1}^{n} \mathrm{MAD}_k}} \tag{5}$$

where $\mathrm{MAD}_j$ denotes the median absolute deviation for array j:

$$\mathrm{MAD}_j = \operatorname{median}_i \left\{ \left| M_{ij} - \operatorname{median}_i(M_{ij}) \right| \right\} \tag{6}$$

Here $M_{ij}$ is the log ratio of spot i on array j. Finally, the log ratios on each array are scaled by dividing them by the scale factor of that array. Scale normalization can also be applied locally within a microarray, at the print-tip level.
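Equations (5) and (6) translate directly into code. The sketch below is illustrative only: the function names `mad` and `scale_normalize` are hypothetical, and the two toy arrays are assumed to be already centered at zero.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation of one array's log ratios (equation 6)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def scale_normalize(arrays):
    """Divide each array by a_j = MAD_j / geometric mean of all MADs."""
    mads = [mad(col) for col in arrays]
    geometric_mean = math.prod(mads) ** (1.0 / len(mads))
    factors = [v / geometric_mean for v in mads]  # equation 5
    scaled = [[x / f for x in col] for col, f in zip(arrays, factors)]
    return scaled, factors


# Toy log ratios (assumed): the second array is four times more spread out.
array_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
array_2 = [-8.0, -4.0, 0.0, 4.0, 8.0]

scaled, factors = scale_normalize([array_1, array_2])
print(factors)
print([mad(col) for col in scaled])
```

After scaling, both arrays have the same MAD, which equals the geometric mean of the original MADs.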

#### **6.3.5 Quantile normalization**

Quantile normalization was initially developed for Affymetrix single-channel chips and then extended to two-color cDNA microarrays. The goal of this method is to produce the same empirical distribution of expression levels on all arrays analyzed. It relies on the assumption that the distribution of probe intensities is the same on every array, regardless of biology or study design. Clearly, the situation where all samples have equal amounts of expressed genes is the exception, not the rule, so quantile normalization will normalize data without introducing errors only in rare cases. Quantile normalization is carried out through the following steps. Suppose that we have the (log base 2 transformed) probe-level expression values from p genes and n arrays in a p × n matrix $X = \{X_{ij}\}$ with i = 1,2,…,p and j = 1,2,…,n. First, each column of $X$ is sorted separately to generate a p × n matrix $Y = \{Y_{ij}\}$. Next, the average of each row of $Y$ is computed to give the vector $X_m$. $X_m$ is assigned to each column of $Y$ to obtain a matrix denoted $X_{sort}$. Finally, the normalized values for each array are obtained by rearranging each column of $X_{sort}$ to have the same ordering as the corresponding column of $X$, so that the empirical distributions of the normalized values are the same across arrays. Because the algorithm consists only of sorting and averaging operations, it runs quickly, even on large data sets (Bolstad et al., 2003). More details can be found in Stafford (2008).
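The steps above can be sketched as follows. The function name `quantile_normalize` and the toy values are assumptions, the data are stored as a list of columns (one per array), and ties within a column are not given any special treatment in this simplified version.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of arrays, each holding p expression values."""
    p = len(columns[0])
    n = len(columns)
    # Step 1: sort each column separately (matrix Y).
    sorted_cols = [sorted(col) for col in columns]
    # Step 2: average across arrays at each rank (vector X_m).
    row_means = [sum(col[i] for col in sorted_cols) / n for i in range(p)]
    # Step 3: put the rank averages back in each column's original order.
    normalized = []
    for col in columns:
        order = sorted(range(p), key=lambda i: col[i])  # indices by rank
        out = [0.0] * p
        for rank, i in enumerate(order):
            out[i] = row_means[rank]
        normalized.append(out)
    return normalized


# Toy data (assumed): p = 3 genes measured on n = 2 arrays.
array_1 = [5.0, 2.0, 3.0]
array_2 = [4.0, 1.0, 6.0]

result = quantile_normalize([array_1, array_2])
print(result)
```

Both output arrays contain the same set of values (the rank averages), arranged according to each array's original gene ordering, so their empirical distributions are identical.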

All of the above normalization methods rely on certain critical biological and statistical assumptions about the data distribution, which may not be valid in practice. The main assumptions are that most genes on the array are not differentially expressed between the two samples and that the number of up-regulated genes approximately equals the number of down-regulated genes. When these assumptions are violated, the above-mentioned normalization methods may yield unreliable results. Xiong et al. (2008) proposed a novel statistical method, based on the Generalized Procrustes Analysis (GPA) algorithm, that is free of these assumptions.

#### **7. Differential gene expression**

Once the data are normalized, further analysis is necessary to obtain biologically meaningful results. The main purpose of a microarray experiment is to identify genes that are significantly differentially expressed under different biological and/or clinical conditions. A growing number of approaches have been presented for this purpose. They can be divided into three categories: marginal filters, wrappers, and embedded methods. Wrapper and embedded methods are search algorithms that generate subsets of genes useful for defining a good predictor; a specific subset of genes is evaluated by running a specific classification model on it. Filter approaches, which include t-tests, nonparametric tests and analysis of variance (ANOVA), are fast, scalable and independent of the classification algorithm (Saeys et al., 2007). We provide a brief overview of some of the popular statistical methods for differential expression. Note that different methods usually produce differently ordered lists of significant genes, since each approach is based on a specific set of assumptions and takes certain features of the dataset into account.

Fold change (FC) cutoff is one of the earliest approaches for DEG identification and is still widely used to rank genes in microarray assays. In this method, a gene is said to be differentially expressed when the ratio of its two color intensities exceeds a pre-set threshold. A threshold of twofold up- or down-regulation is used as the cutoff value in most biological studies. The method ranks genes by the ratio of their average expression under two different groups or conditions. Simplicity is the main reason for the popularity of the fold change approach. Its major drawback is that it does not consider the variance of the measured expression values. To cope with this problem, it is usually used in combination with other statistical methods (Tarca et al., 2006). A variety of statistical tests can also be used instead of a fold change cutoff to select differentially expressed genes. A simple but popular method is the t-test and its variants (Cui & Churchill, 2003). The t-test estimates the population variance for a gene from the sample variance of its expression levels. It typically compares the mean expression levels of the two groups while taking the variability of each gene into account in the ranking (Tarca et al., 2006). The t-test depends on the distribution of the gene expression data, and may therefore perform poorly when the data depart strongly from the normal distribution. The performance of the t-test is also poor when sample sizes are small, because variance estimation is then more challenging (Yan et al., 2005).
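The difference between ranking by fold change and ranking by a t-statistic can be seen in a small sketch. The choice of Welch's unequal-variance t-statistic is an assumption made here for illustration (the text refers only to "the t-test and its variants"), and the function names and log2 expression values are hypothetical. No p-value is computed.

```python
import math
import statistics


def log2_fold_change(group_1, group_2):
    """Difference of mean log2 expression, i.e. log2 of the fold change."""
    return statistics.mean(group_1) - statistics.mean(group_2)


def welch_t(group_1, group_2):
    """Welch's t-statistic: mean difference divided by its standard error."""
    se = math.sqrt(statistics.variance(group_1) / len(group_1)
                   + statistics.variance(group_2) / len(group_2))
    return log2_fold_change(group_1, group_2) / se


# Toy log2 expression values (assumed). Both genes show a twofold change
# (log2 FC = 1), but gene B is far noisier than gene A.
gene_a = ([9.0, 9.25, 8.75], [8.0, 8.125, 7.875])
gene_b = ([9.0, 11.0, 7.0], [8.0, 10.0, 6.0])

fc_a, fc_b = log2_fold_change(*gene_a), log2_fold_change(*gene_b)
t_a, t_b = welch_t(*gene_a), welch_t(*gene_b)
print(fc_a, fc_b)  # identical fold changes
print(t_a, t_b)    # very different t-statistics
```

A fold change cutoff cannot distinguish the two genes, while the t-statistic ranks the consistently measured gene A well above the noisy gene B.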

The ANOVA approach is a generalization of the t-test that can be used when more than two conditions are compared. The idea underlying ANOVA is to build a model that considers the sources of variation that affect the measurements. The variance of each individual variable in the model is then computed from the expression data (Tarca et al., 2006). To improve the performance of the ordinary t-test and produce more stable results, modified t-statistics have been proposed as alternatives. The main difference between the ordinary t-statistic and these newer statistics is that the latter estimate variability using information not only from the gene being tested, but also from other genes displaying a similar magnitude of expression level (Smyth, 2004). Two commonly used modified t-statistic methods, empirical Bayes and SAM, are described in more detail as follows.

#### **7.1 The empirical Bayes t-test LIMMA**

596 Bioinformatics – Trends and Methodologies

matrix*Y Y* = { <sup>1</sup> *<sup>j</sup>*} . Next, the average of each row of *Y* is computed and generated *Xm* . *Xm* is assigned to each column of Y to get a matrix denoted as *Xsort* . Finally, the normalized genes for each array is provided by rearranging each column of *Xsort* to have the same ordering as the corresponding column of the matrix *X* so that empirical distributions of the normalized genes are the same across arrays. Because the algorithm consists of only sorting and averaging operations, it runs quickly, even with large data sets (Bolstad et al., 2003). (More

All above normalization methods utilize certain critical biological and statistical assumptions about data distribution which may not be valid in practice. The main assumption is that the most genes on the array are non-differentially expressed between the two samples and the number of up-regulated genes approximately equals the number of down-regulated genes. In such cases, above-mentioned normalization methods may yield unreliable results. Xiong et al., 2008 proposed a novel statistical method based on the Generalized Procrustes Analysis (GPA) algorithm free of assumption (Xiong et al., 2008).

details in (Stafford, 2008))

#### **7. Differential gene expression**

Once the data are normalized, further analysis is necessary to obtain biologically meaningful results. In fact, the main purpose of a microarray experiment is to identify genes that are significantly differentially expressed under different biological and/or clinical conditions. A growing number of approaches have been presented for this purpose; they can be divided into three categories: marginal filters, wrappers, and embedded methods. Wrapper and embedded methods are search algorithms that generate subsets of genes useful for defining a good predictor. A specific subset of genes is evaluated by running a specific classification model on it. Filter approaches, which include t-tests, nonparametric tests and analysis of variance (ANOVA), are fast, scalable and independent of the classification algorithm (Saeys et al., 2007). We provide a brief overview of some popular statistical methods for differential expression. Notably, different methods usually produce differently ordered lists of significant genes, since each approach is based on a specific set of assumptions and takes certain features of the dataset into account.

Fold change (FC) cutoff is one of the early approaches for DEG identification and is still widely used to rank genes in microarray assays. In this method, a gene is said to be differentially expressed when the ratio of its two color intensities exceeds a pre-set threshold. A threshold of twofold up- or down-regulation is used as the cutoff value in most biological studies. The method ranks genes by the ratio of their average expression under two different groups or conditions. Simplicity is the main reason for the popularity of the fold change approach. Its major drawback is that it does not consider the variance of the measured expression values. Hence, to cope with this problem, it is used in combination with other statistical methods (Tarca et al., 2006). A variety of statistical tests can also be used instead of a fold change cutoff to select differentially expressed genes correctly. A simple but popular method is the t-test and its variants (Cui & Churchill, 2003). The t-test estimates the population variance of a gene from the sample variance of its expression levels. It typically compares the mean expression levels of the two groups while taking the variability of each gene into account in the ranking (Tarca et al., 2006). The t-test depends on the type of
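The contrast between a fold change cutoff and a gene-wise t-test can be illustrated with a minimal base R sketch. The data below are simulated, and all object names (`control`, `treated`, `log2_fc`, `fc_hits`) are hypothetical; the Welch t-test is used here as one possible variant, not necessarily the one a given study would choose.

```r
# Simulated log2 expression values: 6 genes x (4 control + 4 treated) arrays.
set.seed(1)
n_genes <- 6
control <- matrix(rnorm(n_genes * 4, mean = 8, sd = 0.3), nrow = n_genes)
treated <- matrix(rnorm(n_genes * 4, mean = 8, sd = 0.3), nrow = n_genes)
treated[1, ] <- treated[1, ] + 2   # gene 1 is truly up-regulated (4-fold)

# Log2 fold change: difference of the group means on the log2 scale.
log2_fc <- rowMeans(treated) - rowMeans(control)

# Fold change filter: twofold up or down corresponds to |log2 FC| >= 1.
fc_hits <- which(abs(log2_fc) >= 1)

# Gene-wise Welch t-test, which also uses the within-group variance.
p_values <- sapply(seq_len(n_genes), function(i) {
  t.test(treated[i, ], control[i, ])$p.value
})

print(data.frame(gene = seq_len(n_genes),
                 log2_fc = round(log2_fc, 2),
                 p_value = signif(p_values, 3)))
```

The fold change filter looks only at the size of the difference, whereas the p-value also reflects how variable each gene is across replicates.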

This empirical Bayes t-test has been implemented in the limma R statistical package. In this approach, a separate linear model is fitted for each gene to represent the design of the microarray experiment. The coefficients of each linear model are then estimated from the expression data. Once the model coefficients and their standard errors have been estimated, moderated t, F and B (log-odds) statistics of differential expression are computed using an empirical Bayes approach. This is equivalent to shrinking the gene-wise sample variances towards a pooled estimate, which produces more stable results when the number of measurements in an experiment is small. Finally, genes can be ranked by whichever of these statistics is chosen. A more detailed derivation can be found in (Smyth, 2004).
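The variance shrinkage described above can be sketched in base R. The posterior variance formula follows Smyth (2004); the prior degrees of freedom `d0` and prior variance `s02` are fixed to hypothetical values here, whereas limma estimates them from all genes. The data are simulated and the sketch is not limma's implementation.

```r
# Simulated log-ratios (M-values): 1000 genes measured on 4 arrays.
set.seed(2)
n_genes <- 1000
n_arrays <- 4
M <- matrix(rnorm(n_genes * n_arrays, sd = 0.4), nrow = n_genes)

mean_M <- rowMeans(M)
s2 <- apply(M, 1, var)   # gene-wise sample variances
d <- n_arrays - 1        # residual degrees of freedom per gene

# Hypothetical prior values (assumptions for illustration only).
d0 <- 4
s02 <- mean(s2)

# Shrunken (posterior) variance: weighted average of prior and sample variance.
s2_post <- (d0 * s02 + d * s2) / (d0 + d)

# Ordinary and moderated t-statistics for the mean log-ratio of each gene.
t_ordinary <- mean_M / sqrt(s2 / n_arrays)
t_moderated <- mean_M / sqrt(s2_post / n_arrays)

# The shrunken variances are less spread out than the raw ones.
c(var_raw = var(s2), var_shrunken = var(s2_post))
```

Genes with very small sample variances no longer produce extreme t-statistics by chance, which is the stabilizing effect the text refers to.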

#### **7.2 Significance analysis of microarrays (SAM)**

SAM is a statistical technique proposed by Tusher et al. (2001). It uses non-parametric statistics, since the expression data may not be normally distributed. The modified t-statistic used in this method is essentially similar to the moderated t-statistic used in limma but has no associated distributional theory, whereas the empirical Bayes method provides a more complex model of the gene variance. SAM assigns a score to each gene based on the change in gene expression relative to the standard deviation over repeated measurements for that gene (Smyth, 2004). Genes with scores greater than a threshold are considered differentially expressed. The proportion of such genes identified by chance (false positives) is the false discovery rate (FDR), and the user sets the significance threshold on the basis of this FDR. To estimate the FDR, nonsense genes are identified using random permutations of the repeated measurements (Tusher et al., 2001; Yan et al., 2005).
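The scoring and permutation idea can be sketched in base R as follows. This is a simplified illustration rather than the SAM procedure itself: the fudge constant `s0` is set to the median standard error (an assumption), a fixed `threshold` replaces SAM's delta-based rule, and the data are simulated. The helper `diff_and_se` is hypothetical.

```r
set.seed(3)
n_genes <- 500
n_per_group <- 5
group <- rep(c(0, 1), each = n_per_group)
x <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
x[1:20, group == 1] <- x[1:20, group == 1] + 3   # 20 truly changed genes

# Mean difference and pooled standard error for each gene.
diff_and_se <- function(x, group) {
  a <- x[, group == 0, drop = FALSE]
  b <- x[, group == 1, drop = FALSE]
  na <- ncol(a)
  nb <- ncol(b)
  pooled <- (rowSums((a - rowMeans(a))^2) + rowSums((b - rowMeans(b))^2)) /
    (na + nb - 2)
  list(diff = rowMeans(b) - rowMeans(a),
       se = sqrt(pooled * (1 / na + 1 / nb)))
}

obs <- diff_and_se(x, group)
s0 <- median(obs$se)                 # assumed fudge constant
d_obs <- obs$diff / (obs$se + s0)    # SAM-type score per gene

threshold <- 1.5                     # assumed cutoff on |score|
called <- sum(abs(d_obs) >= threshold)

# Permute the sample labels to count how many genes pass by chance.
n_perm <- 100
false_calls <- replicate(n_perm, {
  perm <- diff_and_se(x, sample(group))
  sum(abs(perm$diff / (perm$se + s0)) >= threshold)
})

fdr_hat <- mean(false_calls) / max(called, 1)
c(called = called, estimated_fdr = fdr_hat)
```

Raising the threshold lowers the estimated FDR at the cost of calling fewer genes, which is the trade-off the user controls.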

#### **8. R and Bioconductor packages**

Microarray experiments produce large and highly complex datasets. Access to an efficient statistical computing environment is therefore critical for analyzing gene expression data. Many free and commercial software packages are available, and in most cases microarray kits come with software that analyses the data adequately. One of the best options for data analysis is the R statistical programming environment (www.r-project.org), where the open-source Bioconductor R packages (www.bioconductor.org) are resourceful and effective in dealing with these microarray data.

There are plenty of packages, such as limma, marray and arrayQuality for two-color spotted arrays, affy, affyPLM, affyPara and gcrma for Affymetrix arrays, and Agi4x44PreProcess and AgiMicroRna for Agilent chips. The complete documentation of Bioconductor packages can be found on the Bioconductor project web site at: http://www.bioconductor.org/help/bioc-views/release/bioc/. Bioconductor packages remove noise from microarray measurements by preprocessing the data, and they also specialize in various related data-handling tasks. Some packages facilitate and automate the input of array data and detect spatial and dye effects on arrays through a variety of diagnostic plots and graphs. In addition to the primary fluorescence intensity data, these packages also extract textual information on probe sequences and target samples, such as gene annotations, array layout, target sample descriptions and hybridization conditions.

The limma package implements tools for data quality assessment, background correction, normalization and identification of DEGs in microarray experiments. The marray package provides alternative functions for reading microarray data into R, normalizing the data and producing diagnostic plots of different measurements. The limma and marray packages share some features (Smyth & Speed, 2003; Yang et al., 2002b).

In the following case study, we demonstrate a microarray analysis workflow using Bioconductor R packages, covering experimental design, data preprocessing and differential expression detection. The analysis is performed using Bioconductor Release 2.7 based on R Version 2.9.

#### **8.1 Step by step microarray analysis**

The publicly available dataset from the Swirl zebrafish two-color spotted microarray experiment is used as a typical example in this analysis. Swirl is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. In this experiment, two sets of dye-swap arrays were prepared. On the first array, the wild type and mutant samples were labelled with Cy5 and Cy3 dyes, respectively. On the second array, the Cy5 and Cy3 dyes were swapped between the samples. The next two arrays were replicates of the first two arrays, respectively (Table 5). Thus, four arrays were prepared. Each array consisted of 16 print-tip groups (4 by 4), and each print-tip group comprised 22 by 24 spots, so each array carried 8448 spots. Once the experimental steps were carried out, each array was scanned and then analyzed with the SPOT software (Buckley, 2000). The main purpose of the Swirl experiment is to find genes with altered expression in the swirl mutant compared to wild type zebrafish.



In order to analyze the microarray data, a directory containing all the image processing output files (.spot files) should be created. This directory also includes a file containing the experiment description (SwirlSamples.txt) and a file describing the probe sequences, such as gene names and spot IDs (fish.gal). R is then started in the desired working directory. The following commands load the limma and marray packages for preprocessing the swirl data.

```
>library (limma) 
>library (marray)
```

| Date | FileName | Slide number | Conditions |
|---|---|---|---|
| 2001/9/20 | Swirl.1.spot | 81 | Swirl (Cy3), wild type (Cy5) |
| 2001/9/20 | Swirl.2.spot | 82 | Swirl (Cy5), wild type (Cy3) |
| 2001/11/8 | Swirl.3.spot | 93 | Swirl (Cy3), wild type (Cy5) |
| 2001/11/8 | Swirl.4.spot | 94 | Swirl (Cy5), wild type (Cy3) |

Table 5. All experiments for the study of Swirl mutant.


Information about the hybridizations and the raw fluorescent intensity data is read through the following commands:

```
>targets <- readTargets ("SwirlSamples.txt") 
>RG <- read.maimages (targets$FileName, source="spot")
```
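To make the role of the targets file concrete, the sketch below writes a small tab-delimited table to a temporary file and reads it back with base R. The rows are assembled from Table 5, but the column names other than `FileName` (which the command above relies on) are assumptions, and `read.delim` merely stands in for `readTargets` so that the example runs without limma.

```r
# Hypothetical targets table built from Table 5; column names other than
# FileName are assumptions made for illustration.
targets_text <- "SlideNumber\tFileName\tCy3\tCy5\tDate
81\tSwirl.1.spot\tswirl\twild type\t2001/9/20
82\tSwirl.2.spot\twild type\tswirl\t2001/9/20
93\tSwirl.3.spot\tswirl\twild type\t2001/11/8
94\tSwirl.4.spot\twild type\tswirl\t2001/11/8"

# Write the table to a temporary location and read it back.
path <- file.path(tempdir(), "SwirlSamples.txt")
writeLines(targets_text, path)
targets <- read.delim(path, stringsAsFactors = FALSE)

# The FileName column lists the image analysis output files to be read.
targets$FileName
```

Each row describes one hybridization, so the order of the rows determines the order of the arrays in the downstream objects.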
The gene names can be read with the following command,

>Genes <- readGAL("fish.gal")

and the layout information of the slides is obtained with this command,

>Layout <- getLayout (Genes)

Arrays can be assessed qualitatively using different plots and graphs, so that serious quality problems and sources of artifacts in the data can be identified. In this step, the background signal and different biases, such as dye bias and spatial bias, are evaluated using visualization techniques. The results of the quality assessment reveal which preprocessing methods are needed. Firstly, the background signal distribution is evaluated to identify whether any region of the array has a non-uniform distribution.

```
>imageplot (log2(RG$Rb[,1]), Layout, low="white", high="red") 
>imageplot (log2(RG$Gb[,1]), Layout, low="white", high="green")
```
Figure 11 shows that the background signals in both the red and green channels are unreliably high in some regions of the array. It can be concluded that there is spatial non-uniformity.

Fig. 11. Image of green and red channel background intensities for slide 1.


Background correction is therefore performed on the swirl data. Since the data contain no negative values, subtractive methods can be applied. Background-corrected M- and A-values are generated with the subtraction method as follows:

>MA <- normalizeWithinArrays (RG, method="none")
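The M- and A-values produced by this step can be written out explicitly. The sketch below uses made-up intensities for five spots (all values and object names are hypothetical) and plain base R instead of limma: the background is subtracted from each channel, M is the log2 ratio of red to green, and A is the average log2 intensity.

```r
# Toy foreground (R, G) and background (Rb, Gb) intensities for five spots.
R  <- c(1200, 800, 5000, 300, 15000)
G  <- c(1000, 900, 2500, 310, 14000)
Rb <- c(100, 120, 110, 90, 130)
Gb <- c(95, 110, 100, 100, 120)

# Subtractive background correction of each channel.
R_corr <- R - Rb
G_corr <- G - Gb

# Logarithms require strictly positive corrected intensities.
stopifnot(all(R_corr > 0), all(G_corr > 0))

M <- log2(R_corr) - log2(G_corr)         # log-ratio
A <- (log2(R_corr) + log2(G_corr)) / 2   # average log-intensity

round(cbind(M = M, A = A), 3)
```

A spot with equal corrected intensities in both channels has M = 0, which is why unchanged genes are expected to cluster around the zero line in an MA-plot.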

Secondly, we visualize the intensity range of M-values for each individual microarray using MA-plots. The signals include both background signals and foreground signals. These plots are generated using the following commands:

```
> plotMA (MA[,1], main="slide 1", ylim=c(-3,3)) 
> plotMA (MA[,2], main="slide 2", ylim=c(-3,3))
```
Figure 12 shows MA-plots of the raw data for two slides of the swirl experiment. The swirl experiment satisfies a major assumption of microarray experiments, namely that only a small percentage of genes are expected to be differentially expressed. The majority of the points should therefore lie near M = 0 on the y axis, since log(1) is 0. The shape of the curve for slide 1 shows a stronger non-linear dependence on the overall spot intensity than that for slide 2. Normalization will therefore attempt to remove the curvature of the spread and center the data around the zero axis.

Fig. 12. MA plots for slide 1 and slide 2 in swirl experiment.
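The curvature seen in such MA-plots is what intensity-dependent normalization removes. A minimal sketch with base R's `loess` is given below, assuming simulated M- and A-values with an artificial quadratic trend; it fits one global curve rather than one per print-tip group, and the `span` and `degree` settings are arbitrary choices for illustration.

```r
# Simulated MA data with an intensity-dependent trend in M.
set.seed(4)
n <- 2000
A <- runif(n, 6, 14)
M <- 0.08 * (A - 10)^2 - 0.1 + rnorm(n, sd = 0.2)

# Fit a smooth curve of M against A and subtract it from the data.
fit <- loess(M ~ A, span = 0.4, degree = 1)
M_norm <- residuals(fit)

# The normalized M-values are centered close to zero.
c(median_before = median(M), median_after = median(M_norm))
```

Plotting `M_norm` against `A` would show the cloud of points flattened along the zero axis.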

Another diagnostic plot is the boxplot, which is useful for comparing M-values and their homogeneity between print-tip groups and slides. Boxplots for the different print tips are plotted using the marray package after background-corrected data have been generated by the maNorm function.

```
>data (swirl)   # swirl data set (marrayRaw object) shipped with marray
>swirl.norm <- maNorm (swirl, norm="none") 
>boxplot (swirl.norm[,1], xvar="maPrintTip", yvar="maM", main="slide 1")
```
The following command also produces a boxplot of the pre–normalization M-values for all four arrays in the swirl experiment.

```
>boxplot (MA$M~col (MA$M), xlab="slides", ylab="M")
```

A boxplot graphically shows a five-number summary of the data: the median, the upper and lower quartiles, the range, and individual extreme values. The central box in the plot represents the interquartile range (IQR), which is defined as the difference between the 75th percentile and the 25th percentile. The length of the box reflects the variability of the data, and the solid line in the middle of the box represents the median. Extreme values, lying more than 1.5×IQR above the 75th percentile or more than 1.5×IQR below the 25th percentile, are plotted individually (54).
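The quantities behind a boxplot can be computed directly, as in the base R sketch below. The M-values are invented for illustration. Note that R's own `boxplot` computes the quartiles as hinges, which can differ slightly from the percentiles returned by `quantile`.

```r
# Eight hypothetical M-values, one of which is clearly extreme.
x <- c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 2.5)

# Quartiles and interquartile range.
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences at 1.5 x IQR beyond the quartiles.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values outside the fences are the ones plotted individually.
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```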

Fig. 13. Boxplots in different slides and print tips of slide 1.

In the next step, the data can be normalized within and between arrays using both the marray and limma packages. The pre-normalization MA-plot and boxplot for slide 1 in Figures 12 and 13 illustrate the non-linear dependence of the M-value on the overall spot intensity A and the existence of spatial biases. We thus perform print-tip loess (PTloess) normalization on these data. Scale normalization is then performed on the swirl data because the four slides have different spreads of M-values (Figure 14).

Fig. 14. Swirl data after PTloess normalization and PTloess following Scale normalization.

```
>MA <- normalizeWithinArrays (RG, layout=Layout) 
>MA <- normalizeBetweenArrays (MA, method="scale")
```
The closer the solid line of each box in Figure 14 is to the zero line, the better the data are centered after normalization. In the boxplot drawn across arrays, boxes of approximately the same length indicate that the distributions of the spots on the replicate arrays are similar, which means that the between-array normalization method has been selected appropriately.

In order to detect genes with differential expression between wild type and mutant samples, the linear model and empirical Bayes methods in the limma package are used (Smyth, 2004). The dye-swap samples are specified using the design matrix, which allows the average M-values across multiple arrays to be calculated.

```
>design <- c(-1,1,-1,1)
```
M-values are estimated between these two samples using the lmFit function.

```
>fit <- lmFit(MA,design)
```
Moderated t-statistics and log-odds (B-statistics) of the expression data are calculated using empirical Bayes methods.

```
> fit <- eBayes (fit)
```
A summary table of some statistics for the top genes is obtained using the following command.

```
>topTable(fit, number=10, adjust="fdr", sort.by="t")
```
| ID | Name | M-value | Moderated-t | B | Adj.P.val |
|---|---|---|---|---|---|
| Control | BMP2 | -2.205288 | -21.06952 | 7.960750 | 0.0003572816 |
| Control | BMP2 | -2.296045 | -20.28697 | 7.778330 | 0.0003572816 |
| Control | Dlx3 | -2.184900 | -20.01066 | 7.710959 | 0.0003572816 |
| Control | Dlx3 | -2.180471 | -19.63599 | 7.710959 | 0.0003572816 |
| fb94h06 | 20-L12 | 1.271119 | 14.08467 | 7.617005 | 0.0020666932 |
| fb40h07 | 7-D14 | 1.347207 | 13.52924 | 5.535983 | 0.0020666932 |
| fc22a09 | 27-E17 | 1.266129 | 13.41339 | 5.483567 | 0.0020666932 |
| fb85f09 | 18-G18 | 1.275686 | 13.39543 | 5.475386 | 0.0020666932 |
| fc10h09 | 24-H18 | 1.195126 | 13.23722 | 5.402676 | 0.0020666932 |
| fb85a01 | 18-E1 | -1.287128 | -13.07059 | 5.324819 | 0.0020666932 |

Table 6. Top 10 genes from the Swirl data.
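The effect of the dye-swap design vector c(-1, 1, -1, 1) can be checked for a single gene in base R. In the sketch below the four M-values are simulated with a hypothetical true effect of 1.5; the least-squares coefficient of a one-column linear model is then the average of the sign-corrected M-values, which is how the design averages over the dye-swapped arrays. This illustrates the principle only and is not limma's code.

```r
set.seed(5)
design <- c(-1, 1, -1, 1)   # sign flips on the dye-swapped arrays
true_effect <- 1.5          # hypothetical log2 ratio, mutant vs wild type

# Observed M-values for one gene: the sign follows the dye assignment.
M <- design * true_effect + rnorm(4, sd = 0.1)

# Least-squares estimate from a linear model without intercept.
ls_estimate <- unname(coef(lm(M ~ 0 + design)))

# Equivalent manual calculation: average of the sign-corrected M-values.
manual <- mean(design * M)

c(least_squares = ls_estimate, manual = manual)
```

Because each sample appears once in each channel within a dye-swap pair, gene-specific dye effects largely cancel in this average.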

The moderated t-statistic with adjusted p-values can identify differentially expressed genes. As seen in Table 6, it ranks at the top both copies of the knocked-out BMP2 gene and both copies of Dlx3, which is a known target of BMP

Quantification of Gene Expression Based on Microarray Experiment 605

Beranova-Giorgianni, S. (2003). Proteome analysis by two-dimensional gel electrophoresis

Beucher, S., & Meyer, F. (1993). The morphological approach to segmentation: the watershed

Bilban, M., Buehler, L.K., Head, S., Desoye, G., & Quaranta, V. (2002). Normalizing DNA

Bolstad, B.M., Irizarry, R.A., Astrand, M., & Speed, T.P. (2003). A comparison of

Breljak, D., Ambriović-Ristov, A., Kapitanović, S., Čačev, T., & Gabrilovac, J. (2005).

Cui, X., & Churchill, G.A. (2003). Statistical tests for differential expression in cDNA

Doyle, H.A., & Mamula, M.J. (2001). Post-translational protein modifications in antigen recognition and autoimmunity. *TRENDS in Immunology,* Vol. 22, No.8, pp. 443-449 Fodor, S.P.A., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T., & Solas, D. (1991) Light-directed, spatially addressable parallel chemical synthesis. *Science*, Vol.251, pp. 767–773 Frith, M.C., Pheasant, M., & Mattick, J.S. (2005). The amazing complexity of the human transcriptome. *European Journal of Human Genetics,* Vol.13, pp. 894–897 Gjerstad, Ø., Aakra, Å., Snipen, L., & Indahl, U. (2009). Probabilistically assisted spot

Gulmann, C., & O´Grady, A. (2003). Tissue microarray: an overview. *Current Diagnostic* 

Hall, D.A., Ptacek, J., & Snyder, M. (2007). Protein Microarray Technology. *Mech Ageing Dev,*

Hegde, P.S., White, I.R., & Debouck, C. (2003). Interplay of transcriptomics and proteomics.

Heid, C.A., Stevens, J., Livak, K.J., & Williams, P.M. (1996). Real Time Quantitative PCR.

Hirsch, J., Hansen, K.C., Burlingame, A.L., & Matthay, M.A. (2004). Proteomics: current

Istepanian, R.S.H., (2003). Microarray Image Processing: Current Status and Future

techniques and potential applications to lung disease. *Am J Physiol Lung Cell Mol* 

*Current Opinion in Biotechnology,* Vol.14, No.6, pp. 647–651

*Genome Research,* Vol.6, pp. 986-994, ISSN 1054-9803/96

Directions. *IEEE Trans. Nanobioscience,* Vol.2, No.4, pp. 173-175

segmentation-with application to DNA microarray images. *Chemometrics and* 

Buckley, M.J. (2000). Spot User's Guide, CSIRO Mathematical and Information Sciences,

Microarray Data. *Curr. Issues Mol. Biol.,* Vol.4, pp. 57-64

variance and bias. *Bioinformatics,* Vo.19, No.2, pp. 185-193

mRNA. *Food Technol. Biotechnol,* Vol.43, No.4, pp. 379–388

<http://www.cmis.csiro.au/iap/Spot/spotmanual.htm> Crick, F. (1970). Central dogma of molecular biology. *Nature,* Vol.227, pp. 561–563

microarray experiments. *Genome Biol,* Vol.4, pp.210–219

*Intelligent Laboratory Systems,* Vol.98, pp. 1–9

*Pathology,* Vol.9, pp. 149 -154

Vol.128, No.1, pp. 161–167

*Physiol,* Vol. 287, No.1, pp. L1–L23

Sydney, Australia, Available from:

Vol.22, No.5, pp. 273-281

481

and mass spectrometry: strengths and limitations. *Trends in Analytical Chemistry,*

transformation. *Processing of mathematical morphology in image,* NewYork, pp. 433–

normalization methods for high density oligonucleotide array data based on

Comparison of Three RT-PCR Based Methods for Relative Quantification of

#### **9. Conclusion**

Gene expression is a common process in all forms of living cells to generate the macromolecules which are necessary for life. Systemic comprehension of the cell function is provided using study of gene expression. Investigation of molecular dynamics of the cell can be performed in three biochemical levels, transcriptomics, proteomics, metabolomics. Compared to others, transcriptomics is a more robust, large-scale, moderate cost technology of simultaneously measuring thousands of mRNA level. There are various techniques for quantifying gene expression based on mRNA. However gene expression traditional techniques provide valuable biological information, they are limited in some ways such as scale, economy and sensitivity. Therefore, compared to the other commonly used techniques, quantification based on microarray is really remarkable because of being high throughput and cost effective. It enables the simultaneous analysis of thousands of genes within one single experiment. Such miniaturized binding technology is typically divided into DNA, protein, tissue, cellular and chemical compound microarrays. DNA microarrays are the most popular type of this technology which currently manufactured through two main approaches: in situ synthesis and deposition of pre-synthesized probes (spotted arrays). We focused mainly on the spotted cDNA microarray. After microarray experiment, slides are inserted into scanner. The output data are fluorescent images arranged into a matrix of spots. Then, images are processed to quantify level of gene expression based on the intensity of each spot and obtain background estimates and quality measures. It is performed in gridding, segmentation, quantification and spot quality assessment stages. The output data from image processing stage needs to be preprocessed to eliminate nonbiological variations, transform data into a suitable scale and improve the quality of downstream analysis. 
These are performed using background correction, logarithm transformation and normalization of microarray data. Finally, identification of genes that are significantly differentially expressed under different conditions can be carried out using marginal filters, wrappers, and embedded methods. We pointed some of the filter approaches such as t-test and its variants such as moderated t-test and SAM approach and analysis of variance (ANOVA). In summary, in order to analyze microarray data, the R statistical programming environment is chosen where the Bioconductor R packages such as limma and marray are effective in processing these microarray data. These packages address data input, production of diagnostic plots to detection of different biases, the statistical methods of removing experimental noises and errors on the spots within and between arrays. Finally, limma package is also used as a powerful tool to identification of differentially expressed genes.

#### **10. References**


Gene expression is a common process in all forms of living cells to generate the macromolecules which are necessary for life. Systemic comprehension of the cell function is provided using study of gene expression. Investigation of molecular dynamics of the cell can be performed in three biochemical levels, transcriptomics, proteomics, metabolomics. Compared to others, transcriptomics is a more robust, large-scale, moderate cost technology of simultaneously measuring thousands of mRNA level. There are various techniques for quantifying gene expression based on mRNA. However gene expression traditional techniques provide valuable biological information, they are limited in some ways such as scale, economy and sensitivity. Therefore, compared to the other commonly used techniques, quantification based on microarray is really remarkable because of being high throughput and cost effective. It enables the simultaneous analysis of thousands of genes within one single experiment. Such miniaturized binding technology is typically divided into DNA, protein, tissue, cellular and chemical compound microarrays. DNA microarrays are the most popular type of this technology which currently manufactured through two main approaches: in situ synthesis and deposition of pre-synthesized probes (spotted arrays). We focused mainly on the spotted cDNA microarray. After microarray experiment, slides are inserted into scanner. The output data are fluorescent images arranged into a matrix of spots. Then, images are processed to quantify level of gene expression based on the intensity of each spot and obtain background estimates and quality measures. It is performed in gridding, segmentation, quantification and spot quality assessment stages. The output data from image processing stage needs to be preprocessed to eliminate nonbiological variations, transform data into a suitable scale and improve the quality of downstream analysis. 
These are performed using background correction, logarithm transformation and normalization of microarray data. Finally, identification of genes that are significantly differentially expressed under different conditions can be carried out using marginal filters, wrappers, and embedded methods. We pointed some of the filter approaches such as t-test and its variants such as moderated t-test and SAM approach and analysis of variance (ANOVA). In summary, in order to analyze microarray data, the R statistical programming environment is chosen where the Bioconductor R packages such as limma and marray are effective in processing these microarray data. These packages address data input, production of diagnostic plots to detection of different biases, the statistical methods of removing experimental noises and errors on the spots within and between arrays. Finally, limma package is also used as a powerful tool to identification of

Adams, R., & Bischof, L. (1994). Seeded region growing. *IEEE transactions on pattern analysis* 

Angenendt, P. (2005). Progress in protein and antibody microarray technology. *DDT.,*

Barrett, J.C., & Kawasaki, E.S. (2003). Microarrays: the use of oligonucleotides and cDNA for

the analysis of gene expression. *DDT.*, Vol.8, No.3, pp. 134-141

*and machine intelligence,* Vol.16, No.6, pp. 641–647

**9. Conclusion** 

differentially expressed genes.

Vol.10, No.7, pp. 503-511


### **On-Chip Living-Cell Microarrays for Network Biology**

Ronnie Willaert and Hichem Sahli *Vrije Universiteit Brussel, Belgium* 

#### **1. Introduction**

The recently developed field of systems biology creates a new framework for understanding the molecular basis of physiological or pathophysiological states of cells. Screening modalities that can be used on single cells are needed to study cellular systems biology. The recent development of cellular microarrays has provided a method for the complex molecular analysis of living, single cells (Chen & Davies, 2006). Unlike other high-throughput systems, such as gene expression profiling microarrays or protein microarrays, cellular microarrays use a printed pattern of geographically distinct spots to probe living cells, rather than cell lysates or other non-viable samples. Among the most powerful tools to assay gene function on a genome-wide scale in the physiological context of intact living cells are fluorescence microscopy and related imaging techniques (Pepperkok & Ellenberg, 2006). To enable these techniques to be applied to functional genomics experiments, fluorescence microscopy is making the transition to a quantitative and high-throughput technology.

The combination of time-lapse microscopy, quantitative image analysis and fluorescent protein reporters has enabled observation of multiple cellular components over time in individual cells (Locke & Elowitz, 2009). In conjunction with mathematical modelling, these techniques are now providing powerful insights into genetic and proteomic behaviour in diverse microbial systems. Recently, a quantitative system-wide analysis of mRNA and protein expression in individual cells with single-molecule sensitivity using a yellow fluorescent protein fusion library for *E. coli* has been realised (Taniguchi *et al.*, 2010).

#### **2. Microfluidic chips for cell microarrays**

#### **2.1 Cell assays and cell microarrays**

A cell assay is defined here as a measurement and analysis of the cellular response, at a given level, to a chemical and/or physical stimulus (Barbulovic-Nad & Wheeler, 2008). Cellular responses are diverse, e.g. alterations of intracellular and extracellular biochemistry, cell morphology, motility and (de)adhesion, survival and apoptosis, and proliferation properties. These responses characterise single aspects of cell phenotype, and are typically monitored in a culture dish or a multiwell plate, while more recently microfluidic devices have been employed. While culture dishes require millilitre volumes of media and reagents, multiwell plates contain microlitre volumes and enable simultaneous analysis of multiple cell types or stimuli. Experiments with multiwell plates are typically integrated in a robotic analysis platform. Two major drawbacks of robotic platforms are the expense of the instrumentation and the cost of experimental consumables.

| Cell type | Batch/microfluidics | Array size (wells, spots) | Characteristics | References |
|---|---|---|---|---|
| Carcinoma cells | Microfluidics | 1 | Cells immobilised with peptide gel | Kim *et al*., 2007 |
| Fibroblast | Microfluidics | 4 | Ligand labeling and cell binding analysis | Sui *et al*., 2007 |
| Fibroblast 3T3 | Microfluidics | 16 | Perfusion culture for 3 days | Kim *et al*., 2006 |
| Fibroblast 3T3 | Microfluidics | 32 | Quantitative interrogation of signalling networks | Cheong *et al*., 2009 |
| Fibroblast 3T3 | Microfluidics | 96 | Response of single cells to different concentrations of the signalling molecule TNF-α | Tay *et al*., 2010 |
| H 35 cells | Microfluidics | 64/100 | 8x8 array with individually addressable rows/10x10 array | Hung *et al*., 2005; Lee *et al*., 2006 |
| Hela-NF | Microfluidics | 40 | 8x5 array: each row of 5 wells is individually addressed; GFP-based gene expression | Thompson *et al*., 2004 |
| Hela-NF | Microfluidics | 256 | 16x16 array; GFP-based gene expression | Wieder *et al*., 2005 |
| Hepatocytes | Batch microarray | 512 | Dynamic monitoring of fluorescent probes in single cells | Roach *et al*., 2009 |
| Human neural SC | Microfluidics | 1 | Growth and differentiation | Chung *et al*., 2005 |
| Human stem cells | Microfluidics | 96 | Transient stimulation schedules on proliferation, differentiation and motility | Gómez-Sjöberg *et al*., 2007 |
| Human stem cells | Batch microarray | 1700 | Interaction of biomaterials with cells | Anderson *et al*., 2004 |
| mESC | Microfluidics | 16 | Proliferation is flow rate dependent, 4 days culture | Chin *et al*., 2004 |
| mESC | Batch microarray | 280 | Cells immobilised in alginate gel spots | Fernandes *et al*., 2010 |
| Rat stem cells | Batch microarray | 10000 | Dimensions of the wells are tunable, diameter: 20 to >500 µm, height: 10-500 µm | Chin *et al*., 2004 |

Table 1. Examples of mammalian and stem cell microarrays/wells.

The use of microarrays was first reported in 1989 (Ekins *et al*., 1989). The variety and diversity of microarrays has become impressive. Three main types of microarrays have been developed: DNA microarrays, protein microarrays, and cell microarrays (Barbulovic-Nad *et al.*, 2006). Several different approaches to cell microarrays have been explored to investigate gene expression, cell-surface interactions, extracellular matrix composition, cell migration and proliferation, the effects of drugs on cellular activity and many other areas (Angres, 2005). There are two fundamental methods to produce cell microarrays: the indirect and the direct method. The indirect method – i.e. the "reverse transfection" method – was developed by Ziauddin and Sabbatini (2001). In the direct method, the cells are printed onto a substrate. In a few cases contact-based microarrayers are used, but more often non-contact-based devices are used.

Miniaturisation of cellular assays via cell microarrays increases assay throughput while reducing reagent consumption and the number of cells required, making these systems attractive for a wide range of assays in drug discovery, toxicology, and stem cell research (Fernandes *et al*., 2009). Cell microarrays have been developed for highly parallel, high-throughput analyses of cell phenotypes (Narayanaswamy *et al*., 2006), assessing cell proliferation and morphology (Bochner *et al*., 2001; Xu, 2002), protein expression levels (Schwenk *et al*., 2002), and imaging of tissues (Kononen *et al.*, 1998; Radhakrishnan *et al.,* 2008) and single cells (Biran *et al*., 2003). In these initially developed living-cell microarrays, microbial cells were printed on an agar growth medium and could grow as microcolonies, or cells were grown in multiwell plates and printed on a glass slide for imaging, or only short-term analyses on living cells were performed. High-throughput experiments on a library of cells require on-chip cell culture. Microchip 2- or 3-D cell cultivation techniques can provide many advantages for cell culture systems because the scale of the cultivated environment inside the microchip is fitted to the size of the cells. Table 1 gives some examples of developed mammalian cell microarrays/wells.

#### **2.2 Cell assays in microfluidics**

Microfabrication technology originated from the electronics industry, where 3D microfeatures for electronic devices were manufactured in the sub-centimeter to sub-micrometer range using lithography techniques (Franssila, 2010). Microfluidics emerged as an extension of MEMS (Micro Electro Mechanical Systems) technology at the beginning of the 1980s. Microfluidics is a technology that is characterised by devices containing networks of micrometer-dimension channels (Whitesides, 2006). It involves the manipulation of very small fluid volumes, enabling the creation and control of µl to nl volume reactors. Microsystems create new opportunities for the spatial and temporal control of cell proliferation and stimuli by combining surfaces that mimic complex biochemistries and geometries of extracellular matrix with microfluidic channels that regulate transport of fluids and soluble factors (West *et al*., 2008). Further integration with bioanalytic microsystems results in multifunctional platforms for basic biological insights into cells and tissues, as well as for cell-based sensors with biochemical, biomedical and environmental functions. Highly integrated microdevices show great promise for basic biomedical and pharmaceutical research, and for drug discovery (Dittrich & Manz, 2006). Microfluidic "lab on a chip" technologies have been used to track gene expression changes in individual cells, enabling large populations of cells to be monitored, and allowing precise control of the cell microenvironment (Breslauer *et al*., 2006; Charvin *et al*., 2009).
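To give a feel for the scales implied by micrometer-dimension channels and nl-volume reactors, the following minimal sketch estimates the volume of a channel segment, the Reynolds number of the flow in it, and the characteristic time for a molecule to diffuse across it. The channel dimensions, flow velocity and diffusion coefficient are illustrative assumptions and are not taken from any of the cited devices; the fluid is assumed to be water at room temperature, and the function names are hypothetical.

```python
def channel_volume_nl(width_um, height_um, length_um):
    """Volume of a rectangular channel segment in nanolitres (1 nl = 1e6 um^3)."""
    return width_um * height_um * length_um / 1e6


def reynolds_number(width_um, height_um, velocity_um_s,
                    density=1000.0, viscosity=1.0e-3):
    """Re = rho * v * D_h / mu for a rectangular channel.

    D_h = 2wh / (w + h) is the hydraulic diameter. Density (kg/m^3) and
    viscosity (Pa s) default to approximate values for water (assumption).
    """
    w = width_um * 1e-6          # convert to metres
    h = height_um * 1e-6
    d_h = 2.0 * w * h / (w + h)
    return density * (velocity_um_s * 1e-6) * d_h / viscosity


def diffusion_time_s(distance_um, diff_coeff_um2_s):
    """Characteristic one-dimensional diffusion time t = x^2 / (2 D)."""
    return distance_um ** 2 / (2.0 * diff_coeff_um2_s)


if __name__ == "__main__":
    # Assumed geometry: 100 um x 100 um cross-section, 1 mm long, 1 mm/s flow.
    print("volume (nl):", channel_volume_nl(100, 100, 1000))
    print("Reynolds number:", reynolds_number(100, 100, 1000))
    # Assumed diffusion coefficient of a small molecule in water: 500 um^2/s.
    print("diffusion time across 100 um (s):", diffusion_time_s(100, 500))
    print("diffusion time across 1 cm (s):", diffusion_time_s(10_000, 500))
```

Under these assumptions the Reynolds number is far below the range in which turbulence is expected, so the flow is laminar, and diffusion across the channel width takes seconds whereas diffusion across centimetre distances would take on the order of a day.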

Conventional methods of fabricating microfluidic devices have centered on etching in glass and silicon (Pisani & Tadigadapa, 2010). Polymers have assumed the leading role as substrate materials for microfluidic devices in recent years (Becker & Gärtner, 2008). They offer a broad range of material parameters as well as material and surface chemical properties, which

On-Chip Living-Cell Microarrays for Network Biology 613

cells must be detached from culture flasks and seeded or spotted into a microfluidic device while sufficient time has to be allowed to achieve proper cell attachment and reduction of stress induced by the transfer. Mobile cells in suspension are easier to handle and require

A fundamental goal of cell biology is identifying how cell behaviour arises from the dynamic collection of environmental stimuli to which the cell is exposed (Lee & Di Carlo, 2009). From a biosystems science and engineering perspective, there is great interest in how the cell behaves as a system that processes time-dependent input signals into output behaviour(s). Ideally, with knowledge of the history of the ensemble of environmental stimuli, one would be able to predict the precise behaviour that a particular cell would exhibit under a given stimulus. Unfortunately, cells under seemingly identical environmental conditions often display a distribution of heterogeneous behaviour(s) (Lidstrom & Meldrum, 2003). This appears to be partly due to probabilistic behaviour in the "decision" processes that connect input and output (Raser & O'Shea, 2005; Mettetal *et al*., 2006). Underlying the links between inputs and outputs are systems of interconnected molecular interactions (signalling pathways). Signalling within one pathway as well as cross-signalling between pathways, localisation of reactions and the sometimes small molecule numbers involved in signalling contribute to stochastic behaviour in these systems (Raser & O'Shea, 2005; Kholodenko *et al*., 2010), which in the case of stem cells may very well be an essential and necessary feature of their biology and enables them to transit from one state to another. Because of the meanwhile well-documented heterogeneity within such cell population, increased emphasis has been put on analysing a large number of single cells and determining distributions of responses (Cai *et al*., 2006; Mettetal *et al*., 2006; Yu *et al.*, 2006). New tools, based on microfabrication and microfluidic technologies, are now allowing improved dynamic control of environmental variables for high-throughput single-cell analysis. 
These experimental technologies combined with systems analysis of signalling pathways are expected to lead to an improved

less time to adapt to the new environment.

**2.3 Single-cell analysis/monitoring in microfluidic devices** 

quantitative description of single-cell function (Lee & Di Carlo, 2009).

observe the response of the cells, e.g. to drug candidate molecules.

under various growth conditions or in response to environmental insults.

Several single-cell analysis techniques have been developed, which may be classified in terms of information content (number of elements capable of being studied simultaneously) and throughput (number of cells studied in a give time). The simplest and most widely used forms of single-cell analysis are fluorescence microscopy and flow cytometry. Automated microscopy techniques, often termed high-content screening (HCS) or "cellomics", recently provided also quantitative insight into cellular behaviour and in most cases are applied to

The utility of single-cell measurements with high temporal resolution has been demonstrated by bacterial studies, which used optical microscopy to observe *Escherichia coli* over long time periods and reveal interesting temporal fluctuations and cell-to-cell variability that would otherwise be masked by population-wide measurements (Pedraza & van Oudenaarden, 2005). A microfluidic microchemostat has been constructed and used to acquire single-cell fluorescence data from *Saccharomyces cerevisiae* over many cellular generations (Charvin *et al*., 2009; Rowat *et al.*, 2009). One way in which cells can rapidly respond to environmental stimuli is to alter the localisation and abundance of proteins (Charvin *et al*., 2009). In a microfluidic device, these aspects can be studied on the same cells

enable microscopic design features that cannot be realised by any other class of materials. Today, the most preferred material for biocompatible microfluidic devices is poly(dimethylsiloxane) (PDMS) (Velve-Casquillas *et al*., 2010), which was introduced as soft lithography by Whitesides (Anderson *et al*., 2000). PDMS is soft, transparant, permeable to gasses, for most, impermeable to liquids, biocompatible, nontoxic, and has a low electrical conductivity, making it a very suitable material for biological applications in microfluidic devices. Fabrication of microfluidic devices in PDMS by soft lithography provides faster, less expensive routes than these conventional methods to devices that handle aqueous solutions. Soft lithography refers to a collection of techniques for creating microstructures and nanostructures based on printing, moulding and embossing (Weibel *et al*., 2007). It is based on rapid prototyping and replica molding. In rapid prototyping, a computer-aided design program is used to create a design for channels, which are printed at high resolution onto transparency film. The transparency film then serves as the photomask. The master molds are generated by using the photomask in contact lithography to produce a positive relief of photoresist. In replica molding, PDMS is poured over the master and heat cured to generate a negative replica of the master. The PDMS is then removed from the mold and sealed against a glass coverslip to form the device features and channels. Flows in microfluidic devices are mainly pressure-driven by using syringe pump, rotary pump, or electro-osmotic flow.

Microfluidic devices are advantageous for cell assays for various reasons. The most obvious one is the similarity in dimensions of cells and microchannels (10-100 µm widths and depths). Another important advantage is flow: fluid flow in these small channels is laminar. Consequently, convection only exists in the direction of the applied flow, whereas in the direction perpendicular to the applied flow, diffusion contributes to mass transport. Although diffusion-based transport is slow across long distances, in microchannels diffusion enables rapid reagent delivery. In addition, the combination of laminar flow and diffusion makes the formation of highly resolved chemical gradients across small distances. This feature is particular useful for cell assays as such gradients are common in living systems (but difficult to implement in macroscale setups). Another advantage is the increased surface-to-volume ratio, which facilitates favourable scaling of heat and mass transfer, as well as favourable scaling of electrical and magnetic fields that are used in electromagnetic cell analysis. Another consequence of the size regime lies in the concentration of analytes: as cells in microchannels are confined in sub-microliter volumes, relevant analytes do not become too dilute and can thus be more readily detected. A limitation of the high surface-tovolume ratio of microchannels is the adsorption of molecules onto channel walls that are generally hydrophobic. However, surfaces can be chemically treated to prevent adsorption of biomolecules (Velve-Casquillas *et al*., 2010). Automated high-throughput experiments may be performed in a large number of repeating functional microstructures fabricated on a single chip. These microsystems can also monitor the time course of the release, which is difficult to measure by conventional batch cell culture methods. 
Microfluidic devices can be made transparant and the cells monitored in real time by imaging, using fluorescence markers to probe cell functions and cell fate.

In a microfluidic device for cell-based assays, adequate culture conditions must be maintained for the duration of the experiment, which can span several days. While being cultured, cells must be continuously perfused with nutrients and oxygen; in addition, constant temperature and pH must be maintained. In contrast to traditional batch cultures, miniaturised perfusion systems provide precise control of medium composition, long-term unattented cultures and tissue-like structuring of cultures (Heiskanen *et al*., 2010). Adherent

enable microscopic design features that cannot be realised by any other class of materials. Today, the most preferred material for biocompatible microfluidic devices is poly(dimethylsiloxane) (PDMS) (Velve-Casquillas *et al*., 2010), which was introduced as soft lithography by Whitesides (Anderson *et al*., 2000). PDMS is soft, transparant, permeable to gasses, for most, impermeable to liquids, biocompatible, nontoxic, and has a low electrical conductivity, making it a very suitable material for biological applications in microfluidic devices. Fabrication of microfluidic devices in PDMS by soft lithography provides faster, less expensive routes than these conventional methods to devices that handle aqueous solutions. Soft lithography refers to a collection of techniques for creating microstructures and nanostructures based on printing, moulding and embossing (Weibel *et al*., 2007). It is based on rapid prototyping and replica molding. In rapid prototyping, a computer-aided design program is used to create a design for channels, which are printed at high resolution onto transparency film. The transparency film then serves as the photomask. The master molds are generated by using the photomask in contact lithography to produce a positive relief of photoresist. In replica molding, PDMS is poured over the master and heat cured to generate a negative replica of the master. The PDMS is then removed from the mold and sealed against a glass coverslip to form the device features and channels. Flows in microfluidic devices are mainly pressure-driven by using syringe pump, rotary pump, or electro-osmotic flow.

Microfluidic devices are advantageous for cell assays for various reasons. The most obvious one is the similarity in dimensions of cells and microchannels (10-100 µm widths and depths). Another important advantage is flow: fluid flow in these small channels is laminar. Consequently, convection occurs only in the direction of the applied flow, whereas perpendicular to the applied flow mass transport relies on diffusion. Although diffusion-based transport is slow across long distances, in microchannels diffusion enables rapid reagent delivery. In addition, the combination of laminar flow and diffusion enables the formation of highly resolved chemical gradients across small distances. This feature is particularly useful for cell assays, as such gradients are common in living systems (but difficult to implement in macroscale setups). Another advantage is the increased surface-to-volume ratio, which facilitates favourable scaling of heat and mass transfer, as well as favourable scaling of electrical and magnetic fields that are used in electromagnetic cell analysis. Another consequence of the size regime lies in the concentration of analytes: as cells in microchannels are confined in sub-microliter volumes, relevant analytes do not become too dilute and can thus be more readily detected. A limitation of the high surface-to-volume ratio of microchannels is the adsorption of molecules onto channel walls, which are generally hydrophobic. However, surfaces can be chemically treated to prevent adsorption of biomolecules (Velve-Casquillas *et al*., 2010). Automated high-throughput experiments may be performed in a large number of repeating functional microstructures fabricated on a single chip. These microsystems can also monitor the time course of the release, which is difficult to measure by conventional batch cell culture methods. Microfluidic devices can be made transparent and the cells monitored in real time by imaging, using fluorescence markers to probe cell functions and cell fate.

In a microfluidic device for cell-based assays, adequate culture conditions must be maintained for the duration of the experiment, which can span several days. While being cultured, cells must be continuously perfused with nutrients and oxygen; in addition, constant temperature and pH must be maintained. In contrast to traditional batch cultures, miniaturised perfusion systems provide precise control of medium composition, long-term unattended cultures and tissue-like structuring of cultures (Heiskanen *et al*., 2010). Adherent cells must be detached from culture flasks and seeded or spotted into a microfluidic device, and sufficient time has to be allowed for proper cell attachment and for reduction of the stress induced by the transfer. Mobile cells in suspension are easier to handle and require less time to adapt to the new environment.
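The statement above that diffusion is rapid across microchannel dimensions but slow over long distances can be made concrete with the characteristic one-dimensional diffusion time t = x²/(2D). The sketch below is illustrative only: the diffusivity of 10⁻⁹ m²/s is an assumed order-of-magnitude value for a small molecule in water, not a figure taken from this chapter, and `diffusion_time` is a hypothetical helper name.

```python
def diffusion_time(distance_m, diffusivity_m2_s):
    """Characteristic 1D diffusion time t = x^2 / (2 D), in seconds."""
    return distance_m ** 2 / (2.0 * diffusivity_m2_s)

# Assumed diffusivity of a small molecule in water (order of magnitude only).
D_SMALL_MOLECULE = 1e-9  # m^2/s

channel = diffusion_time(100e-6, D_SMALL_MOLECULE)  # across a 100 µm channel
dish = diffusion_time(1e-2, D_SMALL_MOLECULE)       # across 1 cm

print(f"100 um: {channel:.1f} s")
print(f"1 cm:   {dish / 3600:.1f} h")
```

Under these assumptions the molecule crosses a 100 µm channel in seconds, whereas the same process over 1 cm takes many hours, because the time grows with the square of the distance.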

#### **2.3 Single-cell analysis/monitoring in microfluidic devices**

A fundamental goal of cell biology is identifying how cell behaviour arises from the dynamic collection of environmental stimuli to which the cell is exposed (Lee & Di Carlo, 2009). From a biosystems science and engineering perspective, there is great interest in how the cell behaves as a system that processes time-dependent input signals into output behaviour(s). Ideally, with knowledge of the history of the ensemble of environmental stimuli, one would be able to predict the precise behaviour that a particular cell would exhibit under a given stimulus. Unfortunately, cells under seemingly identical environmental conditions often display a distribution of heterogeneous behaviour(s) (Lidstrom & Meldrum, 2003). This appears to be partly due to probabilistic behaviour in the "decision" processes that connect input and output (Raser & O'Shea, 2005; Mettetal *et al*., 2006). Underlying the links between inputs and outputs are systems of interconnected molecular interactions (signalling pathways). Signalling within one pathway as well as cross-signalling between pathways, localisation of reactions and the sometimes small molecule numbers involved in signalling contribute to stochastic behaviour in these systems (Raser & O'Shea, 2005; Kholodenko *et al*., 2010). In the case of stem cells, this stochastic behaviour may very well be an essential feature of their biology that enables them to transit from one state to another. Because heterogeneity within such cell populations is now well documented, increased emphasis has been put on analysing a large number of single cells and determining distributions of responses (Cai *et al*., 2006; Mettetal *et al*., 2006; Yu *et al.*, 2006). New tools, based on microfabrication and microfluidic technologies, are now allowing improved dynamic control of environmental variables for high-throughput single-cell analysis. These experimental technologies combined with systems analysis of signalling pathways are expected to lead to an improved quantitative description of single-cell function (Lee & Di Carlo, 2009).

Several single-cell analysis techniques have been developed, which may be classified in terms of information content (number of elements capable of being studied simultaneously) and throughput (number of cells studied in a given time). The simplest and most widely used forms of single-cell analysis are fluorescence microscopy and flow cytometry. Automated microscopy techniques, often termed high-content screening (HCS) or "cellomics", have recently also provided quantitative insight into cellular behaviour and in most cases are applied to observe the response of cells, e.g. to drug candidate molecules.

The utility of single-cell measurements with high temporal resolution has been demonstrated by bacterial studies, which used optical microscopy to observe *Escherichia coli* over long time periods and reveal interesting temporal fluctuations and cell-to-cell variability that would otherwise be masked by population-wide measurements (Pedraza & van Oudenaarden, 2005). A microfluidic microchemostat has been constructed and used to acquire single-cell fluorescence data from *Saccharomyces cerevisiae* over many cellular generations (Charvin *et al*., 2009; Rowat *et al.*, 2009). One way in which cells can rapidly respond to environmental stimuli is to alter the localisation and abundance of proteins (Charvin *et al*., 2009). In a microfluidic device, these aspects can be studied on the same cells under various growth conditions or in response to environmental insults.

#### **2.4 Localisomics**

Localisomics seeks to identify the subcellular location of all proteins in the cell, which can provide key insights into the cellular function of the individual proteins as well as their probable interacting partners (Joyce & Palsson, 2006). Protein localisation has to be described in intracellular compartments, e.g. the nucleus or cytoplasm, and also in organelles, as specialisation of cellular organelles defines the functional roles of proteins (Souchennytskyi, 2005). The most informative data are those on protein localisation and its dynamics in a single, living cell.

Fluorescence microscopy techniques have mostly been used to monitor green fluorescent protein (GFP)-tagged or yellow fluorescent protein (YFP)-tagged proteins in *E. coli* (Taniguchi *et al.*, 2010), *S. cerevisiae* (Huh *et al*., 2003) and human cells (Shariff *et al*., 2010). Visual interpretation of the fluorescent images and, more recently, automated image analysis have been used to extract dynamic protein localisation data (Schubert *et al*., 2006; Conrad *et al*., 2011). Images from many studies are publicly available (Table 2).


| Species (cell type) | Number of proteins | Tagging method | Website | Reference |
|---|---|---|---|---|
| Human (U-2 OS, A-431, U-251 MG) | >6000 | Immunofluorescence, immunochemistry | www.proteinatlas.org | Berglund *et al*., 2008 |
| Human (HeLa) | | | | |
| Human (HeLa) | | | | |
| Human (H1299 carcinoma) | >2000 | YFP CD tagging | www.dynamicproteomics.net | Frenkel-Morgenstern *et al*., 2010 |
| Human (brain) | | | | |
| Mouse (3T3) | >100 | Immunofluorescence and genomic internal GFP fusion | murphylab.web.cmu.edu | Huang *et al*., 2002 |
| Mouse (3T3) | >100 | Internal GFP fusion | cdtag.bio.cmu.edu | Jarvik *et al*., 2002 |
| Monkey (Vero) | >1000 | cDNA terminal GFP fusion | gfp-cdna.embl.de | Liebel *et al.*, 2004 |
| Yeast | >4000 | cDNA C-terminal GFP fusion | yeastgfp.yeastgenome.org | Huh *et al*., 2003 |
| Various | Various | Various | ccdb.ucsd.edu | Martone *et al*., 2008 |

Table 2. Publicly available microscopy images concerning protein localisation in cells (adapted from Newberg *et al*., 2009).

### **3. Computational methods for quantitative image analysis**

Obtaining quantitative information from live cell microscopy requires image analysis methods. These methods can provide quantified geometric, intensity, and motion properties, which can be used as input parameters for predictive systems biology models (Pepperkok & Ellenberg, 2006; Megason & Fraser, 2007; Bakal *et al*., 2007; Verveer & Bastiaens, 2008).

Advances in imaging technology provide a huge amount of digital image data, which makes manual analysis hardly possible. Additionally, 3D images over time are difficult to interpret manually and the result suffers from subjectivity. Therefore, computer-based image analysis is required to cope with the enormous amount of image data and to extract reproducible as well as quantitative information (Peng, 2008; Zhou & Wong, 2008; Hamilton, 2009; Swedlow *et al*., 2009; Rohr *et al*., 2010).

Automatic analysis of multidimensional live cell microscopy images requires different computational methods. A general workflow for quantitative analysis of live cell microscopy images is composed of the following steps: (i) preprocessing, (ii) segmentation, (iii) registration, (iv) tracking and (v) classification (Rohr *et al*., 2010).

#### **3.1 Preprocessing**

The goal of image preprocessing is to improve the quality of raw images prior to image segmentation and feature extraction. Applications include denoising for reducing the image noise, elimination of artifacts, intensity normalisation, contrast enhancement, and deconvolution for reducing the image blur introduced by the imaging process. Denoising methods use either linear or nonlinear filters to reduce noise in images and improve the signal-to-noise ratio. Linear filters are based on convolution; for denoising images, a Gaussian filter is often applied (Rohr *et al*., 2010). Filters that are not based on convolution are called nonlinear filters. A nonlinear filter that is often used to remove the pepper noise generated by CCD detectors in optical fluorescence microscopy is the median filter (Zhou & Wong, 2008). The median filter can preserve the high-frequency information that describes cell edges in high-content microscopy images.
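As an illustration of the edge-preserving behaviour of the median filter described above, the following is a minimal pure-Python sketch on a toy image stored as a list of lists. The function name `median_filter`, the 3×3 window, and the border handling (using only in-image neighbours) are choices made for this example, not details taken from the cited work.

```python
from statistics import median

def median_filter(image, radius=1):
    """Return a median-filtered copy of a 2D list of intensities.

    Border pixels use only the neighbours that fall inside the image.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = median(window)
    return out

# Toy image: a vertical step edge (10 | 50) with one isolated dark outlier.
noisy = [
    [10, 10, 50, 50],
    [10,  0, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]
filtered = median_filter(noisy)
for row in filtered:
    print(row)
```

In this toy case the outlier is replaced by the local median while the step between the two intensity levels stays sharp, which a Gaussian filter of the same size would blur.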

Deconvolution methods to reduce the image blur are relevant for both wide-field and confocal light microscopes (Cannell *et al*., 2006). It is often assumed that the blurring of an image is caused by a linear process and thus can be represented by convolution with a point spread function (PSF). The aim of deconvolution is to reconstruct the original (true) image by reversing the effect of convolution, thus improving the resolution and contrast of the image (Rohr *et al*., 2010). Examples of such approaches are the inverse filter, the Wiener filter, and the constrained least-squares filter.
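The linear blurring model and the Wiener filter mentioned above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the cited references: $f$ is the true image, $h$ the PSF, $n$ additive noise, $g$ the recorded image, capital letters denote Fourier transforms, and $K$ is a constant approximating the noise-to-signal power ratio.

$$
g(x, y) = (h * f)(x, y) + n(x, y)
$$

$$
\hat{F}(u, v) = \frac{H^{*}(u, v)}{\lvert H(u, v) \rvert^{2} + K}\, G(u, v)
$$

Setting $K = 0$ reduces the expression to the inverse filter $\hat{F} = G / H$, which amplifies noise wherever $\lvert H \rvert$ is small; a positive $K$ damps those frequencies.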

#### **3.2 Segmentation**

Image segmentation is one of the most basic processing steps in many bioimage informatics applications. The goal is to segment out meaningful objects of interest in the respective image. In the case of microscopy images, one main task is to identify cells and to distinguish them from the background. Another task is to detect and localise particles in the image. Because particles are much smaller than cells and correspond to spot-like image structures, different segmentation methods are required for cells and particles (Rohr *et al*., 2010). Segmentation is a prerequisite for quantifying geometric properties of objects as well as for quantifying the corresponding signal intensities. Additionally, segmentation is often the basis for subsequent image analysis steps, e.g. for tracking.

#### **3.2.1 Cell segmentation**

Cell segmentation can be categorised into two classes, i.e. nuclear segmentation and cytoplasm (or whole cell) segmentation. In recent years, significant effort has gone into the development of automated methods for cell nuclei image and 3D cell segmentation (Ortiz de Solorzano *et al*., 1999; Sarti *et al*., 2000; De Solorzano *et al*., 2001; Umesh Adiga & Chaudhuri, 2001; Malpica *et al*., 1997; Belien *et al*., 2002; Wählby *et al*., 2004; Lin *et al*., 2005; Lindblad *et al*., 2004; Dufour *et al*., 2005; Li *et al*., 2007, 2008; Dorn *et al*., 2008; Ko *et al*., 2009). The main methods for cell segmentation can be classified as: threshold-based segmentation, edge-based segmentation, region-based segmentation, and deformable models (reviewed in Rohr *et al*., 2010).
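Of the method classes listed above, threshold-based segmentation is the simplest to sketch. The example below is an assumption-laden illustration rather than any of the cited methods: it applies a fixed, hand-chosen threshold to a toy image and then labels 4-connected foreground regions with a breadth-first flood fill, so that each region can be measured separately. The name `label_components` is hypothetical.

```python
from collections import deque

def label_components(image, threshold):
    """Threshold a 2D list and label 4-connected foreground regions.

    Returns (labels, count); background pixels are labelled 0.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue  # background pixel, or already part of a region
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# Toy image with two bright objects on a dark background.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0, 0],
    [0, 9, 9, 0, 7, 0],
    [0, 0, 0, 0, 8, 0],
]
labels, n_cells = label_components(image, threshold=5)
areas = [sum(row.count(k) for row in labels) for k in range(1, n_cells + 1)]
print(n_cells, areas)
```

The per-object pixel counts in `areas` are an example of the geometric properties that segmentation makes quantifiable. Touching cells would be merged into one region by this approach, which is one reason the more elaborate method classes exist.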

#### **3.2.2 Particle localisation**

Often it is assumed that the intensities representing a fluorescently labelled particle resemble a 2D Gaussian function in which the peak intensity value of the particle differs significantly from that of the local background. A bottom-up or a top-down strategy is used to address the problem of particle localisation.

Bottom-up localisation schemes for fluorescent particles typically comprise three consecutive steps: image preprocessing, particle detection, and particle localisation (Rohr *et al*., 2010). A common technique is to apply a threshold on the intensities of a (preprocessed) image to determine image regions that correspond to particles (Ponti *et al*., 2003; Sbalzarini & Koumoutakos, 2005). Automatic schemes for determining an optimal threshold are required since manual determination is often impractical and can give inconsistent results.
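One widely used automatic scheme for choosing a threshold is Otsu's method, which picks the intensity that maximises the variance between the background and foreground classes. It is shown here only as an example of such a scheme; the text above does not state which method the cited studies used. The sketch assumes integer pixel values in the range 0 to `levels - 1`.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximises between-class variance.

    Pixels with value <= t are treated as background.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # all pixels are background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

# Toy intensities: dim background pixels and a few bright particle pixels.
pixels = [10, 12, 11, 13, 12, 200, 205, 198, 202]
t = otsu_threshold(pixels)
particle_pixels = [p for p in pixels if p > t]
print(t, particle_pixels)
```

Because the threshold is computed from the image histogram itself, the same code gives a consistent result across images and operators, which is the practical advantage over manual selection.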

Top-down approaches use model-driven strategies in which hypotheses regarding the possible configuration of the models are tested against the information found in the images. A 2D Gaussian function is typically used as a model for the shape and appearance of fluorescently labelled particles (Godinez *et al*., 2007; Cortes & Amit, 2008).
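The 2D Gaussian appearance model referred to above is commonly written as below. The symbols are introduced here for illustration: $A$ is the peak amplitude above background, $(x_0, y_0)$ the particle position, $\sigma$ the width of the spot, and $b$ the local background intensity.

$$
I(x, y) = A \exp\!\left( -\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2} \right) + b
$$

Fitting this model to the pixel intensities around a candidate spot yields an estimate of $(x_0, y_0)$ that is not restricted to whole-pixel positions.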

#### **3.3 Registration**

The task of finding an optimal geometric transformation between corresponding image data is known as registration. Bioimage registration is essential in many applications that need to compare multiple image subjects of different conditions. Registration approaches can be classified based on the type of transformation model and image information used (Rohr *et al.*, 2010). The transformation model defines the degrees of freedom for geometrically matching two images, and a main distinction is made between rigid, affine, and nonrigid schemes.
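To make the notion of degrees of freedom concrete: a 2D rigid transformation has three (one rotation angle and two translations), whereas a 2D affine transformation has six. The sketch below estimates a rigid transformation from corresponding landmark points by least squares. It assumes the correspondences are already known, which is a simplification, and the function names are hypothetical.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (x, y), (u, v) in zip(src, dst):
        x, y, u, v = x - sx, y - sy, u - dx, v - dy  # centre both point sets
        dot += x * u + y * v
        cross += x * v - y * u
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    tx = dx - (cos_t * sx - sin_t * sy)
    ty = dy - (sin_t * sx + cos_t * sy)
    return theta, (tx, ty)

def apply_rigid(points, theta, t):
    """Rotate points by theta (radians) about the origin, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]

# Synthetic test: rotate four landmarks by 30 degrees and shift them.
src = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
dst = apply_rigid(src, math.radians(30), (5.0, -3.0))
theta, shift = fit_rigid_2d(src, dst)
print(round(math.degrees(theta), 3), [round(v, 3) for v in shift])
```

A rigid model of this kind cannot follow changes in cell shape, since it only rotates and shifts the image; nonrigid schemes add further degrees of freedom for that purpose.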

Many of the 2D and 3D image registration methods proposed for medical image analysis, such as the mutual information registration (Viola & Wells, 1997), spline-based elastic registration (Rohr *et al*., 2003), invariant moment feature-based registration (Shen & Davatzikos, 2002), congealing registration (Miller, 2006; Zollei *et al*., 2005), etc., can be extended to align the molecular and cellular images (Peng, 2008). Nonrigid or elastic registration approaches are required to cope with the shape changes of live cells (Rohr *et al.*, 2010). An intensity-based nonrigid registration approach for cell microscopy images, which relies on an optic flow scheme and uses segmented images, has been developed recently (Yang *et al*., 2008). An intensity-based approach has been used to register segmented 2D static images of different cell nuclei (Rohde *et al*., 2008), and a biomechanical model has been used to register 3D segmented images of cell nuclei (Gladilin *et al*., 2008). An intensity-based nonrigid registration approach that directly analyses the intensity information without requiring a segmentation step has been developed (Kim *et al*., 2007). This approach relies on optic flow estimation and has been applied to register 2D and 3D time-lapse images of live cells for accurate analysis of protein particle movement.

#### **3.4 Tracking and motion analysis**

Dynamic cell population studies are becoming more and more important in understanding pathways and networks (Glory and Murphy, 2007). Live cell fluorescent video microscopy offers a wealth of information on the dynamic organisation of proteins and subcellular structures that is unavailable in static 2D and 3D imaging. With the addition of time, organelle dynamics as proteins are recruited, transported and expelled can be viewed in detail, and the passage through a cell of proteins and the structures that they interact with can be readily observed (Hamilton, 2009). Additionally, temporal parameters such as the change in size of nuclei and the duration between the different stages are important indicators of the cell division cycle (Zhou and Wong, 2006). There is also extensive work on analysing the behaviour of specific labelled proteins by tracking individual objects in time series images (Meijering *et al*., 2006).

Tracking denotes the repeated localisation of objects in successive images. The aim is to establish temporal correspondences between objects to analyse object motion (Rohr *et al*., 2010). Although finding correspondences is largely simplified when there is only one object in the images, this task is generally quite challenging when there are several or a large number of moving objects. Therefore, sophisticated multiple target tracking methods are required.
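The correspondence problem described above can be illustrated with the simplest possible linking rule: greedy nearest-neighbour assignment between two consecutive frames, with a maximum allowed displacement. This is a baseline sketch, not one of the sophisticated multiple-target methods the text refers to; the function name, the coordinates, and the `max_dist` value are invented for the example.

```python
import math

def link_frames(prev_pts, next_pts, max_dist):
    """Greedily link each object in prev_pts to at most one in next_pts.

    Candidate pairs are taken in order of increasing distance; pairs further
    apart than max_dist are never linked. Returns {prev_index: next_index}.
    """
    pairs = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(prev_pts)
        for j, q in enumerate(next_pts)
    )
    links, used_next = {}, set()
    for d, i, j in pairs:
        if d > max_dist:
            break  # all remaining pairs are even further apart
        if i in links or j in used_next:
            continue
        links[i] = j
        used_next.add(j)
    return links

# Object positions (x, y) in two successive frames, in arbitrary order.
frame0 = [(10.0, 10.0), (40.0, 12.0), (25.0, 30.0)]
frame1 = [(41.0, 13.0), (11.0, 11.5), (80.0, 80.0)]
links = link_frames(frame0, frame1, max_dist=5.0)
print(links)
```

Here the third object of `frame0` finds no partner within `max_dist` and is treated as having disappeared, while the distant detection in `frame1` is treated as a new object. With many closely spaced objects the greedy rule makes frequent mistakes, which is why more elaborate methods are needed.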

Object tracking from fluorescent video microscopy presents many challenges (Hamilton, 2009). Objects viewed may join, split, disappear, change direction or substantially change their morphology, and there are technical challenges such as photobleaching and compromises between spatial and temporal resolution. Tracking algorithms developed in other research areas and adapted to fluorescent video microscopy tend to perform poorly, and considerable research has gone into designing algorithms specific to fluorescent imaging (reviewed in Kalaidzidis, 2009).

#### **3.5 Classification**

A last step in image analysis is to assign objects to different classes. Automatic classification methods can be divided into supervised and unsupervised learning methods (Glory & Murphy, 2007). Supervised learning methods allow classification into predefined classes and require training of the classifier with a set of annotated examples. In unsupervised learning methods, the classes do not need to be known in advance. Supervised learning methods are used for cell microscopy since the classes are known in advance. Commonly used classifiers are artificial neural networks (Boland & Murphy, 2001), *k*-nearest-neighbour classifiers (Chen *et al*., 2006), and support vector machines (Conrad *et al*., 2004; Huang & Murphy, 2004; Harder *et al*., 2008).
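A *k*-nearest-neighbour classifier is the easiest of the classifiers named above to sketch from scratch. In the example below, the two features per cell, their values, and the class names "interphase" and "mitotic" are all hypothetical and chosen only to show the mechanics of supervised classification with annotated training examples.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict the majority label among the k training samples nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features per cell: (nuclear area, mean intensity), arbitrary units.
training = [
    ((100.0, 0.20), "interphase"),
    ((110.0, 0.25), "interphase"),
    ((105.0, 0.22), "interphase"),
    ((60.0, 0.80), "mitotic"),
    ((55.0, 0.85), "mitotic"),
    ((65.0, 0.75), "mitotic"),
]
pred = knn_classify(training, (58.0, 0.82))
print(pred)
```

In practice the features would usually be rescaled to comparable ranges first, because the Euclidean distance used here is otherwise dominated by the feature with the largest numerical range (the area in this toy data).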

### **4. Biological network analysis**

#### **4.1 Network modelling**

Network modelling is a key step for processing dynamic proteomics data, because a network model provides: (i) a means of understanding how detected proteins are associated with underlying network operations, and (ii) a platform into which other useful information (such as protein abundances and localisation) can be integrated.

A cell is an enormously complex entity made up of myriad interacting molecular components that perform the biochemical reactions that maintain life. A cell can be described through the set of interconnections between its component molecules according to the network hypothesis (De Los Rios & Vendruscolo, 2010). The central dogma in molecular biology describes the way in which a cell processes the information required to produce the molecules necessary to maintain life and reproduce (Crick, 1970). In order to obtain a more complete description of the functioning of a cell, a deeper understanding is needed of the manner in which the sets of interconnections between these molecules define the identity of the cell itself (De Los Rios & Vendruscolo, 2010). Therefore, it is important to investigate whether the genetic makeup of an organism specifies not only the rules for generating proteins, but also the way in which these proteins interact among themselves and with the other molecules in a cell. Networks provide a way to organise and regulate complex systems efficiently. In an effective network, different parts are linked while keeping the number of interconnections to a minimum. A network is also a powerful way to represent the data in one object and to enable the quantitative assessment of the fragility or robustness of the system. The biological molecules in a cellular system are individual molecules, which affect each other by pairwise interactions (Chen *et al*., 2009). A cascade of those pairwise interactions forms a local structure (i.e. a linear pathway or a subnetwork), which transforms local perturbations into a functional response. All linear pathways or subnetworks are assembled into a global biomolecular network, which eventually generates global behaviours and is responsible for the complex life of a living organism.

Gene products, such as mRNA and proteins, are produced through the transcription and translation processes. Gene, mRNA, and protein are known as biological molecules or basic components (Chen *et al*., 2009). The complicated relations and interactions between these components are responsible for diverse cellular functions. Transcription factors (TFs) are DNA-binding proteins that can activate or inhibit the transcription of genes to synthesise mRNAs by regulating the activities of genes. Since these TFs themselves are products of genes, the final effect is that genes regulate each other's expression as part of a transcriptional (or translational) regulatory network (TRN) or gene regulatory network (GRN). At the proteome level, proteins participate in diverse posttranslational modifications of other proteins or form protein complexes and pathways together with other proteins. Such local associations between protein molecules are called protein-protein interactions (PPIs), which form a protein interaction network. The biochemical reactions in cellular metabolism can likewise be integrated into a metabolic network whose fluxes are regulated by enzymes that catalyse the reactions. In many cases, these interactions at different levels are integrated into a signaling network.

Multiple proteins in a cell are in dynamic interaction with each other, and these interactions underlie the functioning and behaviour of living cells (Terentiev *et al*., 2009). Reversible protein-protein interactions are among the dynamic processes that proceed in a cell and contribute to cell functioning. The whole set of protein-protein interactions of a given organism is referred to as the "interactome". The structural organisation of interactomes and the total number of interactions in them are important factors that determine the complexity of biological systems. The number of copies of a certain protein per cell can vary from several tens to millions (Ghaemmaghami *et al*., 2003). Even the interactomes of simple organisms can be formed by a rather large number of interactions. The determination of physically interacting protein pairs makes it possible to design interactome maps as graphs consisting of nodes, each representing a particular protein, and of links between them that indicate pairwise interactions. The interactome maps are considered key to obtaining knowledge on protein functioning (Rual *et al*., 2005).
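The graph view of an interactome can be sketched in a few lines of Python. The example below is illustrative only: the protein names `P1` to `P5` and the helper `build_interactome` are hypothetical, and the pairs do not come from any real data set. It builds an undirected graph from a list of interacting pairs and derives two simple quantities, the number of interactions and the most connected protein.

```python
from collections import defaultdict


def build_interactome(pairs):
    """Build an undirected graph (protein -> set of partners) from interacting pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


# Hypothetical physically interacting protein pairs.
pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P4")]
interactome = build_interactome(pairs)

# Degree of a node = number of interaction partners of that protein.
degree = {protein: len(partners) for protein, partners in interactome.items()}
hub = max(degree, key=degree.get)

# Each undirected link is counted once from each end.
n_interactions = sum(degree.values()) // 2

print(hub, degree[hub], n_interactions)
```

Storing the partners as sets makes repeated reports of the same pair harmless, which is convenient when interaction lists from several sources are merged.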

#### **4.2 Integration of biological networks**

#### **4.2.1 Network visualisation and analysis**

Many tools exist for network visualisation and analysis, including Cytoscape (Shannon *et al*., 2003), VisANT (Hu *et al*., 2009), Osprey (Breitkreutz *et al*., 2003), CellDesigner (Kitano *et al*., 2005), BioLayout (Goldovsky *et al*., 2005), GenMAPP (Dahlquist *et al*., 2002), PIANA (Aragues *et al.*, 2006), ProViz (Iragne *et al*., 2005), and Patika (Demir *et al*., 2002). These systems play a key role in the development of integrative biology, systems biology and integrative bioinformatics. The trend in the development of these tools is to go beyond static representations of cellular states, towards a more dynamic model of cellular processes through the incorporation of gene expression data, subcellular localisation information and time-dependent behaviour (Suderman & Hallett, 2007).

Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework (Shannon *et al*., 2003). In Cytoscape, nodes representing biological entities, such as proteins or genes, are connected with edges representing pairwise interactions, such as experimentally determined protein–protein interactions. Nodes and edges can have associated data attributes describing properties of the protein or interaction. A key feature of Cytoscape is its ability to set visual aspects of nodes and edges, such as shape, color and size, based on attribute values. This data-to-visual attribute mapping allows biologists to synoptically view multiple types of data in a network context. Additionally, Cytoscape allows users to extend its functionality by creating or downloading additional software modules known as "plugins".
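The idea of data-to-visual attribute mapping can be illustrated independently of any particular tool. The sketch below is plain Python and does not use the Cytoscape API; the node names, the attribute keys `type` and `expression`, and the helper `continuous_colour` are assumptions made for illustration. A discrete mapping assigns a shape per node type and a continuous mapping converts an expression value into a colour between blue and red.

```python
def continuous_colour(value, low, high):
    """Linearly map value in [low, high] to a hex colour from blue (low) to red (high)."""
    t = (value - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return f"#{red:02x}00{blue:02x}"


# Discrete mapping: one shape per node type (hypothetical choices).
SHAPE_BY_TYPE = {"protein": "ellipse", "gene": "rectangle"}

# Hypothetical node attributes, e.g. log-ratio expression values.
nodes = {
    "A": {"type": "protein", "expression": -2.0},
    "B": {"type": "gene", "expression": 0.0},
    "C": {"type": "protein", "expression": 2.0},
}

styles = {
    name: {
        "shape": SHAPE_BY_TYPE[attrs["type"]],
        "fill": continuous_colour(attrs["expression"], -2.0, 2.0),
    }
    for name, attrs in nodes.items()
}

for name, style in styles.items():
    print(name, style)
```

Keeping the mapping separate from the data means the same network can be re-styled for a different attribute, for example a later time point, without touching the network itself.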

VisANT is a web-based software framework for visualising and analysing many types of networks of biological interactions and associations (Hu *et al*., 2005). Given user-defined sets of interactions or groupings between genes or proteins, VisANT provides: (i) a visual interface for combining and annotating network data, (ii) supporting function and annotation data for different genomes from the Gene Ontology and KEGG databases, and (iii) the statistical and analytical tools needed for extracting topological properties of the user-defined networks. The new VisANT (v3.5) functions can be classified into three categories (Hu *et al*., 2009). (i) Visualisation: a new tree-based browser allows visualisation of GO hierarchies. GO terms can be easily dropped into the network to group genes annotated under the term, thereby integrating the hierarchical ontology with the network. This facilitates multi-scale visualisation and analysis. (ii) Flexible annotation schema: in addition to conventional methods for annotating network nodes with the most specific functional descriptions available, VisANT also provides functions to annotate genes at any customized level of abstraction. (iii) Finding over-represented GO terms and expression-enriched GO modules: two new algorithms have been implemented as VisANT plugins. One detects over-represented GO annotations in any given sub-network and the other finds the GO categories that are enriched in a specified phenotype or perturbed dataset. Both algorithms take account of network topology (i.e. correlations between genes based on various sources of evidence).

Osprey is a Java-based network visualisation and analysis tool for protein-protein and genetic interaction networks (Breitkreutz *et al*., 2003). Osprey builds data-rich graphical representations that are color-coded for gene function and experimental interaction data. Mouse-over functions allow rapid elaboration and organisation of network diagrams in a spoke model format. User-defined large-scale datasets can be readily combined with Osprey for comparison of different methods.

GenMAPP is a free computer application designed to visualise gene expression and other genomic data on maps representing biological pathways and groupings of genes (Dahlquist *et al*., 2002). Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology terms (MAPPFinder), import lists of genes/proteins to build new MAPPs (MAPPBuilder), and export archives of MAPPs and expression/genomic data to the web. The main features underlying GenMAPP are: (i) drawing pathways with easy-to-use graphics tools, (ii) coloring genes on MAPP files based on user-imported genomic data, and (iii) querying data against MAPPs and the Gene Ontology.

CellDesigner is a structured diagram editor for drawing gene-regulatory and biochemical networks (Kitano *et al*., 2005). Networks are drawn based on the process diagram, using the graphical notation system proposed by Kitano, and are stored using the Systems Biology Markup Language (SBML), a standard for representing models of biochemical and gene-regulatory networks. Networks can be linked with simulation and other analysis packages through the Systems Biology Workbench (SBW). CellDesigner supports simulation and parameter scans through integration with SBML ODE Solver and Copasi. With CellDesigner, users can browse and modify existing SBML models with references to existing databases, and simulate and view the dynamics through an intuitive graphical interface.

BioLayout uses a general approach for the representation and analysis of networks of variable type, size and complexity (Goldovsky *et al*., 2005). The application is based on the original BioLayout program (a C-language implementation of the Fruchterman-Reingold layout algorithm), entirely rewritten in Java to guarantee portability across platforms. BioLayout(Java) provides broader functionality, various analysis techniques, extensions for better visualisation and a new user interface.
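For readers unfamiliar with force-directed layout, the following is a minimal Python sketch of the Fruchterman-Reingold scheme: all node pairs repel each other, connected nodes attract each other, and the maximum step per iteration shrinks as a "temperature" cools. It is a textbook-style simplification and not the BioLayout implementation; the function name, the default frame size, the cooling schedule and the example graph are assumptions.

```python
import math
import random


def fruchterman_reingold(nodes, edges, width=1.0, height=1.0, iterations=100, seed=0):
    """Return {node: [x, y]} positions from a basic force-directed layout."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(0, width), rng.uniform(0, height)] for v in nodes}
    k = math.sqrt(width * height / len(nodes))  # ideal distance between nodes
    temperature = width / 10.0                  # maximum displacement per step
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsive force between every pair of nodes: k^2 / d.
        for i, v in enumerate(nodes):
            for u in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[v][0] += dx / d * f
                disp[v][1] += dy / d * f
                disp[u][0] -= dx / d * f
                disp[u][1] -= dy / d * f
        # Attractive force along each edge: d^2 / k.
        for v, u in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[v][0] -= dx / d * f
            disp[v][1] -= dy / d * f
            disp[u][0] += dx / d * f
            disp[u][1] += dy / d * f
        # Move each node by at most `temperature` and keep it inside the frame.
        for v in nodes:
            dx, dy = disp[v]
            d = math.hypot(dx, dy) or 1e-9
            step = min(d, temperature)
            pos[v][0] = min(width, max(0.0, pos[v][0] + dx / d * step))
            pos[v][1] = min(height, max(0.0, pos[v][1] + dy / d * step))
        temperature -= cooling
    return pos


# Hypothetical small network: a triangle with one pendant node.
example_nodes = ["a", "b", "c", "d"]
example_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
layout = fruchterman_reingold(example_nodes, example_edges)
for node, (x, y) in layout.items():
    print(node, round(x, 3), round(y, 3))
```

The pairwise repulsion loop is quadratic in the number of nodes, which is why practical tools for large networks usually add grid or tree approximations on top of this basic scheme.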

PIANA (Protein Interactions And Network Analysis) facilitates working with protein interaction networks by (i) integrating data from multiple sources, (ii) providing a library that handles graph-related tasks and (iii) automating the analysis of protein-protein interaction networks (Aragues *et al.*, 2006). PIANA can also be used as a stand-alone application to create protein interaction networks and perform tasks such as predicting protein interactions and helping to identify spots in a 2D electrophoresis gel.

ProViz is a tool for the visualisation of protein-protein interaction networks, developed by the IntAct European project (Iragne *et al*., 2005). It provides facilities for navigating in large graphs and exploring biologically relevant features, and adopts emerging standards such as GO and PSI-MI.

Patika (Pathway Analysis Tool for Integration and Knowledge Acquisition) is based on an ontology for a comprehensive representation of cellular events (Demir *et al*., 2002). The ontology enables integration of fragmented or incomplete pathway information and supports manipulation and incorporation of the stored data, as well as multiple levels of abstraction. Patika is composed of a server-side, scalable, object-oriented database and client-side editors to provide an integrated, multi-user environment for visualising and manipulating networks of cellular events. This tool features automated pathway layout, functional computation support, advanced querying and a user-friendly graphical interface.

#### **4.2.2 Subcellular localisation**

Tools that take subcellular localisation into account include (Suderman & Hallett, 2007) the Cytoscape plugin Cerebral (Barsky *et al*., 2007), Patika (Demir *et al*., 2002; see 4.2.1), and Cell Illustrator (Nagasaki *et al*., 2010). Cerebral (Cell Region-Based Rendering and Layout) is an open-source Java plugin for the Cytoscape biomolecular interaction viewer. Given an interaction network and subcellular annotation, Cerebral automatically generates a view of the network in the style of traditional pathway diagrams, providing an intuitive interface for the exploration of a biological pathway or system. The molecules are separated into layers according to their subcellular localisation. Potential products or outcomes of the pathway can be shown at the bottom of the view, clustered according to any molecular attribute data (protein function, for example). Cerebral scales well to networks containing thousands of nodes.

Patika partitions the drawing space into regions corresponding to the subcellular localisations and then searches for layouts where nodes are forcibly constrained to their respective locations (Demir *et al*., 2002). It makes use of a modified force-directed algorithm to achieve this.

Cell Illustrator is a software platform for systems biology that uses the concept of Petri nets for modeling and simulating biopathways (Nagasaki *et al*., 2010). It is intended for biological scientists working at the bench. The recent version, Cell Illustrator 4.0, uses Java Web Start technology and is enhanced with new capabilities, including: automatic graph grid layout algorithms using ontology information; tools using Cell System Markup Language (CSML) 3.0 and Cell System Ontology 3.0; a parameter search module; a high-performance simulation module; a CSML database management system; conversion from CSML models to programming languages (FORTRAN, C, C++, Java, Python and Perl); import from SBML, CellML, and BioPAX; and export to SVG and HTML. Cell Illustrator employs an extension of hybrid Petri nets in an object-oriented style so that biopathway models can include objects such as DNA sequence, molecular density, 3D localisation information, transcription with frame-shift, translation with codon table, as well as biochemical reactions.
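The Petri net concept can be shown with a much simpler model than the hybrid, object-oriented extension used by Cell Illustrator. The Python sketch below implements an ordinary discrete Petri net: places hold tokens, and a transition can fire when each of its input places holds enough tokens. The place names, the two transitions and the helper functions are hypothetical and are not part of CSML or of the Cell Illustrator software.

```python
def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in transition["inputs"].items())


def fire(marking, transition):
    """Return the new marking after firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new = dict(marking)
    for place, n in transition["inputs"].items():
        new[place] = new.get(place, 0) - n   # consume input tokens
    for place, n in transition["outputs"].items():
        new[place] = new.get(place, 0) + n   # produce output tokens
    return new


# Toy biopathway: the gene and the mRNA act as catalysts and are returned.
transcription = {"inputs": {"gene": 1}, "outputs": {"gene": 1, "mRNA": 1}}
translation = {"inputs": {"mRNA": 1}, "outputs": {"mRNA": 1, "protein": 1}}

marking = {"gene": 1, "mRNA": 0, "protein": 0}
can_translate_initially = enabled(marking, translation)  # no mRNA yet

marking = fire(marking, transcription)
marking = fire(marking, translation)
marking = fire(marking, translation)
print(marking)
```

A hybrid Petri net additionally allows continuous places and transitions with rates, so that quantities such as concentrations can change smoothly over time rather than in whole tokens.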

#### **4.2.3 Network integration for cellular microarray data using Cytoscape**

Data from cellular microarray experiments include a list of differentially expressed proteins, i.e. changed fluorescence intensity (protein abundance), as a function of time and localisation in the cell. Integration of these data with other available biological network data for a specific organism can be performed using the above listed software platforms (see 4.2.1), e.g. by using Cytoscape supplemented with the available plugins.

The cellular microarray data can be mapped to the protein interactome. Network data related to these proteins can be imported into Cytoscape in three ways: querying interaction databases using cPath (Cerami *et al*., 2006), building an association network through text mining using the Agilent Literature Search plugin (Vailaya *et al*., 2005), and loading the user's own network data from a text file. Additionally, pathways from repositories such as KEGG (Wixon & Kell, 2000) and Reactome (Joshi-Tope *et al*., 2005) can be imported via the PSI-MI, BioPAX, or SBML data exchange formats (Strömbäck *et al*., 2006).

Networks can be analysed further using topological information, and using combined information of various types, such as GO annotations and known pathways. Network modules enriched in GO terms and pathways (functional enrichment) can be identified. For this purpose, the Cytoscape plugins BiNGO (Maere *et al*., 2005) and DAVID (Dennis *et al*., 2003; Huang da *et al*., 2009) can be employed. GO Biological Process (GOBP) trees with nodes corresponding to GOBP terms are generated using the BiNGO plugin. GOBP terms that differ in their degree of enrichment can be identified, as can sets of network nodal proteins belonging to such GOBP terms. Pathway enrichment analysis can be performed for network nodal proteins using DAVID.
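Functional enrichment of this kind is commonly assessed with an over-representation test. The Python sketch below computes a one-sided hypergeometric p-value for a single GO term; it is a generic illustration under assumed inputs, not a description of the exact statistics or multiple-testing correction applied by BiNGO or DAVID. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of proteins in the reference set.
    annotated:  number of reference proteins annotated with the GO term.
    sample:     number of proteins in the network module being tested.
    hits:       number of module proteins annotated with the GO term.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probabilities of observing `hits` or more annotated proteins.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 proteins in total, 5 carry the term,
# and all 4 proteins of the tested module carry it.
p = enrichment_p_value(population=20, annotated=5, sample=4, hits=4)
print(p)
```

When many GO terms are tested for the same module, the resulting p-values need a multiple-testing correction before they are interpreted.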

Network structures and active subnetworks can be explored using the Cytoscape plugins MCODE (Bader & Hogue, 2003) and jActiveModules (Ideker *et al*., 2002). The MCODE plugin can be used to generate network clusters within which proteins are densely connected, whereas proteins in different clusters interact only loosely. Both the core network modules and their dynamic relationships can be identified by integrating time-dependent protein abundance information. In addition, active networks can be identified among network modules using jActiveModules, which selects networks with high collective abundances.

Fig. 1. Work scheme for on-chip cellular microarray screening and biological network analysis.

#### **5. Summary**

This chapter reviews on-chip living-cell microarrays for studying network biology. A general work scheme is shown in Figure 1. Microfluidic technology holds great promise for the creation of advanced cell culture models. In combination with time-lapse fluorescence microscopy, image analysis and data mining, it can be used to observe multiple cellular components over time in individual cells, e.g. the dynamics of an FP-tagged protein. Integrating dynamic localisomics data with other available biological network data allows a quantitative system-wide analysis to be performed for a particular cell.

Cell assays in microfluidic chips that have been used for cellular microarrays are discussed in detail in this chapter. Next, image analysis algorithms for extracting dynamic proteomics data from cellular microarray experiments are reviewed. In the last part, the integration of cellular microarray data into a network model and the options for network analysis are discussed.

#### **6. Acknowledgment**

R. Willaert is supported by the Belgian Federal Science Policy Office and European Space Agency (ESA) PRODEX program, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT) and the Research Council of the VUB.

#### **7. References**

Anderson, J.R., Chiu, D.T., Jackman, R.J., Cherniavskaya, O., McDonald, J.C., Wu, H., Whitesides, S.H., & Whitesides, G.M. (2000) Fabrication of topologically complex three-dimensional microfluidic systems in PDMS by rapid prototyping. *Anal. Chem*., Vol. 72, pp. 3158-3164.

Anderson, D.G., Levenberg, S., & Langer, R. (2004) Nanoliter-scale synthesis of arrayed biomaterials and application to human embryonic stem cells. *Nature Biotechnol*., Vol. 22, pp. 863-866.

Angres, B. (2005) Cell microarrays. *Expert. Rev. Mol. Diagn*., Vol. 5, pp. 769-779.

Aragues, R., Jaeggi, D., & Oliva, B. (2006) PIANA: protein interactions and network analysis. *Bioinformatics*, Vol. 22, pp. 1015-1017.

Bader, G.D., & Hogue, C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. *BMC Bioinformatics*, Vol. 4, pp. 2.

Bakal, C., Aach, J., Church, G., & Perrimon, N. (2007) Quantitative morphological signatures define local signaling networks regulating cell morphology. *Science*, Vol. 316, pp. 1753-1756.

Barbulovic-Nad, I., Lucente, M., Sun, Y., Zhang, M., Wheeler, A.R., & Bussmann, M. (2006) Bio-microarray fabrication techniques - a review. *Crit. Rev. Biotechnol*., Vol. 26, pp. 237-259.

Barbulovic-Nad, I., & Wheeler, A.R. (2008) Cell assays in microfluidics. In: *Encyclopedia of microfluidics and nanofluidics*. D. Li (Ed.), 209-216, Springer, New York, USA.

Barsky, A., Gardy, J.L., Hancock, R.E., & Munzner, T. (2007) Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. *Bioinformatics*, Vol. 23, pp. 1040-1042.

Becker, H., & Gärtner, C. (2008) Polymer microfabrication technologies for microfluidic systems. *Anal. Bioanal. Chem*., Vol. 390, pp. 89-111.

Belien, J.A.M., Ginkel, H.A.H.M., Tekola, P., Ploeger, L.S., Poulin, N.M., Baak, J.P.A., & Diest, P.J. (2002) Confocal DNA Cytometry: A Contour-Based Segmentation Algorithm for Automated Three-Dimensional Image Segmentation. *Cytometry*, Vol. 49, pp. 12-21.

Berglund, L., Björling, E., Oksvold, P., Fagerberg, L., Asplund, A., Al-Khalili Szigyarto, C., Persson, A., Ottosson, J., Wernérus, H., Nilsson, P., Lundberg, E., Sivertsson, A., Navani S., Wester K., Kampf C., Hober S., Pontén F., & Uhlén M. (2008) A genecentric human protein atlas for expression profiles based on antibodies. *Mol. Cell Proteomics*, Vol. 7, pp. 2019-2027.

Biran, I., Rissin, D.M., Ron, E.Z., & Walt, D.R. (2003) Optical imaging fiber-based live bacterial cell array biosensor. *Anal. Biochem*., Vol. 315, pp. 106-113.

Bochner, B.R., Gadzinski, P., & Panomitros, E. (2001) Phenotype microarrays for high-throughput phenotypic testing and assay of gene function. *Genome Res*., Vol. 11, pp. 1246-1255.

Breitkreutz, B.J., Stark, C., & Tyers, M. (2003) Osprey: a network visualization system. *Genome Biol*., Vol. 4, pp. R22.

Breslauer, D.N., Lee, P.J., & Lee, L.P. (2006) Microfluidics-based systems biology. *Mol. BioSyst*., Vol. 2, pp. 97-112.

Cai, L., Friedman, N., & Xie, S. (2006) Stochastic protein expression in individual cells at the single molecule level. *Nature*, Vol. 440, pp. 358-362.


Cannell, M.B., McMorland, A., & Soeller, C. (2006) Image enhancement by deconvolution. In: *Handbook of biological confocal microscopy*. J.B. Pawley (Ed.), 488-500, Springer Science+Business Media, LLC, New York, ISBN 978-0387-25921-5.

Cerami, E.G., Bader, G.D., Gross, B.E., & Sander, C. (2006) cPath: open source software for collecting, storing, and querying biological pathways. *BMC Bioinformatics*, Vol. 7, pp. 497.

Charvin, G., Cross, F.R., & Siggia, E.D. (2009) Forced periodic expression of G1 cyclins phase-locks the budding yeast cell cycle. *Proc. Natl. Acad. Sci. USA*, Vol. 106, pp. 6632-6637.

Chen, D.S., & Davis, M.M. (2006) Molecular and functional analysis using live cell microarrays. *Curr. Opin. Chem. Biol*., Vol. 10, pp. 28-34.

Chen, L., Wang R.-S., & Zhang, X.-S. (2009) Biomolecular networks: methods and applications in systems biology. John Wiley & Sons, ISBN 978-0-470-24373-2, Hoboken, New Jersey.

Chen, X., Zhou, X., & Wong, S.T. (2006) Automated segmentation, classification, and tracking of cancer cell nuclei in time-lapse microscopy. *IEEE Trans Biomed. Eng.*, Vol. 53, pp. 762-766.

Cheong, R., Wang, C.J., & Levchenko, A. (2009) Using a microfluidic device for high-content analysis of cell signaling. *Sci Signal*., Vol. 2, pp. pl2.

Chin, V.I., Taupin, P., Sanga, S., Scheel, J., Gage, F.H., & Bhatia, S.N. (2004) Microfabricated platform for studying stem cell fates. *Biotechnol. Bioeng*., Vol. 88, pp. 399-415.

Chung, B.G., Flanagan, L.A., Rhee, S.W., Schwartz, P.H., Lee, A.P., Monuki, E.S., & Jeon, N.L. (2005) Human neural stem cell growth and differentiation in a gradient-generating microfluidic device. *Lab Chip*, Vol. 5, pp. 401-406.

Conrad, C., Erfle, H., Warnat, P., Daigle, N., Lörch, T., Ellenberg, J., Pepperkok, R., & Eils, R. (2004) Automatic identification of subcellular phenotypes on human cell arrays. *Genome Res.*, Vol. 14, pp. 1130-1136.

Conrad, C., Wünsche, A., Tan, T.H., Bulkescher, J., Sieckmann, F., Verissimo, F., Edelstein, A., Walter, T., Liebel, U., Pepperkok, R., & Ellenberg, J. (2011) Micropilot: automation of fluorescence microscopy-based imaging for systems biology. *Nature Methods*, Vol. 8, pp. 246-249.

Cortés, L., & Amit, Y. (2008) Efficient annotation of vesicle dynamics video microscopy. *IEEE Trans. Pattern Anal. Mach. Intell*., Vol. 30, pp. 1998-2010.

Crick, F. (1958) On protein synthesis. *The Symposia of the Society for Experimental Biology*, Vol. 12, pp. 138-163.

Crick, F. (1970) Central dogma of molecular biology. *Nature*, Vol. 227, pp. 561-563.

Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S.C., & Conklin, B.R. (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. *Nat. Genet*., Vol. 31, pp. 19-20.

De Los Rios, P. & Vendruscolo, M. (2010) Network views of the cell. In: *Networks in Systems Biology*, M. Buchanan, G. Caldarelli, P. De Los Rios, F. Rao & M. Vendruscolo, (Eds.), 4-13, Cambridge University Press, ISBN 978-0-521-88273-6.

Demir, E., Babur, O., Dogrusoz, U., Gursoy, A., Nisanci, G., Cetin-Atalay, R., & Ozturk, M. (2002) PATIKA: an integrated visual environment for collaborative construction and analysis of cellular pathways. *Bioinformatics*, Vol. 18, pp. 996-1003.

Dennis, G. Jr, Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., & Lempicki, R.A. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. *Genome Biol*., Vol. 4, pp. P3.

De Solorzano, C.O., Malladi, R., Lelievre, S.A., & Lockett, S.J. (2001) Segmentation of nuclei and cells using membrane related protein markers. *J. Microsc*., Vol. 201, pp. 404-415.

Dittrich, P.S., & Manz, A. (2006) Lab-on-a-chip: microfluidics in drug discovery. *Nat. Rev. Drug Discov*., Vol. 5, pp. 210-218.

Dorn, J.F., Danuser, G., & Yang, G. (2008) Computational processing and analysis of dynamic fluorescence image data. *Methods Cell Biol*., Vol. 85, pp. 497-538.

Dufour, A., Shinin, V., Tajbakhsh, S., Guillen-Aghion, N., Olivo-Marin, J.C., & Zimmer, C. (2005) Segmentation and Tracking Fluorescent Cells in Dynamic 3-D Microscopy with Coupled Active Surfaces. *IEEE Trans Image Processing*, Vol. 14, pp. 1396-1410.

Ekins, R., Chu, F., & Biggart, E. (1989) Development of microspot multi-analyte ratiometric immunoassay using dual fluorescent-labelled antibodies. *Anal. Chim. Acta*, Vol. 227, pp. 73-96.

Fernandes, T.G., Diogo, M.M., Clark, D.S., Dordick, J.S., & Cabral, J.M. (2009) High-throughput cellular microarray platforms: applications in drug discovery, toxicology and stem cell research. *Trends Biotechnol*., Vol. 27, pp. 342-349.

Fernandes, T.G., Kwon, S.J., Bale, S.S., Lee, M.Y., Diogo, M.M., Clark, D.S., Cabral, J.M., & Dordick, J.S. (2010) Three-dimensional cell culture microarray for high-throughput studies of stem cell fate. *Biotechnol. Bioeng*., Vol. 106, pp. 106-118.

Franssila, S. (2010) Introduction to microfabrication. Second edition. John Wiley & Sons, Chichester, UK.

Frenkel-Morgenstern, M., Cohen, A.A., Geva-Zatorsky, N., Eden, E., Prilusky, J., Issaeva, I., Sigal, A., Cohen-Saidon, C., Liron, Y., Cohen, L., Danon, T., Perzov, N., & Alon, U. (2010) Dynamic Proteomics: a database for dynamics and localizations of endogenous fluorescently-tagged proteins in living human cells. *Nucleic Acids Res*., Vol. 38(Database issue), pp. D508-512.

Ghaemmaghami, S., Huh, W.K., Bower, K., Howson, R.W., Belle, A., Dephoure, N., O'Shea, E.K., & Weissman, J.S. (2003) Global analysis of protein expression in yeast. *Nature*, Vol. 425, pp. 737-741.

Gladilin, E., Goetze, S., Mateos-Langerak, J., Van Driel, R., Eils, R., & Rohr, K. (2008) Shape normalization of 3D cell nuclei using elastic spherical mapping. *J. Microsc*., Vol. 231, pp. 105-114.

Glory, E., & Murphy, R.F. (2007) Automated subcellular location determination and high-throughput microscopy. *Dev. Cell*., Vol. 12, pp. 7-16.

Godinez, W.J., Lampe, M., Worz, S., Muller, B., Eils, R., & Rohr, K. (2007) Tracking of virus particles in time-lapse fluorescence microscopy image sequences. *Proc. IEEE Int. Symp. Biomed. Imaging*, pp. 272-299.

Goldovsky, L., Cases, I., Enright, A.J., & Ouzounis, C.A. (2005) BioLayout(Java): versatile network visualisation of structural and functional relationships. *Appl. Bioinformatics*, Vol. 4, pp. 71-74.

Gómez-Sjöberg, R., Leyrat, A.A., Pirone, D.M., Chen, C.S., & Quake, S.R. (2007) Versatile, fully automated, microfluidic cell culture system. *Anal. Chem*., Vol. 79, pp. 8557-8563.

Hamilton, N. (2009) Quantification and its applications in fluorescent microscopy imaging. *Traffic*., Vol. 10, pp. 951-961.

Harder, N., Eils, R., & Rohr, K. (2008) Automated classification of mitotic phenotypes of human cells using fluorescent proteins. *Methods Cell Biol*., Vol. 85, pp. 539-554.


Heiskanen, A., Emnéus, J., & Dufva, M. (2010) In: *Microfluidic based Microsystems: fundamentals and applications*. S. Kakac, B. Kosey, D. Li, Pramuanjaroenkij (Eds.), 427-452, Springer Science + Business Media B.V., Dordrecht, The Netherlands.

Hu, Z., Hung, J.H., Wang, Y., Chang, Y.C., Huang, C.L., Huyck, M., & DeLisi, C. (2009) VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. *Nucleic Acids Res*., Vol. 37(Web Server issue), pp. W115-121.

Huang, K., Lin, J., Gajnak, J.A., & Murphy, R.F. (2002) Image Content-based Retrieval and Automated Interpretation of Fluorescence Microscope Images via the Protein Subcellular Location Image Database. *Proc 2002 IEEE Intl Symp Biomed Imaging* (ISBI 2002), pp. 325-328.

Huang da, W., Sherman, B.T., & Lempicki, R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. *Nat. Protoc*., Vol. 4, pp. 44-57.

Huang, K., & Murphy, R.F. (2004) Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. *BMC Bioinformatics*, Vol. 5, pp. 78.

Huh, W.K., Falvo, J.V., Gerke, L.C., Carroll, A.S., Howson, R.W., Weissman, J.S., & O'Shea, E.K. (2003) Global analysis of protein localization in budding yeast. *Nature*, Vol. 425, pp. 686-691.

Hung, P.J., Lee, P.J., Sabounchi, P., Lin, R., & Lee, L.P. (2005) Continuous perfusion microfluidic cell culture array for high-throughput cell-based assays. *Biotechnol. Bioeng*., Vol. 89, pp. 1-8.

Ideker, T., Ozier, O., Schwikowski, B., & Siegel, A.F. (2002) Discovering regulatory and signalling circuits in molecular interaction networks. *Bioinformatics*, Vol. 18 Suppl 1, pp. S233-240.

Iragne, F., Nikolski, M., Mathieu, B., Auber, D., & Sherman, D. (2005) ProViz: protein interaction visualization and exploration. *Bioinformatics*, Vol. 21, pp. 272-274.

Jarvik, J.W., Fisher G.W., Shi C., Hennen L., Hauser C., Adler S., & Berget P.B. (2002) *In vivo* functional proteomics: Mammalian genome annotation using CD-tagging. *BioTechniques*, Vol. 33, pp. 852-866.

Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L., Lewis, S., Birney, E., & Stein, L. (2005) Reactome: a knowledgebase of biological pathways. *Nucleic Acids Res.*, Vol. 33(Database issue), pp. D428-432.

Joyce, A.R., & Palsson, B.Ø. (2006) The model organism as a system: integrating 'omics' data sets. *Nature Reviews Molecular Cell Biology*, Vol. 7, No. 4, pp. 198-210.

Kalaidzidis, Y. (2009) Multiple objects tracking in fluorescence microscopy. *J. Math. Biol*., Vol. 58, pp. 57-80.

Kholodenko, B.N., Hancock, J.F., & Kolch, W. (2010) Signalling ballet in space and time. *Nat. Rev. Mol Cell. Biol*., Vol. 11, pp. 414-426.

Kim, L., Toh, Y.C., Voldman, J., & Yu, H. (2007) A practical guide to microfluidic perfusion culture of adherent mammalian cells. *Lab Chip*, Vol. 7, pp. 681-694.

Kim, I., Yang, S., Le Baccon, P., Heard, E., Chen, Y.-C., Spector, D., Kappel, C., Eils, R., & Rohr, K. (2007) Non-rigid temporal registration of 2D and 3D multi-channel microscopy image sequences of human cells. *Proc. IEEE Int. Symp. Biomed. Imaging*, pp. 1328-1331.

Kim, L., Vahey, M.D., Lee, H.Y., & Voldman, J. (2006) Microfluidic arrays for logarithmically perfused embryonic stem cell culture. *Lab Chip*, Vol. 6, pp. 394-406.

Kitano, H., Funahashi, A., Matsuoka, Y., & Oda, K. (2005) Using process diagrams for the graphical representation of biological networks. *Nat. Biotechnol*., Vol. 23, pp. 961-966.

Ko, B., Seo, M., & Nam, J.Y. (2009) Microscopic cell nuclei segmentation based on adaptive attention window. *J. Digit. Imaging*., Vol. 22, pp. 259-274.

Kononen, J., Bubendorf, L., Kallioniemi, A., Bärlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G., & Kallioniemi, O.P. (1998) Tissue microarrays for high-throughput molecular profiling of tumor specimens. *Nat. Med*., Vol. 4, pp. 844-847.

Lee, P.J., & Di Carlo, D. (2009) In: *Single cell analysis: technologies and applications*. D. Anselmetti (Ed.), 135-160, Wiley-VCH Verlag GmbH & Co., Weinheim.

Lee, P.J., Hung, P.J., Rao, V.M., & Lee, L.P. (2006) Nanoliter scale microbioreactor array for quantitative cell biology. *Biotechnol. Bioeng*., Vol. 94, pp. 5-14.

Li, G., Liu, T., Nie, J., Guo, L., Chen, J., Zhu, J., Xia, W., Mara, A., Holley, S., & Wong, S.T. (2008) Segmentation of touching cell nuclei using gradient flow tracking. *J. Microsc.*, Vol. 231, pp. 47-58.

Lidstrom, M.E., & Meldrum, D.R. (2003) Life-on-a-chip. *Nat. Rev. Microbiol*., Vol. 1, pp. 158-164.

Liebel, U., Starkuviene, V., Erfle, H., Simpson, J.C., Poustka, A., Wiemann, S., & Pepperkok, R. (2003) A microscope-based screening platform for large-scale functional protein analysis in intact cells. *FEBS Letters*, Vol. 554, pp. 394-398.

Lin, G., Chawla, M.K., Olson, K., Guzowski, J.F., Barnes C.A., & Roysam B. (2005) Hierarchical, model-based merging of multiple fragments for improved three-dimensional segmentation of nuclei. *Cytometry*, Vol. 63A, pp. 20-33.

Lindblad, J., Wählby, C., Bengtsson, E., & Zaltsman, A. (2004) Image analysis for automatic segmentation of cytoplasms and classification of Rac1 activation. *Cytometry A*., Vol. 57, pp. 22-33.

Locke, J.C., & Elowitz, M.B. (2009) Using movies to analyse gene circuit dynamics in single cells. *Nat. Rev. Microbiol*., Vol. 7, pp. 383-392.

Maere, S., Heymans, K., & Kuiper, M. (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. *Bioinformatics*, Vol. 21, pp. 3448-3449.

Malpica, N., de Solórzano, C.O., Vaquero, J.J., Santos, A., Vallcorba, I., García-Sagredo, J.M., & del Pozo, F. (1997) Applying watershed algorithms to the segmentation of clustered nuclei. *Cytometry*., Vol. 28, pp. 289-297.

Martone, M.E., Tran, J., Wong, W.W., Sargis, J., Fong, L., Larson, S., Lamont, S.P., Gupta, A., & Ellisman, M.H. (2008) The Cell Centered Database project: An update on building community resources for managing and sharing 3D imaging data. *J. Struct. Biol.*, Vol. 161, pp. 220-231.

Megason, S.G., & Fraser, S.E. (2007) Imaging in systems biology. *Cell*, Vol. 130, pp. 784-795.

Meijering, E., Smal, I., & Danuser, G. (2006) Tracking in biomolecular imaging. *IEEE Signal Process. Mag*., Vol. 23, pp. 46-53.

Mettetal, J.T., Muzzey, D., Pedraza, J.M., Ozbudak, E.M., & van Oudenaarden, A. (2006) Predicting stochastic gene expression dynamics in single cells. *Proc. Natl. Acad. Sci. USA*, Vol. 103, pp. 7304-7309.

Miller, E. (2006) Data driven image models through continuous joint alignment. *IEEE Trans. Pattern Anal. Mach. Intell*., Vol. 28, pp. 236-250.


Nagasaki, M., Saito, A., Jeong, E., Li, C., Kojima, K., Ikeda, E., & Miyano, S. (2010) Cell Illustrator 4.0: A computational platform for systems biology. *In Silico Biol*., Vol. 10, pp. 0002.

Narayanaswamy, R., Niu, W., Scouras, A.D., Hart, G.T., Davies, J., Ellington, A.D., Iyer, V.R., & Marcotte, E.M. (2006) Systematic profiling of cellular phenotypes with spotted cell microarrays reveals mating-pheromone response genes. *Genome Biol*., Vol. 7, pp. R6-9.

Newberg, J., Hua, J., & Murphy, R.F. (2009) Location proteomics: systematic determination of protein subcellular location. In: *Methods in Molecular Biology, Systems Biology*, I.V. Maly, (Ed.), 313-332, Humana Press, ISBN 978-1-934115-64-0, New York, NY, USA.

Ortiz de Solorzano, C., Garcia Rodriguez, E., Jones, A., Pinkel, D., Gray, J.W., Sudar, D., & Lockett, S.J. (1999) Segmentation of confocal microscope images of cell nuclei in thick tissue sections. *J. Microsc*., Vol. 193, pp. 212-226.

Pedraza, J.M., & van Oudenaarden, A. (2005) Noise propagation in gene networks. *Science*, Vol. 307, pp. 1965-1969.

Peng, H. (2008) Bioimage informatics: a new area of engineering biology. *Bioinformatics*, Vol. 24, pp. 1827-1836.

Pepperkok, R., & Ellenberg, J. (2006) High-throughput fluorescence microscopy for systems biology. *Nat. Rev. Mol. Cell Biol*., Vol. 7, pp. 690-696.

Pisani, M.B., & Tadigadapa, S.A. (2010) Microfabrication techniques for microfluidic devices. In: *Methods in Bioengineering: Biomicrofabrication & biomicrofluidics*. J.D. Zahn (Ed.), 1-57, Artech House, Boston, USA.

Ponti, A., Vallotton, P., Salmon, W.C., Waterman-Storer, C.M., & Danuser, G. (2003) Computational analysis of F-actin turnover in cortical actin meshworks using fluorescent speckle microscopy. *Biophys J*., Vol. 84, pp. 3336-3352.

Radhakrishnan, R., Solomon, M., Satyamoorthy, K., Martin, L.E., & Lingen, M.W. (2008) Tissue microarray - a high-throughput molecular analysis in head and neck cancer. *J. Oral Pathol. Med*., Vol. 37, pp. 166-176.

Raser, J.M., & O'Shea, E.K. (2005) Review: Noise in gene expression: origins, consequences, and control. *Science*, Vol. 309, pp. 2010-2013.

Roach, K.L., King, K.R., Uygun, B.E., Kohane, I.S., Yarmush, M.L., & Toner, M. (2009) High throughput single cell bioinformatics. *Biotechnol. Prog*., Vol. 25, pp. 1772-1779.

Rohde, G.K., Ribeiro, A.J., Dahl, K.N., & Murphy, R.F. (2008) Deformation-based nuclear morphometry: capturing nuclear shape variation in HeLa cells. *Cytometry A*., Vol. 73, pp. 341-350.

Rohr, K., Godinez, W.J., Harder, N., Wörz, S., Mattes, J., Tvarusko, W., & Eils, R. (2010) Tracking and quantitative analysis of dynamic movement of cells and particles. In: *Live cell imaging: a laboratory manual*. R.D. Goldman, J.R. Swedlow, D.L. Spector (Eds.), 239-256, Cold Spring Harbor Laboratory Press, New York, ISBN 978-0-87969-893-5.

Rohr, K., Fornefett, M., & Stiehl, H.S. (2003) Spline-based elastic image registration: integration of landmark errors and orientation attributes. *Comput. Vis. Image Underst.*, Vol. 90, pp. 153-168.

Rowat, A.C., Bird, J.C., Agresti, J.J., Rando, O.J., & Weitz, D.A. (2009) Tracking lineages of single cells in lines using a microfluidic device. *Proc. Natl. Acad. Sci. USA*, Vol. 106, pp. 18149-18154.

Rual, J.F., Venkatesan, K., Hao T., *et al.* (2005) Towards a proteome-scale map of the human protein-protein interaction network. *Nature*, Vol. 437, pp. 1173-1178.

Sarti, A., de Solorzano, C.O., Lockett, S., & Malladi, R. (2000) A Geometric Model for 3-D Confocal Image Analysis. *IEEE Trans Biomedical Engineering*, Vol. 47, pp. 1600-1609.

Sbalzarini, I.F., & Koumoutsakos, P. (2005) Feature point tracking and trajectory analysis for video imaging in cell biology. *J. Struct. Biol*., Vol. 151, pp. 182-195.

Schubert, W., Bonnekoh, B., Pommer, A.J., Philipsen, L., Böckelmann, R., Malykh, Y., Gollnick, H., Friedenberger, M., Bode, M., & Dress, A.W. (2006) Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. *Nature Biotechnology*, Vol. 24, pp. 1270-1278.

Schwenk, J.M., Stoll, D., Templin, M.F., & Joos, T.O. (2002) Cell microarrays: an emerging technology for the characterization of antibodies. *Biotechniques*, Dec, Suppl, pp. 54-61.

Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., & Ideker, T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. *Genome Res*., Vol. 13(11), pp. 2498-2504.

Shariff, A., Kangas, J., Coelho, L.P., Quinn, S., & Murphy, R.F. (2010) Automated image analysis for high-content screening and analysis. *J. Biomol. Screen*., Vol. 15, pp. 726-734.

Shen, D., & Davatzikos, C. (2002) HAMMER: hierarchical attribute matching mechanism for elastic registration. *IEEE Trans. Med. Imaging*, Vol. 21, pp. 1421-1439.

Souchelnytskyi, S. (2005) Bridging proteomics and systems biology: what are the roads to be traveled? *Proteomics*, Vol. 5, pp. 4123-4137.

Strömbäck, L., Jakoniene, V., Tan, H., & Lambrix, P. (2006) Representing, storing and accessing molecular interaction data: a review of models and tools. *Brief Bioinform.*, Vol. 7, pp. 331-338.

Suderman, M., & Hallett, M. (2007) Tools for visually exploring biological networks. *Bioinformatics*, Vol. 23, pp. 2651-2659.

Sui, G., Lee, C., Kamei, K., Li, H., Wang, J-Y., Wang, J., Herschman, H.R., & Tseng, H. (2007) A microfluidic platform for sequential ligand labeling and cell binding analysis. *Biomed. Microdevices*, Vol. 9, pp. 301-305.

Swedlow, J.R., Goldberg, I.G., & Eliceiri, K.W. (2009) OME Consortium. Bioimage informatics for experimental biology. *Annu. Rev. Biophys*., Vol. 38, pp. 327-346.

Taniguchi, Y., Choi, P.J., Li, G.W., Chen, H., Babu, M., Hearn, J., Emili, A., & Xie, X.S. (2010) Quantifying *E. coli* proteome and transcriptome with single-molecule sensitivity in single cells. *Science*, Vol. 329, pp. 533-538.

Tay, S., Hughey, J.J., Lee, T.K., Lipniacki, T., Quake, S.R., & Covert, M.W. (2010) Single-cell NF-kappaB dynamics reveal digital activation and analogue information processing. *Nature*, Vol. 466, pp. 267-271.

Terentiev, A.A., Moldogazieva, N.T., & Shaitan, K.V. (2009) Dynamic proteomics in modeling of the living cell. Protein-protein interactions. *Biochemistry (Mosc)*, Vol. 74, pp. 1586-1607.

Thompson, D.M., King, K.R., Wieder, K.J., Toner, M., Yarmush, M.L., & Jayaraman, A. (2004) Dynamic gene expression profiling using a microfabricated living cell array. *Anal. Chem*., Vol. 76, pp. 4098-4103.


**28** 

**Novel Machine** 

*Cairo University,* 

*Egypt* 

**Learning Techniques for** 

*Faculty of Computers and Information,* 

**Micro-Array Data Classification** 

Neamat El Gayar, Eman Ahmed and Iman El Azab

Machine learning, data mining and pattern recognition have been quite often used in various contexts of medical and bioinformatics applications. Currently computational methods and tools available for that purpose are quite abundant. The main aim of this chapter is to outline to the practitioners the basic concepts of the fields focusing on essential machine learning tools and highlighting their best practices to be successfully used in the medical domain. We present a case study for DNA microarray classification using ensemble

The background section will begin by introducing the reader to the fields of pattern recognition, machine learning and data mining. It will then focus on some of the most

In particular in section 2 we review the most popular machine learning models for classification used in the context of the medical domain. We then describe one of the most powerful and widely used classifiers for high dimensional feature spaces; the support vector machines (SVM). We cover the area of classifier evaluation and comparison to provide practitioners with essential understanding of how to test, validate and select the appropriate models for their applications. Finally, we summarize the main advances in the field of

Section 3 presents a review of using machine learning in various fields of bioinformatics. In section 4, a recent case study on DNA microarray data that uses an ensemble of SVMs coupled with feature subset selection methods is presented. We show how the proposed model can alleviate the curse of dimensionality associated with expression-based

Section 5 describes the data used and the experiments conducted, while section 6 presents

Finally in section 7 we summarize the main contributions of this chapter and review the main guidelines to effectively use machine learning tools. We end this section by highlighting a set of challenges that need to be addressed and propose some future research

ensemble learning, feature subset selection and feature subset ensembles.

classification of DNA data in order to achieve stable and reliable results.

results and a comparative analysis for the proposed models.

**1. Introduction** 

directions in the field.

methods and feature subset selection techniques.

important concepts related to machine learning.


### **Novel Machine Learning Techniques for Micro-Array Data Classification**

Neamat El Gayar, Eman Ahmed and Iman El Azab *Faculty of Computers and Information, Cairo University, Egypt* 

#### **1. Introduction**

630 Bioinformatics – Trends and Methodologies

Machine learning, data mining and pattern recognition are widely used in medical and bioinformatics applications, and computational methods and tools for these purposes are now abundant. The main aim of this chapter is to outline the basic concepts of these fields for practitioners, focusing on essential machine learning tools and on best practices for using them successfully in the medical domain. We also present a case study on DNA microarray classification using ensemble methods and feature subset selection techniques.

The background section will begin by introducing the reader to the fields of pattern recognition, machine learning and data mining. It will then focus on some of the most important concepts related to machine learning.

In particular, in section 2 we review the most popular machine learning models for classification used in the medical domain. We then describe one of the most powerful and widely used classifiers for high-dimensional feature spaces, the support vector machine (SVM). We cover classifier evaluation and comparison to give practitioners an essential understanding of how to test, validate and select appropriate models for their applications. Finally, we summarize the main advances in ensemble learning, feature subset selection and feature subset ensembles.

Section 3 presents a review of using machine learning in various fields of bioinformatics.

In section 4, a recent case study on DNA microarray data that uses an ensemble of SVMs coupled with feature subset selection methods is presented. We show how the proposed model can alleviate the curse of dimensionality associated with expression-based classification of DNA data in order to achieve stable and reliable results.

Section 5 describes the data used and the experiments conducted, while section 6 presents results and a comparative analysis for the proposed models.

Finally, in section 7 we summarize the main contributions of this chapter and review the main guidelines for using machine learning tools effectively. We end that section by highlighting a set of challenges that need to be addressed and by proposing some future research directions in the field.

#### **2. Background**

#### **2.1 Pattern recognition, machine learning and data mining**

Pattern recognition can be defined as the categorization of input data into identifiable classes via the extraction of significant features or attributes of the data from a background of irrelevant detail (Duda et al., 2000). The task of pattern recognition is also viewed as a transformation from the measurement space to the feature space and finally to a decision space.

Machine learning techniques aim at producing a system that can learn from and adapt to its environment and hence exhibits a kind of intelligence essential for applications that lack known solutions (Alpaydin, 2004). Machine learning models very often attempt to optimize a criterion function by exploiting information from training examples.

Data mining, on the other hand, can be thought of as a collection of statistical, machine learning, pattern recognition and artificial intelligence tools that help uncover and extract 'hidden' knowledge from data. In the medical domain in particular, data mining often refers to techniques and methods that analyze large amounts of data. These techniques include, among many others, classification, clustering, association rule mining and regression or prediction.

*Cluster analysis* usually addresses segmentation problems. Its objective is to separate data with similar characteristics from dissimilar data. Cluster analysis is frequently the first task required in the mining process, and it can also be used for outlier detection, to identify samples with peculiar behavior. Among the simplest and most efficient clustering techniques are K-means, fuzzy K-means and self-organizing maps; more advanced methods include evolving clustering techniques and distributed clustering.

The purpose of *association rule mining*, on the other hand, is to search for the most significant relationships across a large number of variables or attributes. Association is sometimes viewed as one type of dependency in which affinities between data items are described (e.g., data items or events that frequently occur together or in sequence). Some techniques for association analysis are nonlinear regression, rule induction, the Apriori algorithm and Bayesian networks.

*Time Series prediction* is also an important aspect in data mining whereby the temporal structure and ordering of the data is utilized to estimate some future value based on current and past data samples. Time-series prediction encompasses a wide variety of applications.

As mentioned earlier, the purpose of this chapter is to provide a broad introduction to the fundamentals of machine learning suitable for bioinformatics. The rest of the chapter will mainly focus on the classification problem.

#### **2.2 Machine learning models for classification**

Classification usually refers to the process of devising models that can predict categorical (discrete, unordered) class labels. Machine learning models are often used for this purpose; they learn the class functions from a set of given training examples.

Popular machine learning classification models are decision tree classifiers, Bayesian classifiers, Bayesian belief networks, rule-based classifiers and backpropagation multi-layer neural networks (Hand et al., 2001). More recent approaches to classification include support vector machines and ensemble methods. Other approaches frequently encountered in the literature include *k*-nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough sets and fuzzy logic techniques.

According to a recent ranking (KDnuggets: Polls, 2006), common classification models used in the data mining community are decision trees, decision rules, logistic regression, artificial neural networks, support vector machines, the naïve Bayesian classifier and Bayesian networks.

It is worth mentioning at this point that, particularly in medical applications, more interpretable models are sometimes preferred. Such models possess characteristics such as the ability to make knowledge discovered from data explicit and communicable to domain experts, the provision of an explanation when the knowledge is deployed and used on new cases, and the ability to encode and use domain knowledge in the data analysis process (Bellazzi & Zupan, 2008). Decision trees and Bayesian networks are among the models that are easily explainable. Decision trees are sometimes preferred over more accurate classifiers because of their descriptive power, i.e. the ability to interpret the classification rules produced by the model. This is particularly important for safety-critical medical applications, where results must be understood by domain experts.

Moreover, because medical data are often imperfect, they are complemented in practice by domain knowledge. Building classification models with background knowledge makes it possible to take into account information that is already known and need not be rediscovered from data. Background knowledge can be expressed using different models, such as Bayesian models, decision rules and fuzzy rules.

From another perspective, bioinformatics applications, and DNA microarray data classification in particular, need more powerful tools to deal with the challenges posed by the low sample size, high dimensionality, noise and large biological variability present in the data.

We therefore devote the next subsections to reviewing support vector machines (SVMs), ensemble methods and feature subset selection techniques. These techniques are known to be robust tools for classification in noisy, high-dimensional and complex domains.

#### **2.3 Support vector machines**

This section reviews one of the most powerful and widely used classifiers for high-dimensional feature spaces, the support vector machine (SVM).

SVMs are binary classifiers that aim to produce an optimal decision boundary lying midway between the nearest data points of the two classes of the problem at hand.

For a linearly separable problem, SVMs discriminate between two classes by fitting an optimal separating hyper-plane midway between the closest training samples of the opposite classes in a multi-dimensional feature space. This is done by maximizing the margin, which is the distance between the closest training samples and the classifier.

Let *Z* be a training dataset with *N* samples in a *d*-dimensional feature space *R<sup>d</sup>*.

Each sample *x<sub>i</sub>* has a class label *y<sub>i</sub>* = ±1.

The objective is to find the linear hyper-plane represented by:

$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{1}$$

where *w* is the weight vector and *b* is the bias, chosen to maximize the margin under the constraint of correct classification. Because the margin is inversely proportional to the norm of *w*, minimizing that norm maximizes the margin. This forms the following optimization problem:

$$\min \left( \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \theta_i \right) \tag{2}$$

where *C* is the regularization parameter and *θ<sub>i</sub>* are the slack variables.
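To make equation (2) concrete, the following minimal sketch evaluates the objective for a fixed candidate hyper-plane. It assumes the usual soft-margin convention in which each slack variable equals the hinge term max(0, 1 - y_i f(x_i)); the chapter does not spell this out, and the toy data and the function names are purely illustrative.

```python
def decision(w, b, x):
    """Linear decision function f(x) = w . x + b from equation (1)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for equation (2).

    Assumption: each slack is the hinge term max(0, 1 - y_i * f(x_i)),
    the usual soft-margin choice; it is not stated in the chapter.
    """
    slacks = [max(0.0, 1.0 - yi * decision(w, b, xi)) for xi, yi in zip(X, y)]
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Toy data: two samples per class in a 2-dimensional feature space.
X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([0.5, 0.5], 0.0, X, y, C=1.0)
```

Only the last sample lies on the wrong side of the margin, so it is the only one with a non-zero slack. A larger *C* would penalize that violation more heavily relative to the margin term.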

For non-linearly separable classes, the input samples are mapped to a high-dimensional feature space using a kernel function. Thanks to the kernel trick, it is possible to work within the transformed feature space without having to map every sample explicitly.

Training the SVM requires finding a suitable value for the regularization parameter *C*.

The final SVM function for the non-linearly separable case is:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \, k(\mathbf{x}, \mathbf{x}_i) + b \tag{3}$$

where *α<sub>i</sub>* are the Lagrange multipliers.

A more detailed explanation can be found in (Abe, 2005).
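As an illustration of equation (3), the sketch below evaluates the kernel decision function for a new sample. The Gaussian (RBF) kernel is an assumed choice, since the chapter does not fix a kernel at this point, and the support vectors, multipliers and bias are hypothetical values rather than the result of training.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate equation (3): sum_i alpha_i * y_i * k(x, x_i) + b."""
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alphas, labels, support_vectors)) + b


# Hypothetical values; a real SVM obtains alphas and b by training.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
score = svm_decision([2.0, 2.0], support_vectors, labels, alphas, b=0.0)
predicted = 1 if score >= 0 else -1
```

The kernel is only ever evaluated between pairs of input samples, which is the practical content of the kernel trick: the high-dimensional mapping itself is never computed.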

#### **2.4 Ensemble learning**

Ensemble classifiers - also called Multiple Classifier Systems (MCS) - are based on designing several classifiers separately and then combining their outputs into a final classification decision. MCS are a preferred solution to recognition problems because they allow the simultaneous use of different types of feature descriptors, corresponding similarity measures and many classification procedures. Examples of these techniques include bagging, boosting and mixtures of experts, among others. Refer to (Roli & Giacinto 2002) (Kuncheva, 2004) (MCS series) for a good review of methods and research in that area.

Perhaps the most obvious motivation for classifier ensembles is the possibility to boost the classification accuracy by combining classifiers that make different errors or by combining local experts. The fact that the best individual classifier for the classification task at hand is very difficult to identify unless deep prior knowledge is available is also a motivation for using multiple classifiers. Another reason is when the features of a sample may be presented in very diverse forms, making it impossible to use them as input for one single classifier. Another rationale is the desire to boost efficiency by using simple and cheap classifiers that operate only on a small set of features. These are all cases that can be found in medical and bioinformatics data.

Classifier combination methods can be categorized according to the type of outputs produced by the classifiers (Kittler et al. 1998). Classifier outputs can be crisp outputs (also called abstract level), ranked lists of data classes or measurement level outputs. At the abstract level, a classifier outputs a unique label for every pattern to be classified. Such classifiers are usually combined by voting strategies, such as majority vote or weighted majority vote, or by trained fusion rules such as Behavioural Knowledge Space (Kuncheva, 2004).

For rank level classifiers, the output is a ranked list of labels for every pattern. Borda count is the common technique for combining these rankings: ranking functions assign votes to the classes based on their positions in the classifiers' rankings, and the final decision is taken as the minimum of the sum of these rankings. Finally, at the measurement level, the classifier output represents the degree of membership in each class. For this type of output, various combination rules can be applied, such as product, sum and mean. These rules are derived mainly from the Bayesian decision rule. Non-Bayesian combinations can also be applied, in which a weighted linear combination of classifiers is learnt using optimization techniques.
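The three output levels can be illustrated with one simple fusion rule each. The sketch below is only an illustration under stated assumptions: ranks are counted from 0 for the top position, ties are not handled specially, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Abstract level: return the label predicted by most classifiers."""
    return Counter(labels).most_common(1)[0][0]


def borda_count(rankings):
    """Rank level: sum each class's position (0 = best) over all rankings
    and return the class with the minimum total."""
    totals = {}
    for ranking in rankings:
        for position, label in enumerate(ranking):
            totals[label] = totals.get(label, 0) + position
    return min(totals, key=totals.get)


def sum_rule(score_rows):
    """Measurement level: add per-class scores and return the best class."""
    classes = score_rows[0].keys()
    return max(classes, key=lambda c: sum(row[c] for row in score_rows))


vote = majority_vote(["sick", "healthy", "sick"])
borda = borda_count([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
fused = sum_rule([
    {"sick": 0.6, "healthy": 0.4},
    {"sick": 0.3, "healthy": 0.7},
    {"sick": 0.8, "healthy": 0.2},
])
```

In the measurement level example, the second classifier alone would predict "healthy", but the summed scores favour "sick", which shows how fusion can override a single dissenting classifier.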

In our model described in section 4, we present a combiner based on a trainable SVM classifier that works on the measurement level outputs of the base classifiers.

#### **2.5 Feature subset selection and feature subset ensembles**

A common way to build base classifiers for further combination is by randomly selecting different subsets of features and training classifiers on those subsets. Feature subset selection should enforce diversity among classifiers created and hence lead to more robust ensembles.

In applications that are characterized by having a huge number of features, feature subset ensembles can be used in order to make use of all the features.

This method works by sub-sampling the features so that the base classifiers in the ensemble are built on different subsets of features, either disjoint or overlapping. Instead of overwhelming a single classifier with all the features, individual classifiers are built on groups of features and their decisions are then combined to reach the final decision.

The choice of features for each subset depends on the problem at hand. The features may be naturally grouped forming the feature subsets. They can also be selected by any available feature selection method. The random subspace methods (Ho, 1998) work well when there is redundant information dispersed across all the features (Kuncheva, 2004).

Also, various heuristic search techniques such as genetic algorithms, tabu search and simulated annealing are used for feature subset selection. The feature subsets can be selected one at a time or all at the same time in one run of the algorithm by optimizing some ensemble performance criterion function (Kuncheva, 2004).

Random selection is the intuitive way for selecting samples and is the simplest method available. It assumes a uniform distribution for all the samples.

There are two types of random selection: *Random selection without replacement*, in which the samples are randomly selected and then removed so that they cannot be chosen again, and *Random selection with replacement*, in which the samples are randomly selected and then placed back so that they can be chosen again.
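Applied to feature indices, the two selection schemes can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the fixed seed and the round-robin split into near-equal subsets are assumptions made for the example, not details taken from the case study.

```python
import random


def disjoint_feature_subsets(n_features, n_subsets, seed=0):
    """Split feature indices into disjoint subsets by random selection
    without replacement: each index is drawn once and never reused."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so subset sizes differ by at most 1.
    return [sorted(indices[k::n_subsets]) for k in range(n_subsets)]


def bootstrap_features(n_features, subset_size, seed=0):
    """Random selection with replacement: an index may appear more than once."""
    rng = random.Random(seed)
    return [rng.randrange(n_features) for _ in range(subset_size)]


subsets = disjoint_feature_subsets(n_features=10, n_subsets=3)
resampled = bootstrap_features(n_features=10, subset_size=5)
```

Selection without replacement guarantees that every feature is used by exactly one base classifier, whereas selection with replacement can leave some features unused and repeat others.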

In our case study presented later, we use *Random selection without replacement*. We also propose a feature subset selection method based on the *K*-means clustering algorithm. *K*-means is a typical partition-based clustering method. Given a pre-specified number *K*, the algorithm iteratively partitions the data set until it obtains *K* disjoint subsets. In these iterations, *K*-means tries to minimize the sum of the squared distances of the samples from their cluster centres. It is a simple and fast algorithm. In our proposed approach the genes are the objects to be clustered, and they are characterized by their expression values across the samples in the microarray dataset.
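The idea of clustering genes to obtain feature subsets can be sketched with a plain *K*-means loop. This is not the implementation used in the case study; it is a minimal standard-library version under assumed settings (random initial centres, a fixed number of iterations, toy expression values), in which each resulting cluster is read as one feature subset.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain K-means: returns one cluster index per point."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centre.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                      for p in points]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Rows are genes, columns are samples (toy expression values).
genes = [[1.0, 1.1, 0.9], [1.2, 1.0, 1.0], [5.0, 5.2, 4.9], [5.1, 4.8, 5.0]]
clusters = kmeans(genes, k=2)
feature_subsets = [[g for g, c in enumerate(clusters) if c == j] for j in range(2)]
```

Each entry of `feature_subsets` lists the gene indices that one base classifier would be trained on. Note that the data matrix is oriented with genes as rows, the transpose of the usual samples-by-features layout.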

#### **2.6 Classifier testing and evaluation**

In the following we present the main concepts of classifier evaluation and comparison. We start by reviewing cross validation and then discuss the main performance measures that can be used to evaluate classification results.

#### **2.6.1 Cross validation**

A classifier usually learns from the available data. The problem is that the resulting classifier may fit the training data well yet fail to predict unseen data.

Cross validation is a technique for assessing the generalization performance of a given classifier. It can be used for estimating the performance of a given classifier as well as for tuning the model parameters.
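One common variant, *k*-fold cross validation, can be sketched as below. The choice of the *k*-fold scheme, the shuffling with a fixed seed, and the `fit` and `predict` callables are assumptions made for the example; any classifier can be plugged in through those two placeholders.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(X, y, k, fit, predict, seed=0):
    """Average test accuracy over k folds.

    `fit(X_train, y_train)` returns a model and `predict(model, x)` returns
    a label; both are caller-supplied placeholders.
    """
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k
```

Every sample is used for testing exactly once and is never part of the training set of the fold that tests it, which is what makes the averaged score an estimate of performance on unseen data. For parameter tuning, the same loop can be repeated for each candidate setting, keeping the one with the best average score.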

Methods of cross validation include *Re-substitution Validation*, *Hold-Out Validation*, *K-Fold Cross Validation*, and *Leave-One-Out Cross Validation* as will be described next.

In *Re-substitution Validation*, the whole available dataset is used for training the classifier, which is then tested on the same dataset. This makes it liable to overfitting: the classifier might perform well on the available data yet poorly on future unseen test data. In *Hold-Out Validation*, by contrast, the available dataset is split into two sets, one for training and the other for testing the model, so that the model is tested on unseen data. With this approach, the results are highly dependent on the choice of the training/test split. The instances in the test set may be too easy or too difficult to classify, and this can skew the results. Moreover, the instances in the test set may be valuable for training; when they are held out, the prediction performance may suffer, again leading to skewed results.

To overcome this drawback, in *K-Fold Cross Validation* the available data is divided into *k* equally sized folds. Subsequently, *k* iterations of training and validation are performed such that, within each iteration, a different fold of the data is held out for validation while the remaining (*k*-1) folds are used for training the classification model. Data is usually stratified prior to being split into *k* folds, i.e. the data is rearranged to ensure that each fold contains instances of all the classes in the problem at hand.

*Leave-One-Out Cross Validation (LOOCV)* is a special case of *k*-fold cross validation where *k* equals the number of instances in the data. In other words, in each iteration, all the data except for a single instance are used for training the model and the model is tested on that single instance. An accuracy estimate obtained using LOOCV is known to be almost unbiased but it has high variance.

To obtain more reliable performance estimates, multiple runs of *k*-fold cross validation can be applied. The data is reshuffled and re-stratified before each round. This is referred to as *Repeated K-Fold Cross Validation.* 
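The splitting schemes described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name, the round-robin way of stratifying and the `seed` parameter are hypothetical choices, not the procedure used in the case study.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Return k (train_indices, test_indices) pairs with classes spread over folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)  # reshuffle within each class
        # Deal the indices round-robin so every fold gets its share of the class.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    splits = []
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f in range(k) if f != held_out for i in folds[f])
        splits.append((train, test))
    return splits

# Hypothetical labels: 6 sick and 9 healthy samples.
labels = ["sick"] * 6 + ["healthy"] * 9
splits = stratified_k_fold(labels, k=3)
# Leave-one-out is the special case where k equals the number of instances.
loocv = stratified_k_fold(labels, k=len(labels))
```

Calling the function several times with different `seed` values gives the rounds of a repeated *k*-fold cross validation.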

#### **2.6.2 Performance measures**

In this section we review some of the most important measures to calculate classifier performance. In particular we discuss the accuracy, sensitivity, specificity and precision measures.

A classifier is tested by applying it to unseen test data with known classes and comparing the classes predicted by the classifier with the target classes.

The confusion matrix summarizes the correct and incorrect classifications produced by a given classifier. It displays both the actual target classes and the predicted classes. The matrix dimension is *M* × *M*, where *M* is the number of classes of the problem at hand. The entry *m<sub>ij</sub>* of such a matrix denotes the number of samples whose actual class is *w<sub>i</sub>* and which are assigned by the classifier to class *w<sub>j</sub>*.
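The definition of the matrix entries can be made concrete with a short sketch. The function name and the example labels are hypothetical; rows index the actual class and columns the predicted class, as in the definition above.

```python
def confusion_matrix(actual, predicted, classes):
    """m[i][j] = number of samples of actual class classes[i] assigned to classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        m[index[a]][index[p]] += 1
    return m

# Invented labels for eight test samples ("+" = positive, "-" = negative).
actual    = ["+", "+", "+", "-", "-", "-", "-", "+"]
predicted = ["+", "-", "+", "-", "+", "-", "-", "+"]
matrix = confusion_matrix(actual, predicted, classes=["+", "-"])
```

For these labels the matrix is `[[3, 1], [1, 3]]`: three positives and three negatives are classified correctly, and one sample of each class is misclassified.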

In medical diagnosis there are usually two classes: the positive class, which indicates infection or sickness, and the negative class, which indicates being healthy. For assessing the performance of a given classifier, there are other important measures that need to be considered in addition to accuracy. Among them are the sensitivity and the specificity. Sensitivity is the proportion of actually positive samples that are correctly classified as positive, while specificity is the proportion of actually negative samples that are correctly classified as negative.

In the confusion matrix in Figure 1, *A* represents the number of samples that actually belong to the positive class and are predicted to belong to the positive class. This is quite often referred to as the *true positive (TP)* count. In a medical application, this is the number of patients classified as sick who really are sick. *B*, on the other hand, represents the *false negatives (FN)*, i.e. the number of sick people (positive samples) who have been falsely classified as healthy (negative). Similarly, *C* is referred to as the *false positives (FP)*, while *D* indicates the *true negatives (TN)*.

| Actual target class | Predicted +ve | Predicted -ve |
|---|---|---|
| +ve | A | B |
| -ve | C | D |

Fig. 1. Confusion matrix.

**Accuracy** is the percentage of correctly classified samples. It can also be formulated as in equation 4 using TP, TN, FP and FN.

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \times 100\% \tag{4}$$

**Sensitivity** measures the ability of a classifier to recognize the positive class (in our application to detect sick people). It is also known as *True Positive Rate (TPR)* or *recall*.

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \times 100\% \tag{5}$$

On the other hand, **Specificity** measures the ability of a classifier to detect the negative class (i.e. healthy samples). It is also known as the *True Negative Rate (TNR)*.

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \times 100\% \tag{6}$$
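Equations 4 to 6 translate directly into code. The sketch below uses hypothetical function names and invented counts; like the equations, it returns percentages.

```python
def accuracy(tp, tn, fp, fn):
    """Equation 4: percentage of correctly classified samples."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """Equation 5: percentage of actual positives that are detected."""
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    """Equation 6: percentage of actual negatives that are detected."""
    return 100.0 * tn / (tn + fp)

# Invented counts: 50 sick and 50 healthy patients.
tp, fn, fp, tn = 40, 10, 5, 45
acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```

With these counts the accuracy is 85%, the sensitivity 80% and the specificity 90%, which shows how a single accuracy figure can hide a weaker detection rate for the positive class.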

The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using a receiver operating characteristic (*ROC*) curve. It is a graphical plot of the *sensitivity (TPR)* versus the *false positive rate (FPR)*, which equals (1 − *specificity*), for a binary classifier as its discrimination threshold is varied.

The ROC space is defined by two axes, *FPR* and *TPR*, representing the *x*-axis and the *y*-axis, respectively. This depicts the relative trade-offs between true positives (benefits) and false positives (costs). A point in the ROC space represents one prediction result of the classifier, i.e. one (*FPR*, *TPR*) pair.

A perfect classification would yield a point in the upper left corner, at coordinate (0, 1) of the ROC space, where there is 100% sensitivity (no false negatives) and 100% specificity (no false positives). The point (0, 0) represents a classifier that predicts all cases to be negative, while the point (1, 1) corresponds to a classifier that predicts every case to be positive. The point (1, 0) is the classifier that is incorrect for all classifications, as it means 100% false negatives and 100% false positives. The diagonal divides the ROC space. Generally, the points above the diagonal represent good classification results, while the points below it represent poor results.
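To show how the points of a ROC curve arise as the discrimination threshold is varied, here is a minimal sketch. It assumes the classifier outputs a real-valued score per sample, with higher scores meaning "more likely positive"; the function name and the scores are invented for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold.

    labels holds 1 for actual positives and 0 for actual negatives.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # A threshold above every score predicts all cases negative: point (0, 0).
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Invented classifier scores for three positive and three negative samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```

The curve starts at (0, 0) and ends at (1, 1), the point reached when the threshold is so low that every case is predicted positive.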


**Precision** is the proportion of true positives among all samples predicted as positive.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \times 100\% \tag{7}$$

**F1 score** (also **F-score** or **F-measure**) is a measure that considers both the precision and the recall of the test to compute the score.

$$\text{F} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$

This is also known as the *F*1 measure, because recall and precision are evenly weighted. It is a special case of the general *Fβ* measure (for non-negative real values of β).

$$F_{\beta} = \left(1 + \beta^2\right) \cdot \frac{\text{Precision} \times \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{9}$$

Two other commonly used *F* measures are the *F*2 measure, which weights recall higher than precision, and the *F*0.5 measure, which puts more emphasis on precision than recall. The measure chosen for performance evaluation is usually application dependent. In medical applications, precision and recall, in addition to accuracy, are usually very important. In our case study presented in section 4 we use accuracy, sensitivity, specificity and precision to evaluate the proposed model for DNA microarray data classification.
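The effect of β can be checked numerically. The sketch below uses the standard definition of the *F*β measure, in which β² multiplies precision in the denominator, and works with fractions in [0, 1] rather than percentages. The function names and the counts are invented.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected (sensitivity)."""
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """General F measure; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Invented counts in which precision (8/9) is higher than recall (4/5).
tp, fp, fn = 40, 5, 10
p, r = precision(tp, fp), recall(tp, fn)
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)
f05 = f_beta(p, r, beta=0.5)
```

Because recall is the weaker of the two values here, the recall-oriented *F*2 is the lowest score and the precision-oriented *F*0.5 the highest, with *F*1 in between.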

#### **3. Machine learning techniques in bioinformatics**

Due to the availability of huge amounts of data delivered by high-throughput biotechnologies, data management procedures are required to store and retrieve biological information efficiently (Valentini, 2008; Goble & Stevens, 2008). In addition, methods are needed to extract and model biological knowledge from the data (Baldi & Brunak, 2001).

Machine learning techniques deal with a wide range of bioinformatics problems in genomics, proteomics, gene expression analysis, biological evolution, systems biology, and other relevant bioinformatics domains (Valentini, 2008; Larranaga et al., 2005). In what follows, we briefly review the use of machine learning in each of these fields. The rest of the chapter, however, focuses on micro-array data classification.

The state of a cell consists of all those variables, both internal and external, which determine its behaviour. According to the Central Dogma of molecular biology, the activity of a cell is determined by which of its genes are expressed, i.e. which genes are "turned on", resulting in the active production of the respective proteins. When a particular gene is expressed, its DNA is first transcribed into the complementary messenger RNA (mRNA), which is then translated into the specific protein this gene codes for. We can measure the level of expression of each gene (i.e. how much each gene is "turned on") by measuring how many mRNA copies are present in the cell (Lander, 1996).

*Genomics* is one of the most important domains in bioinformatics. It studies biological sequences, such as DNA and RNA, at the genome level. Mathé et al. (2002) review some important applications, namely locating the genes in a genome and identifying their function. Ensemble methods have been applied to predict gene function and compared with a single classifier, as in (Re & Valentini, 2010), where several data sources are integrated, then input to SVM base classifiers, and combined using weighted average and decision templates. The ensembles outperform the single SVM classifier. Sequence information is also used for gene function and RNA structure prediction (Freyhult, 2007) as well as many other relevant genomics problems.

*Gene expression data analysis* is a well-established bioinformatics domain where Machine Learning methods for classification and clustering have been widely applied. *DNA gene expression microarrays* allow biologists to study genome-wide patterns of gene expression in any given cell type, at any given time, and under any given set of conditions (Baldi & Brunak, 2001). Gene expression data is arranged into a matrix where columns represent genes and rows represent samples. Each element in the matrix represents the expression level of a gene under a specific condition and is represented by a real number.

The use of these arrays produces large amounts of data, potentially capable of providing fundamental insights into biological processes ranging from gene function to development, cancer, aging and pharmacology (Baldi & Brunak, 2001). However, the data needs to be preprocessed first, i.e. modified to be suitable for use by machine learning algorithms. The data is then analyzed to look for useful information.

Clustering techniques such as *k*-means, hierarchical clustering (Eisen et al., 1998) and self-organizing maps (SOMs) (Tamayo et al., 1999) have been applied to group genes according to their functional similarities. These methods assume that related genes have similar expression patterns across all samples and hence divide the set of genes into disjoint groups. Accordingly, identifying local patterns, in which a subset of genes is similarly expressed over only a subset of samples, is difficult using traditional clustering techniques. AboHamad et al. (2010) propose a bi-clustering technique which clusters similarly expressed genes and a subset of samples simultaneously.

On the other hand, many classification techniques are used. The majority of papers published in the area of machine learning for genomic medicine deal with analyzing gene expression data coming from DNA microarrays, consisting of thousands of genes for each patient, with the aim of diagnosing (sub)types of diseases and obtaining a prognosis which may lead to individualized therapeutic decisions (Bellazi & Zupan, 2008). The published papers are mainly related to oncology, where there is a strong need for defining individualized therapeutic strategies (Mischel & Cloughesy, 2006). A seminal paper from this area is that of Golub et al. (1999), which focuses on the problem of the early differential diagnosis of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Several classification techniques have been applied to different benchmark datasets; among these are decision trees, the naïve Bayes classifier, the multilayer perceptron and SVMs, which have proved to be very effective in such applications. These classification approaches are usually coupled with feature (gene) selection methods to improve the performance. To avoid removing some features, the use of ensembles has emerged, with different ways of distributing features among subsets as explained in section 2.

*Proteomics* is the field that studies proteins. Proteins transform the genetic information into actions performed in life. The prediction of the secondary and tertiary structure of proteins represents one of the main challenges for Machine Learning methods in bioinformatics, because proteins are very complex macromolecules with thousands of atoms and bonds, so there is a huge number of possible structures. This makes protein structure prediction a very complicated combinatorial problem where optimization techniques are required. Neural networks have been applied to predict protein secondary structure (Baldi & Brunak, 2001). Machine learning is also applied for protein function prediction, fold recognition as well as other relevant proteomics problems (Valentini, 2008).

Fig. 2. Ensembles using SVM fusion. (Flow chart: Start; dataset; partition the features into *k* disjoint feature subsets FS1, FS2, …, FSk chosen using *k*-means clustering applied to the training set; train SVM1, SVM2, …, SVMk, each using its own feature subset of the training set; train the SVM combiner; test the whole ensemble on the test dataset; End. Cross validation is used to evaluate the proposed model using different partitions of Z for training and validation.)

*Systems biology* is an emerging bioinformatics area where Machine Learning techniques play a central role (Kitano, 2002). It is concerned with modelling biological processes inside the cell. Mathematical models and learning methods are required to model the biological networks ranging from genetic networks to signal transduction networks to metabolic pathways (Bower et al., 2004).

*Phylogenetic trees* are schematic representations of organisms' evolution. Machine Learning is applied for phylogenetic tree construction by comparisons made by multiple sequence alignment where many optimization techniques are used (Larranaga et al., 2005).

In what follows, we present a novel machine learning model for micro-array data classification. Experiments and comparative results demonstrate the efficiency of this model in dealing with high-dimensional DNA data.

#### **4. A case study of an SVM ensemble using feature subset selection for DNA classification**

In this section, we present a case study of an SVM ensemble that uses SVM base classifiers and another SVM classifier for combining the results of the base classifiers to get the final classification. The proposed ensemble uses *k*-means clustering for grouping the features into subsets. The ensemble is referred to as *k*-means-SVM fusion throughout the rest of this chapter.

The flow charts in Figures 2, 3 and 4 illustrate the main phases of building the ensemble. A dataset consisting of a set of labelled examples is initially given. Each example is characterized by a set of features and a label indicating its class. The dataset is divided into a training set, a validation set and a test set. The features are grouped into *k* feature subsets by applying *k*-means clustering to the training set. Then, each of the SVM base classifiers is trained using the training set restricted to the features of a single feature subset. The SVM classifier responsible for fusion is trained using the validation set, after which the ensemble is ready to be tested on the test set.

Figure 5 presents in more detail the steps of the algorithm for building the ensembles that use an SVM classifier for fusion. Initially, the available data set *Z*, containing all features, is divided into a training set *ZTrain*, a validation set *ZValid* and a test set *ZTest*. The training set *ZTrain* is used for building the base SVM classifiers and determining their parameters through cross validation: the training set is further split into a training part and a validation part, a grid search is applied over a range of values for the parameters, and the parameter values that give the best accuracy on the validation part are selected. The validation set *ZValid* is used to train the combiner SVM, while the test set *ZTest* is used to evaluate the overall ensemble.

Before any of the data partitions are used, we first perform a feature subset selection procedure to choose the *k* feature subsets, one as input for each base classifier. The input to base SVM classifier *i* is hence the training samples restricted to the features in cluster *i*. After the base classifiers are trained using the training portion *ZTrain*, the combiner is trained using the validation set *ZValid* as follows:

The outputs of the *k* SVMs are collected to form a new feature-sample training matrix for the SVM combiner, where each sample is characterized by the base SVM outputs as features and keeps its original label.

In the test phase, the overall accuracy of the ensemble is evaluated on the reserved test samples *ZTest*.
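The data flow of the fusion scheme can be sketched as follows: base classifiers are each trained on one feature subset of the training set, and a combiner is trained on their validation-set outputs. To keep the example self-contained, a simple nearest-centroid scorer stands in for every SVM; this substitution, the function names, the fixed feature subsets and the toy data are all assumptions made for illustration and do not reproduce the chapter's model.

```python
def train_centroid_scorer(rows, labels):
    """Stand-in learner (NOT an SVM): returns a function scoring a row.

    The score is positive when the row lies closer to the centroid of
    class +1 than to the centroid of class -1.
    """
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(col) / len(members) for col in zip(*members)]
    pos, neg = centroid(1), centroid(-1)
    def score(row):
        d_pos = sum((x - c) ** 2 for x, c in zip(row, pos))
        d_neg = sum((x - c) ** 2 for x, c in zip(row, neg))
        return d_neg - d_pos
    return score

def train_fusion_ensemble(train_x, train_y, valid_x, valid_y, feature_subsets):
    """Train one base scorer per feature subset, then a combiner on their outputs."""
    # Each base classifier sees only the features of its own subset.
    bases = [
        train_centroid_scorer([[row[j] for j in fs] for row in train_x], train_y)
        for fs in feature_subsets
    ]
    def base_outputs(rows):
        # New representation: one column per base classifier output.
        return [
            [base([row[j] for j in fs]) for base, fs in zip(bases, feature_subsets)]
            for row in rows
        ]
    # The combiner is trained on the validation set, using base outputs as features.
    combiner = train_centroid_scorer(base_outputs(valid_x), valid_y)
    def predict(rows):
        return [1 if combiner(z) > 0 else -1 for z in base_outputs(rows)]
    return predict

# Toy data: 4 features split into two hypothetical subsets (e.g. from clustering).
feature_subsets = [[0, 1], [2, 3]]
train_x = [[2, 2, 2, 2], [3, 2, 3, 2], [0, 0, 0, 0], [1, 0, 0, 1]]
train_y = [1, 1, -1, -1]
valid_x = [[2, 3, 2, 3], [3, 3, 2, 2], [0, 1, 1, 0], [0, 0, 1, 1]]
valid_y = [1, 1, -1, -1]
test_x = [[3, 3, 3, 3], [0, 0, 0, 0]]

predict = train_fusion_ensemble(train_x, train_y, valid_x, valid_y, feature_subsets)
test_predictions = predict(test_x)
```

Keeping the validation set separate from the training set means the combiner learns from base outputs on samples the base classifiers have not been fitted to, which is the reason for the three-way split of the data.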


Before any of the data parts are applied, we first perform a feature subset selection procedure to choose *k* subsets to be used as an input for each base classifier. The input to any base SVM classifier *i* is hence the training samples with features in the cluster *i*. After the base classifiers are trained using the training portion *ZTrain*, the combiner is trained using

The outputs of the *k* SVMs are collected to form a new feature-sample training matrix for the SVM combiner where each sample is characterized by the base SVM outputs as features and

In the test phase, the overall accuracy of the ensemble is tested using the remaining samples

the combiner SVM, while the test set *ZTest* is used to evaluate the overall ensemble.

fold recognition as well as other relevant proteomics problems (Valentini, 2008).

alignment where many optimization techniques are used (Larranaga et al., 2005).

pathways (Bower et al., 2004).

with high-dimensional DNA data.

the ensemble is ready to be tested on the test set.

the validation set *ZValid* as follows:

of *ZTest* reserved.

its label is the same as its original labels.

**classification** 

Fig. 2. Ensembles using SVM fusion.

Cross validation is used to evaluate the proposed model using different partitions of Z for training and validation.
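The way the combiner's training matrix is assembled in this section can be sketched as follows. This is a minimal illustration rather than the chapter's LibSVM implementation: the base classifiers are stand-in scoring functions instead of trained SVMs, training of the combiner itself is left out, and all names (`project`, `stack_outputs`, `feature_subsets`, the toy data) are hypothetical.

```python
from typing import Callable, List, Sequence

def project(sample: Sequence[float], subset: Sequence[int]) -> List[float]:
    """Keep only the features whose indices are listed in `subset`."""
    return [sample[j] for j in subset]

def stack_outputs(
    samples: Sequence[Sequence[float]],
    feature_subsets: Sequence[Sequence[int]],
    base_classifiers: Sequence[Callable[[List[float]], float]],
) -> List[List[float]]:
    """Build the meta-feature matrix: one row per sample, one column per base classifier."""
    return [
        [clf(project(x, fs)) for clf, fs in zip(base_classifiers, feature_subsets)]
        for x in samples
    ]

# Hypothetical toy setup: 4 features split into k = 2 disjoint subsets.
feature_subsets = [[0, 1], [2, 3]]

# Stand-ins for trained base SVMs: each returns a signed score for its own subset.
base_classifiers = [
    lambda v: sum(v) - 1.0,
    lambda v: v[0] - v[1],
]

# Toy validation set with labels +1 / -1.
z_valid = [[0.9, 0.8, 0.7, 0.1], [0.1, 0.2, 0.2, 0.9]]
y_valid = [1, -1]

# Each validation sample is now described by the k base outputs and keeps its label.
meta_features = stack_outputs(z_valid, feature_subsets, base_classifiers)
combiner_training_set = list(zip(meta_features, y_valid))
```

The same `stack_outputs` call applied to the test samples would produce the input that the trained combiner classifies in the test phase.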

Novel Machine Learning Techniques for Micro-Array Data Classification 643

Fig. 3. Training the SVM Combiner in the Ensemble. Flow chart steps: Dataset (Validation set) → Get the *k* feature subsets (FS1, FS2, …, FSk) → Test SVM1, SVM2, …, SVMk using the feature subsets FS1, FS2, …, FSk of the validation set → Train the SVM Combiner using the outputs of the base classifiers on the validation dataset → End.

Fig. 4. Testing the ensemble with SVM fusion. Flow chart steps: Dataset (Test set) → Get the *k* feature subsets (FS1, FS2, …, FSk) → Test SVM1, SVM2, …, SVMk using the feature subsets FS1, FS2, …, FSk of the test set → Test the SVM Combiner using the outputs of the base classifiers on the test data → End.

#### **5. Data and experimental setup**

In this section, we describe the experiments conducted to test and evaluate the proposed *k*-means-SVM fusion ensemble on the Leukemia data set. Refer to (Ahmed et al., 2010) for more experiments on other data sets.

The Leukemia dataset is a benchmark micro-array dataset which consists of 72 samples, 7129 features and 2 classes (AML and ALL). The 72 samples consist of 47 samples of Acute Lymphoblastic Leukemia (ALL) and 25 samples of Acute Myeloblastic Leukemia (AML). The training and test samples in (Chang & Lin, 2001), (Golub et al., 1999) are merged then normalized as indicated in (Shevade & Keerthi, 2003).

The proposed *k*-means-SVM fusion ensemble is compared to a single SVM classifier as well as three different SVM ensembles: the *random-majority vote* ensemble, the *k-means-majority vote* ensemble and the *random-SVM fusion* ensemble. The *random-majority vote* ensemble uses the random subspace method to select feature subsets and distribute them among the base classifiers in the ensemble. A fixed fusing rule (majority vote) is used to combine the outputs of the base classifiers. In contrast, the *k-means-majority vote* ensemble uses feature subsets resulting from *k*-means clustering as described in section 4 but still uses majority voting for classifier fusion. Alternatively, the *random-SVM fusion* ensemble uses the random subspace method for feature subset selection and a trainable SVM classifier as a combination rule.

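#### **Given:**

• *Z*, a set of *N* crisp labelled samples *x*.

• *FS*: *(1 … n)*, a set containing all *n* features.

#### **Step 1: Choose feature subsets using** *k***-means clustering**

• *k*-means on dataset *ZTrain*

	- Initialize *k*, the number of clusters and the number of the base classifiers.
	- Cluster analysis on features using *ZTrain* using *k*-means algorithm.
	- Get *k* disjoint feature subsets *FS1*, *FS2*, … , *FSK*.

#### **Step 2: Train the Base Classifiers**

• Train the base classifiers using *ZTrain* such that each base classifier *SVMi* is trained with feature subset *FSi*.

#### **Step 3: Train the Combiner**

• Test every *SVM(1 … k)* base classifier using *ZValid* with the corresponding feature subset *FS(1 … k)*.

• Train the SVM(combiner) using the outputs of the *SVM(1 … k)* on the validation data set *ZValid*.

#### **Step 4: Test the ensemble**

• Test the ensemble using the Test data set *ZTest* as follows:

	- *ZTest* is passed through the *SVM(1 … k)* base classifiers.
	- *ZTest* with the outputs of the SVM base classifiers as features is given to the SVM(combiner).
	- SVM(combiner) classifies the samples of *ZTest*.

Fig. 5. Ensembles using SVM fusion.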
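Step 1 clusters the features rather than the samples: each feature is represented by its column of values over the training samples, and features with similar columns end up in the same subset. The sketch below illustrates this with a plain Lloyd-style *k*-means written from scratch. It is an illustration under assumptions, not the chapter's implementation: the deterministic initialization (first *k* feature columns as centroids), the fixed iteration count, the function names and the toy matrix are all hypothetical.

```python
from typing import List

def transpose(matrix: List[List[float]]) -> List[List[float]]:
    """Turn a samples-by-features matrix into a features-by-samples one."""
    return [list(col) for col in zip(*matrix)]

def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_feature_subsets(z_train: List[List[float]], k: int, n_iter: int = 20) -> List[List[int]]:
    """Cluster the feature columns of z_train into k disjoint index subsets."""
    columns = transpose(z_train)
    # Assumption: deterministic initialization with the first k feature columns.
    centroids = [columns[i][:] for i in range(k)]
    assignment = [0] * len(columns)
    for _ in range(n_iter):
        # Assignment step: each feature goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(col, centroids[c]))
            for col in columns
        ]
        # Update step: each centroid becomes the mean of its member columns.
        for c in range(k):
            members = [columns[j] for j, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return [[j for j, a in enumerate(assignment) if a == c] for c in range(k)]

# Toy training matrix: 3 samples x 4 features.
# Features 0 and 2 behave alike, and so do features 1 and 3.
z_train = [
    [1.0, 9.0, 1.1, 9.2],
    [2.0, 8.0, 2.1, 8.1],
    [3.0, 7.0, 2.9, 7.0],
]
subsets = kmeans_feature_subsets(z_train, k=2)
print(subsets)
```

Each returned list of feature indices would then define the input of one base classifier in Step 2.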

The suggested ensembles are tested for different numbers of feature subsets, i.e. different numbers of base classifiers. Experiments are repeated for numbers of feature subsets *k* = 2*n*-1 where *n* = 2, 3, 4, 5, 6. Results are compared using different measures of performance including accuracy, sensitivity, specificity and precision. For the sake of brevity we only present results based on accuracy and sensitivity.

Cross validation is used to obtain different training, test and validation sets. Since DNA microarray datasets are characterized by a very small number of samples, the training and validation sets overlap in 1/3 of the samples.

The usage of cross validation differs according to the ensemble model. For the ensembles that use a majority vote combiner, the dataset is divided into a training set and a test set. The training set is used to train the base classifiers, while the test set is used to test the base classifiers; the majority vote is then applied to their outputs. For tuning the parameters of the base classifiers, the training set is further split into a training set and a validation set on which the classifier is validated. Grid search over a range of values for the parameters is applied, and the parameter values that give the best performance on the validation set are chosen for the base classifiers.
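The fixed majority vote rule used by two of the compared ensembles can be sketched in a few lines. The function name and the toy predictions below are hypothetical; with two classes and an odd number of base classifiers, as in this toy case, no ties can occur.

```python
from collections import Counter
from typing import List

def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Fuse per-classifier label lists (aligned by sample) into one label per sample."""
    fused = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count for this sample.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Hypothetical outputs of three base classifiers on three test samples.
base_predictions = [
    ["ALL", "AML", "ALL"],
    ["ALL", "ALL", "AML"],
    ["AML", "AML", "ALL"],
]
fused = majority_vote(base_predictions)
print(fused)
```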

For the ensembles that use an SVM for fusion, the dataset is divided into a training set, a validation set and a test set using *k*-fold cross validation. The training set is used to train the base classifiers, then the validation set is used to test the base classifiers. The outputs of the base classifiers are then used in the training of the SVM combiner. The test set is then used to test the whole ensemble.

All experiments are performed using LibSVM (Chang & Lin, 2001) with 5-fold cross validation for training the base classifiers and the combiner. *K*-means is applied with values for *k* ranging from 3 to 63 with an increment of 2. Linear kernels are chosen for the SVM base classifiers as well as for the SVM combiner, as it was found in the literature that they are suitable for high-dimensional microarray datasets (Chang & Lin, 2001), (Bertoni et al., 2005). For each SVM with a linear kernel, the parameter *C* needs to be optimized. This is done using grid search with 5-fold cross validation.

We experimented with exponentially growing sequences of *C*, with exponents in the range of -15 to 15, to identify a good value for the parameter.
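As an illustration of such a grid, the sketch below generates exponentially growing candidates for *C* and picks the best one according to a scoring function. The base 2 and the exponent step of 2 are assumptions (the text only gives the exponent range), and `toy_score` is a hypothetical stand-in for the mean 5-fold cross-validation accuracy that a real run would compute with LibSVM.

```python
from typing import Callable, List

def c_grid(low_exp: int = -15, high_exp: int = 15, step: int = 2, base: float = 2.0) -> List[float]:
    """Exponentially growing candidate values base**low_exp, ..., base**high_exp."""
    return [base ** e for e in range(low_exp, high_exp + 1, step)]

def grid_search(candidates: List[float], score: Callable[[float], float]) -> float:
    """Return the candidate with the highest score (the first one wins ties)."""
    return max(candidates, key=score)

grid = c_grid()

def toy_score(c: float) -> float:
    # Hypothetical stand-in for cross-validated accuracy; it peaks at C = 0.5.
    return -abs(c - 0.5)

best_c = grid_search(grid, toy_score)
print(len(grid), best_c)
```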

#### **6. Results and discussions**

644 Bioinformatics – Trends and Methodologies


Figure 6 compares the accuracy of the *k-means-SVM fusion* ensemble to the *random-majority vote* ensemble, the *k-means-majority vote* ensemble and the *random-SVM fusion* ensemble for different numbers of feature subsets (i.e. different numbers of base classifiers, or ensemble sizes).

Fig. 6. Test accuracies of the four ensembles with respect to the number of feature subsets.

The *k*-means-SVM fusion ensemble appears to outperform the other models as the number of feature subsets grows (i.e. with an increased number of base classifiers). The accuracy of random-SVM fusion also increases with a higher number of feature subsets. However, it still has the lowest accuracy among the four ensembles.


On the other hand, for the *k-means-majority vote* ensemble, increasing the number of feature subsets results in a drop in accuracy, while for the *random-majority vote* ensemble the results are not affected by a change in the number of feature subsets.

Table 1 summarizes the best results obtained for each ensemble across the different numbers of feature subsets, in addition to those obtained using a single SVM classifier. The number of feature subsets at which the best results are obtained is given for each ensemble. It can be noticed that the ensembles that use a majority vote combiner work well only with a small number of feature subsets, while those that use an SVM classifier for fusion need a large number of feature subsets.

Results reveal that the *k-means-SVM fusion* ensemble outperforms the *k-means-majority vote* ensemble as well as the *random-SVM fusion* and *random-majority vote* ensembles. The *k-means-SVM fusion* ensemble also shows a high sensitivity relative to the other ensembles in the comparison.

Since for the leukemia dataset, both classes are patients, there is no *positive* or *negative* class. Accordingly, the sensitivity is calculated twice; at first, considering AML as the *positive* class and then considering ALL as the *positive* class. The average of both is then calculated.
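The averaged sensitivity described above can be written out as a short calculation. The function names and the toy labels are hypothetical, and the sketch assumes that every class occurs at least once among the true labels (otherwise the per-class sensitivity is undefined).

```python
from typing import List, Sequence

def sensitivity(y_true: List[str], y_pred: List[str], positive: str) -> float:
    """True positives divided by all samples whose true class is `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

def average_sensitivity(y_true: List[str], y_pred: List[str],
                        classes: Sequence[str] = ("AML", "ALL")) -> float:
    """Treat each class in turn as the positive class and average the results."""
    return sum(sensitivity(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical toy labels: 4 ALL samples (one misclassified) and 2 AML samples.
y_true = ["ALL", "ALL", "ALL", "ALL", "AML", "AML"]
y_pred = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
avg = average_sensitivity(y_true, y_pred)
print(avg)
```

In this toy case the sensitivity is 3/4 with ALL as the positive class and 2/2 with AML as the positive class, so the average is 0.875.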


| Classifier/Ensemble | Accuracy | Sensitivity (Average) |
|---|---|---|
| Single SVM classifier | 92.86 ± 15.97 | 81.68 |
| Random-Majority Vote (3) | 92.86 ± 15.97 | 90.00 |
| Random-SVM Fusion (41) | 87.14 ± 19.17 | 84.67 |
| *K*-Means-Majority Vote (3) | 92.86 ± 15.97 | 90.00 |
| ***K*-Means-SVM Fusion (63)** | **97.14 ± 3.92** | **96.89** |

Table 1. Best classification accuracy and sensitivity measures obtained by applying the ensembles and the single SVM classifier to the leukemia dataset. The number of feature subsets at which the best results are obtained is given in brackets.

Figures 7-10 illustrate for each ensemble the improvement of the combined model over the average performance of the base classifiers. The figures show the average accuracies of the base classifiers of each ensemble compared to the ensemble accuracies. In addition, the ratio of the ensemble accuracies to those of the base classifiers is depicted. The *k-means-SVM fusion* ensemble has the best ratio among the four ensembles.

Figure 7 shows the results for the *k-means-majority vote* ensemble. It can be noticed that the ensemble improves the performance of the base classifiers, but its accuracy drops with a higher number of feature subsets. So, it works better with a small number of feature subsets. Figure 8 demonstrates the performance of the *k-means-SVM fusion* ensemble. Clearly the ensemble enhances the performance of the base classifiers except when using 3 feature subsets. Unlike the *k*-means-majority vote ensemble, its performance does not drop with an increased number of feature subsets. The *k-means-SVM fusion* ensemble achieves the best accuracy among the four ensembles when using 63 feature subsets. Figure 9 summarizes the performance of the *random-majority vote* ensemble. It shows a slight improvement over the average performance of the base classifiers, resulting in a nearly constant behaviour. Figure 10 demonstrates that the *random-SVM fusion* ensemble does not work well when using a small number of feature subsets, but as the number of feature subsets increases, its performance improves and becomes better than the average performance of the base classifiers.
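The ratio plotted alongside the accuracies in Figures 7-10 can be computed as the ensemble accuracy divided by the mean accuracy of its base classifiers. The sketch below uses made-up accuracy values purely for illustration (they are not taken from the figures), and the function name is hypothetical.

```python
from typing import List

def improvement_ratio(ensemble_accuracy: float, base_accuracies: List[float]) -> float:
    """Ensemble accuracy relative to the average accuracy of its base classifiers."""
    mean_base = sum(base_accuracies) / len(base_accuracies)
    return ensemble_accuracy / mean_base

# Hypothetical accuracies in percent: three base classifiers and their ensemble.
ratio = improvement_ratio(90.0, [70.0, 75.0, 80.0])
print(ratio)
```

A ratio above 1 means the ensemble improves on the average base classifier; a ratio below 1 means it does worse.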

Fig. 7. Results of the k-means-majority vote ensemble.

Fig. 8. Results of the k-means-SVM fusion ensemble.

Fig. 9. Results of the random-majority vote ensemble.

Fig. 10. Results of random-SVM fusion ensemble.

As a general conclusion of the previous experiments we can state that ensembles with SVM classifiers as base classifiers generally improve the classification accuracy over single classifiers. Ensembles that use an SVM classifier for fusion outperform those that use majority vote as a combiner when using a reasonably large number of feature subsets and base classifiers.

According to the study on the leukemia dataset, the *k*-means-SVM fusion ensemble performs best among the four ensembles with regard to both accuracy and sensitivity. More results to confirm this conclusion are reported in (Ahmed et al., 2010).

#### **7. Conclusions and future directions**

This chapter presents a broad introduction to machine learning and focuses on the classification problem in bioinformatics. In particular we cover the main terminology from the pattern recognition, machine learning and data mining fields. We review the main models used for classification and elaborate on classifier testing and evaluation techniques. We devote special attention to SVMs, ensemble techniques and feature subset ensembles, as they are the basis of our proposed DNA micro-array data classification model. The proposed classification model exploits powerful machine learning models such as SVMs and ensemble methods coupled with feature subset selection. The proposed approach proves able to deal with the data challenges imposed by this application, which are mainly the huge number of features and the small sample size.

Results are shown on the leukemia dataset and compared to four different models. The study concludes that the use of ensembles is very fruitful in such applications. The way of distributing the features among subsets affects the performance of the ensemble. *K*-means is a systematic method that proved to be suitable for clustering the features into subsets, especially when used with an SVM classifier for combination. For the leukemia dataset, the *k*-means-SVM fusion ensemble performed best with respect to accuracy and sensitivity. The study confirms the importance of ensembles in bioinformatics applications and highlights that the coupling between the method of distributing the features among subsets and the combination method is crucial for obtaining good results.

Different methods can be investigated for distributing the features among subsets. Higher numbers of base classifiers and feature subsets can be experimented with. The time complexity of the proposed models needs to be calculated and assessed. The use of other combiners, especially classifiers, is worth investigating. In addition, the use of the proposed models can be extended to other data sets and other domains in the bioinformatics field.

#### **8. Acknowledgment**

This work was supported by DFG (German Research Society) grants SCHW 623/3-2 and SCHW 623/4-2.

#### **9. References**

Abe, S. (2005). *Support Vector Machines for Pattern Classification*, Springer, ISBN 1-85233-929-9.

Abohamad, W., Korayem, M. & Moustafa, K. (2010). Biclustering of DNA Microarray Data Using Artificial Immune System, *Proceedings of International Conference on Intelligent Systems Design and Applications (ISDA)*, pp. 1223-1228, Cairo, Egypt.

Ahmed, E., El-Gayar, N. & El-Azab, I.A. (2010). Support Vector Machine Ensembles Using Features Distribution among Subsets for Enhancing Microarray Data Classification, *Proceedings of International Conference of Systems and Design (ISDA)*, Cairo, Egypt, December, 2010.
Kittler, J., Hatef, M., Duin, R.P.W. & Matas, J. (1998). On Combining Classifiers. *IEEE Trans. on Pattern Analysis and Machine Intelligence*, Vol. 20, No. 3, pp. (226-239).

Krallinger, M., Erhardt, R.A. & Valencia, A. (2005). Text-mining approaches in molecular biology and biomedicine. *Drug Discovery Today*, Vol. 10, No. 6, pp. (439–45).

Kuncheva, L. I. (2004). *Combining Pattern Classifiers: Methods and Algorithms*. John Wiley Sons, Inc, ISBN 0-471-21078-1.

Lander, E.S. (1996). The new genomics global views of biology. *Science*, Vol. 274, No. (5287), pp. (536 – 539), (October 1996).

Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J. A., Armananzas, R., Santafe, G., Perez, A. & Robles, V. (2005). Machine learning in bioinformatics, *Briefings in bioinformatics*, Vol. 7, No. 1, pp. (86-112).

Lopez-Bigas, N. & Ouzounis, C. (2004). Genome-wide identification of genes likely to be involved in human genetic diseases, *Nucleic Acids Research*, Vol. 32, No. 10, pp. (3108 – 3114).

Mathé, C., Sagot, M-F and Schiex, T. (2002). Current methods of gene prediction, their strengths and weaknesses. *Nucleic Acids Research*. Vol. 30, No. 19, pp. (4103–4117).

(MCS series) Multiple Classifier Systems. Lecture Notes in Computer Science, *Springer Verlag*, Vols. 1857 (2000), 2096 (2001), 2364 (2002), 2709 (2003), 3077 (2004), 3541 (2005), 4472 (2007), 5519 (2009), 5997 (2010), 6713 (2011).

Mischel, P.S., Cloughesy, T. (2006). Using molecular information to guide brain tumor therapy, *Nat. Clin. Pract. Neurol.* Vol. 2, pp. (232–233).

Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., & et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression, *Nature*, Vol. 415, pp. (436 - 442).

Ratsch, G., Sonnenburg, S., Srinivasan, J., Witte, H., Müller, K.-R., Sommer, R.-J. & Scholkopf, B. (2007). Improving the Caenorhabditis elegans genome annotation using machine learning, *PLoS Computational Biology*, Vol. 3, No. 2.

Re, M. & Valentini, G. (2010). Prediction of Gene Function Using Ensembles of SVMs and Heterogeneous Data Sources. Applications of supervised and unsupervised ensemble methods, *Computational Intelligence Series*, Springer, Vol. 245, pp. (79-91).

Ritchie, M., White, B.C., Parker, J.S., Hahn, L.W. & Moore, J.H. (2003). Optimization of neural network architecture using genetic programming improves detection and modelling of gene-gene interactions in studies of human diseases, *BMC Bioinformatics*, Vol. 4, No. 28.

Roli, F. & Giacinto, G. (2002). *Design of Multiple Classifier Systems, HYBRID METHODS IN PATTERN RECOGNITION*, H Bunke and A Kandel (Eds.), World scientific.

Shevade, S. K. & Keerthi, S. S. (2003). A simple and efficient algorithm for gene selection using sparse logistic regression, *Bioinformatics*, Vol. 19, No. 17, pp. (2246 – 2253).

Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., Gaasenbeek, M., Angelo, M. & et al. (2002). Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning, *Nat. Med.* Vol. 8, pp. (68–74).

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing

Alpydin , E. (2004). *Introduction to Machine Learning*. The MIT Press, ISBN 0-262-01211-1.


http://www.kdnuggets.com/polls/2006/data\_mining\_methods.htm

Kitano, H. (2002). Systems biology: A brief overview, *Science*. Vol. 295, No. 5560, pp.(1662 – 1664).

Bellazi, R. & Zupan, B. (2008). Predictive data mining in clinical medicine: Current issues

Bernal, A., Crammer, K., Hatzigeorgiou, A. & Pereira, F., (2007). Global discriminative

Bertoni, A., Folgieri, R. & Valentini, G. (2005). Bio-molecular cancer prediction with random

Bower, J. & Bolouri, H. (2004). Computational Modeling of Genetic and Biochemical

Brent, M. & Guigo, R. (2004). Recent advances in gene structure prediction, *Current Opinion* 

Chang, C.-C. & Lin, C.-J. (2001). *LIBSVM: a library for support vector machines*. Available from

Duda, R. O., Hart, P. E. & Stork, D. G. (2nd ed). (2000). *Pattern Classification*, John Wiley &

Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D.(1998).Cluster analysis and display

Freyhult, E. (2007). *A Study in RNA Bioinformatics, Identication, Prediction and Analysis.* PhD

Goble, C. & Stevens, R. (5 August 2008). State of the nation in data integration for

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C. Gaasenbeek, M., Mesirov, JP., Coller, H.,

Handl, J., Kell, D. & Knowles, J. (2007). Multiobjective optimization in bioinformatics and

Ho, T. K. (1998).The random space method for constructing decision forests. *IEEE* 

Holloway, D., Kon, M. & DeLisi, C. (2007). Machine learning for regulatory analysis and

Kitano, H. (2002). Systems biology: A brief overview, *Science*. Vol. 295, No. 5560, pp.(1662 –

of genome-wide expression patterns, *in Proc. Natl. Acad. Sci, National Acad Sciences*,

bioinformatics. *Journal of Biomedical Informatics (in press)*, available on line at

Loh, M.L., Downing, J.R.,Caligiuri, M.A., Bloomfield, C.D. & Lander, E. (1999). Molecular classification of cancer: class discovery and class prediction by gene

computational biology, *IEEE/ACM Trans. Comput. Biol. Bioinformatics,* Vol. 4, No. 2,

*Transactions on Pattern Analysis and Machine Intelligence*, Vol. 20, No. 8, pp. (832–

transcription factor target prediction in yeast, *Systems and Synthetic Biology*, Vol. 1,

subspace ensembles of support vector machines, *Neurocomputing*.

*in Structural Biology.* Vol. 14, No. 3, pp.(264–272).

http://www.csie.ntu.edu.tw/\_cjlin/libsvm.

Vol. 95, No. 25., pp. (14 863–14 868), USA.

http://www.sciencedirect.com

pp. (279–292).

No. 1, pp. (25–46).

844).

1664).

thesis, ACTA Universitatis Upsaliensis Uppsala.

expression monitoring, *Science*, pp. (531 – 537).

KDnuggets , Polls, *Data Mining Methods* (Apr 2006) Available from:

http://www.kdnuggets.com/polls/2006/data\_mining\_methods.htm

Hand, D., Mannila, H. & Smyth, P. (2001). *Principles of Data Mining*, MIT Press.

and guidelines, *INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS*, Vol.

learning for higher-accuracy computational gene prediction, *PLoS Computational* 

Alpydin , E. (2004). *Introduction to Machine Learning*. The MIT Press, ISBN 0-262-01211-1. Baldi,P.& Brunak,S. (2 ed ). (2001). *Bioinformatics The Machine Learning Approach*. MIT Press,

ISBN 0 – 262 – 02506 – X.

7, pp. (81 – 97).

*Biology*, Vol. 3, No. 3.

Networks, *MIT Press*.

Sons.


**Part 9** 

**Next Generation Sequencing** 

maps: Methods and application to hematopoietic differentiation, *in Proc. Natl. Acad. Sci. USA, National Acad Sciences*, Vol. 96, No. 6, pp. (2907–2912).


## **Part 9**

**Next Generation Sequencing** 


**29** 


### **Deep Sequencing Data Analysis: Challenges and Solutions**

Ofer Isakov and Noam Shomron *Sackler Faculty of Medicine,* 

*Tel Aviv University, Israel* 

#### **1. Introduction**

Ultra high throughput sequencing, also known as deep sequencing or Next Generation Sequencing (NGS), is revolutionizing the study of human genetics and has immense clinical implications. It has reduced the cost and increased the throughput of genomic sequencing by more than three orders of magnitude in just a few years, a trend that is expected to accelerate in the near future (Metzker, 2010). Using deep sequencing it is now possible, for example, to discover novel disease-causing mutations (Ley et al., 2008) and to detect traces of pathogenic microorganisms (Isakov et al., 2011). For the first time, research fields such as personalized medicine are becoming tangible at the genomic level, given advances in deep sequencing data integration.

The amount of data produced by a single ultra high throughput sequencing run is often tremendous and can reach hundreds of millions of reads of various lengths per experiment (Mardis, 2008). Storing, processing, querying, parsing, analyzing and interpreting such an amount of data is a significant task that holds many obstacles and challenges (Koboldt et al., 2010). In this chapter we address some of the possibilities, potentials and questions raised during ultra high throughput sequencing data analysis. We focus mainly on common pre-analysis concepts and on crucial advanced considerations for alignment, assembly and variation detection. Currently, the deep sequencing user is faced with an abundance of data analysis tools, both publicly and commercially available. For each of the aforementioned analysis types, we point out the aspects to consider when choosing a tool, and emphasize the relevant challenges and possible limitations, in order to assist the user in picking the most suitable one. Since deep sequencing data analysis is a rapidly evolving field, we focus on the fundamental concepts of the analysis process and its challenges, so that this chapter remains relevant as additional software is published.

The first part of the chapter gives a brief overview of the current leading deep sequencing technologies, with special attention to their features, strengths and possible drawbacks with regard to the preliminary questions one might ask when using ultra high throughput sequencing. The second part introduces pre-analysis processes: common quality control and assurance methods that alleviate biases derived from deep sequencing and improve the results of any downstream analysis. In the third part, we go over the different aspects of post-sequencing analysis, specifically deep sequencing data alignment, assembly and variant detection. For each of these we cover leading methods and tools, quality evaluation and filtration, and address the requirements, capabilities and limitations of the tools. The section on variation detection covers common variant detection considerations, variant-specific challenges and currently available solutions.

#### **2. Sequencing technologies**

Sequencing technologies are evolving rapidly, with an overwhelming increase in efficiency and throughput (Mardis, 2008). This rapid rate of change is accompanied by a variety of sequencing platforms that share many features but also differ in important ways. Without going into detail on the technology underlying each platform, we will specify the advantages and limitations, both general and platform-specific, that are relevant to deep sequencing experiment design. For this purpose, we refer to the leading commercially available platforms produced by Roche/454 (Margulies et al., 2005a), Illumina/Solexa (Bentley et al., 2008), Life/APG (SOLiD) (McKernan et al., 2009) and Pacific Biosciences (Eid et al., 2009).

The initial step in the sequencing process is random fragmentation of the nucleotide sequence of interest, in order to increase throughput by simultaneously sequencing millions of fragments. These template fragments can then either undergo clonal amplification, in which they are ligated with adapters and amplified using common PCR (Polymerase Chain Reaction) primers (Roche; Illumina; Life), or they can be used as the sequencing templates themselves (single molecule templates; Pacific Biosciences). Clonally amplified template preparation requires a higher amount of initial DNA material. Since this technique relies on PCR amplification, errors might be introduced into the target before the sequencing process begins. The number of introduced errors is related to the fidelity of the polymerase used in the reaction (Chan, 2009). These background errors could be mistaken for actual sequence variants in the downstream analysis. PCR might also result in amplification bias, misrepresenting areas of high GC content. In a recent study, for example, PCR-introduced expression biases for GC-rich chromosomes required additional assessment and hampered the uniformity of the results (Chiu et al., 2010). Simultaneous sequencing of clonally amplified templates is further complicated by differing extension rates that cause asynchronous sequencing (phasing), resulting in higher background noise.

Single molecule template sequencing (Schadt, Turner, & Kasarskis, 2010) does not require PCR amplification and thus circumvents the amplification and clonal sequencing biases derived from it. This makes it an appropriate tool for quantification experiments (e.g. RNA-seq, ChIP-seq) and for cases in which the initial sample DNA content is scarce. Because sequencing is performed on a single molecule and sequences are inferred from extremely weak signals, the correcting effect of simultaneously sequencing many copies of the same template is lost, resulting in a higher error rate (Schadt et al., 2010). Therefore a higher sequencing fidelity is required (Metzker, 2010).

In addition to the aforementioned general biases associated with template preparation and sequencing method, one needs to consider the inherent benefits and shortcomings of each sequencing technology. Pyrosequencing, for example, employed by Roche 454's GS FLX platform, generates long reads (~400 nts) and presents relatively unbiased coverage, enhancing de novo genome assembly and improving alignment capabilities. This makes it an appropriate tool for the discovery of SNPs and structural variations, with low false positive rates (Margulies et al., 2005a; Nothnagel et al., 2011). However, the technology's susceptibility to insertion and deletion errors and its higher rate of homopolymer (i.e. a contiguous run of the same base) sequencing errors should be considered when performing variation-oriented research (Chan, 2009). Current reversible termination (Illumina's Genome Analyzer or HiSeq 2000) and sequencing by ligation (Life's SOLiD) technologies produce shorter reads (<200 nts) but at a much higher throughput. They are considered optimal for the detection of small-scale variants (e.g. SNPs and indels), because massive read overlap and high coverage give very high detection resolution. However, the ambiguous mapping and complicated assembly inherent to short reads can result in higher false positive rates in variant discovery (Nothnagel et al., 2011); this could be alleviated by higher throughput and by paired-end sequencing (i.e. sequencing both ends of a fragment template) (Medvedev et al., 2009; Metzker, 2010). Sequencing by ligation technology, employed by Life's SOLiD, reads the colors of fluorescently marked ligated primers and converts them into the template sequence. SOLiD is less susceptible to phasing errors, and the unique conversion of color to sequence provides inherent error correction and thus a more accurate SNP detection process. However, this reduced error rate requires a reference genome in the color conversion process (Kircher & Kelso, 2010). We also note that the error rate increases towards the end of a sequenced read across all platforms (Dohm et al., 2008), due to reduced enzyme efficiency, loss of enzymes, increased phasing or incomplete dye removal. The different attributes of the platforms can result in significantly different output data and performance, and it has been demonstrated that combining more than one platform is potentially more cost-effective and could yield higher fidelity and accuracy (Dalloul et al., 2010; Nothnagel et al., 2011).

We will now discuss how to alleviate some of these inherent and general difficulties using pre-analysis processing.

#### **3. Pre-analysis processing**

In this section we discuss the processing performed on deep sequencing output data prior to the specific experimental analysis. The examples above illustrate the vast cross-platform differences that could affect the downstream analysis and thus the biological conclusions derived from it. These differences come on top of the inherent bias in deep sequencing experiments (Dohm et al., 2008; Schwartz et al., 2011). To reduce these possibly confounding effects, platform manufacturers and developers provide the end-user with a sequencing quality scale for both automated and recommended manual quality-based data filtration and refinement (Bentley et al., 2008; Harris et al., 2008; Margulies et al., 2005a; McKernan et al., 2009). We suggest quality confirmation methods for the text-based output that end-users face after a sequencing run, and discuss common pre-analysis processing steps that are necessary to ensure data validity and proper utilization. For each platform's inherent quality assurance and control measures, one should consult that platform's technical support and annotation.

The most common initial output format is either a sequence FASTA file accompanied by a numerical quality QUAL file, describing the per-base probability of incorrect sequencing based on the PHRED quality score (Ewing and Green, 1998; Ewing et al., 1998), or the FASTQ format (Cock et al., 2010), which contains sequences coupled with their qualities stored as ASCII characters. Currently, Sanger FASTQ files use ASCII 33–126 to encode PHRED qualities from 0 to 93 (i.e. PHRED scores with an ASCII offset of 33), corresponding to error probabilities between $10^{0}$ and $10^{-9.3}$. Up until the Genome Analyzer v1.3, Illumina utilized a different scoring scale in their sequencing output, described in (Cock et al., 2010). Currently Illumina encodes PHRED scores with an ASCII offset of 64, and so can hold PHRED scores from 0 to 62 (ASCII 64–126). Life's SOLiD produces a color-based FASTQ file (CSFASTQ) that uses the digits 0-3 to mark the sequenced color, the processing of which we will not cover in this section. Though these different scoring methods can cause misinterpretation and confusion, they can be easily converted and conformed (Cock et al., 2009; Goto et al., 2010; Holland et al., 2008; Stajich et al., 2002). Most current analysis tools are able to handle both scoring methods, though some require specific parameters to be set for each. When employing these tools, one should make sure that the appropriate quality scale is specified.
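The relationship between the ASCII characters in a FASTQ quality line, the PHRED score and the error probability can be illustrated with a minimal Python sketch. The function and constant names below are our own and are not taken from any of the cited toolkits. The sketch assumes PHRED-scaled qualities with an offset of either 33 or 64, and it does not handle the older Illumina scoring scale mentioned above.

```python
from typing import List

# Assumed ASCII offsets, as described in the text above.
SANGER_OFFSET = 33    # Sanger FASTQ: PHRED 0-93 stored as ASCII 33-126
ILLUMINA_OFFSET = 64  # Illumina PHRED encoding: PHRED 0-62 stored as ASCII 64-126


def decode_phred(quality: str, offset: int = SANGER_OFFSET) -> List[int]:
    """Convert an ASCII quality string into integer PHRED scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(q < 0 for q in scores):
        # A negative PHRED score usually means the wrong offset was chosen.
        raise ValueError("negative score: check the ASCII offset")
    return scores


def error_probability(q: int) -> float:
    """PHRED definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def reencode(quality: str, src: int, dst: int) -> str:
    """Re-encode a quality string from one ASCII offset to another."""
    return "".join(chr(ord(ch) - src + dst) for ch in quality)
```

For instance, the character `I` decodes to a PHRED score of 40 under an offset of 33 (an error probability of about 1 in 10,000), whereas the same score is stored as `h` under an offset of 64. Raising an error on negative scores is a deliberate choice here: it catches the common mistake of decoding offset-33 data with an offset of 64.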

Quality control of deep sequencing data refers to an overview of the base and quality distribution between lanes, tiles and cycles, and to correlating the initial sequence data with the expected length, GC content, ambiguous bases, sequence complexity and the distribution of alignment locations, all of which can hold information regarding possible sequencing bias, contamination or artifacts. Platform-specific quality control tools (Cox et al., 2010; Dolan and Denver, 2008; Martinez-Alcantara et al., 2009) and more general quality assessment software (Dai et al., 2010; Schmieder and Edwards, 2011) can help circumvent such biases, both by drawing attention to irregularities through textual and graphical data representation and by removing low quality or aberrant sequences prior to the downstream analysis. The need for careful quality control is exemplified by deep sequencing data with a tile-specific A base bias, leading to over-representation of that base in the sequences derived from the tile. When searching for rare sequence variants, this should be taken into account whenever sequences supporting an A variant are derived from the affected tile. A more common example is sequence duplication (Gomez-Alvarez et al., 2009), usually an artifact of PCR amplification and other library preparation processes, which causes over-representation of certain sequences. This creates a skewed coverage distribution that may subsequently bias the error model, substantially increase the number of false-positive SNP discoveries, and tilt expression and metagenomic analysis results. Available quality control software allows the user to remove these duplicates completely (*FASTX - toolkit*; Li et al., 2009) or to mark them for consideration in downstream analysis (*PICARD*). Recently, various algorithms utilizing suffix tree data structures were developed for sequencing error correction (Kelley et al., 2010; Zhao et al., 2010).
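Two of the checks described above, GC content and sequence duplication, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and does not describe how the cited tools work: duplicates are defined here as exactly identical read sequences, whereas dedicated software may use other criteria, such as alignment position. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (N and other codes ignored)."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def duplication_summary(reads: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
    """Collapse exact duplicates.

    Returns the unique reads in first-seen order, together with the
    sequences that occur more than once and their counts, so that they
    can be either removed or merely flagged for downstream analysis.
    """
    counts = Counter(reads)  # preserves insertion order in Python 3.7+
    unique = list(counts)
    duplicated = {seq: n for seq, n in counts.items() if n > 1}
    return unique, duplicated
```

Returning both the collapsed read set and the duplicate counts mirrors the two options mentioned in the text: complete removal, or marking for later consideration.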

A common procedure in the pre-analysis process, following initial quality control and prior to sequence duplication removal, is the compulsory tag/adapter removal (Lassmann et al., 2009; Schmieder et al., 2010) and optional quality trimming. Tags are used during the library preparation phase for amplification or differentiation processes (e.g. multiplexing; Galan et al., 2010). If they are sequenced, they can profoundly affect the downstream analysis unless they are removed (clipping). The clipping process removes any tag remnants from the sequence reads, ridding the data of reads composed mainly or even solely of the tags. The user must set the minimal read length to be retained (according to the sequenced sample and experimental question) and consider possible sequence similarities between the sample and the adapters. Trimming refers to the removal of sequence from either the 5' or the 3' end of a read where either the sequence complexity or the quality does not pass user settings. It is often used for poly-A or poly-T removal, or for removal of bases with significantly lower, bias-introducing quality scores. Unlike clipping, which is mandatory for valid downstream analysis, trimming is only recommended, to improve accuracy and performance in subsequent analysis steps such as alignment and assembly. Following both clipping and trimming, the researcher may review the size distribution of the sequence data and verify concordance with the experimental context. For example, when performing microRNA sequencing experiments, one would expect the sequences to be approximately 20-24 nts in length. If the majority of the data deviates from this range, a more careful examination of the information is in order and library preparation bias should be considered.
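The clipping and trimming steps described above can be sketched as follows. This is a simplified illustration rather than the method of any particular tool: it only handles exact adapter matches at the 3' end, and the function names, the adapter sequence and the default thresholds (`min_overlap`, `min_q`, `min_len`) are all assumptions chosen for the example.

```python
def clip_adapter(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adapter (full, or a partial prefix at the read end)."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for the longest adapter prefix that is a suffix of the read.
        for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_3prime(seq, scores, min_q=20):
    """Drop bases from the 3' end until one reaches the quality threshold."""
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], scores[:end]


def preprocess(seq, qual, adapter, offset=64, min_q=20, min_len=15):
    """Clip, then trim; return None if the read ends up shorter than min_len."""
    seq, qual = clip_adapter(seq, qual, adapter)
    scores = [ord(ch) - offset for ch in qual]
    seq, scores = trim_3prime(seq, scores, min_q)
    return (seq, scores) if len(seq) >= min_len else None


if __name__ == "__main__":
    adapter = "TGGAATTCTCGG"  # hypothetical adapter sequence
    read = "ACGTACGTACGTACGTACGTAC" + adapter[:8]
    print(preprocess(read, "h" * len(read), adapter))
```

The `min_len` cut-off corresponds to the minimal read length the user must set. A read consisting almost entirely of adapter is discarded rather than passed on as a very short fragment.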

We urge the user to consider the sequencing data in the appropriate experimental context and utilize the aforementioned quality control and assurance methods prior to the downstream analysis to increase the experimental validity and accuracy and to ensure better, more reliable results.

#### **4. Data analysis pathways**

In the previous sections, we covered common deep sequencing data considerations and refinement steps, which are crucial and beneficial for all types of downstream analysis. In this section, we will go over the common data analysis pathways and possibilities, covering their appropriate utilization and the benefits and limitations of each pathway, and familiarizing the user with some of the commonly available analysis tools.

#### **4.1 Alignment**


Most of the analysis pathways specified below involve an initial step of mapping the deep sequencing reads against a reference genome of either the sequenced species or a related organism with sufficient genetic resemblance. This step presents a computational challenge due to the sheer number of short reads produced in deep sequencing experiments. It is further complicated by nucleotide and structural variance, sequencing errors, RNA editing and epigenetic modifications. When deep sequencing was first introduced, established early-generation sequence alignment tools (Altschul et al., 1990; Kent, 2002), better suited to querying a limited number of sequences, were less appropriate for mapping the millions of short sequence fragments produced by high-throughput sequencing (Trapnell and Salzberg, 2009), so novel alignment algorithms and tools had to be designed specifically for this purpose. Current short read alignment tools (Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2008; Li et al., 2009; Lin et al., 2008; *Novoalign*) utilize various heuristic techniques to align millions of short sequences within an acceptable time (Flicek and Birney, 2009). This section will not cover the underlying algorithms of each tool (Li and Homer, 2010). Instead, we will address a few essential features to be considered when initiating data analysis and alignment.

When choosing an alignment tool, one needs to consider the memory and time requirements and limitations and the appropriateness of the tool to the exploratory question at hand. Some important features to be considered include:

**Quality utilization and control** - As mentioned before, sequencing quality provides the user with an initial assessment of the data. Some alignment tools utilize these quality scores (Langmead et al., 2009; Li et al., 2008; *Novoalign*), and such use has been shown to greatly improve mapping performance (Frith et al., 2010; Li and Homer, 2010). Most common alignment programs generate their output in the Sequence Alignment/Map (SAM) format (Li et al., 2009), which is supported by a multitude of downstream analysis tools. This common format provides users with a simple and flexible common ground to evaluate alignment results and easily extract and utilize data for further analysis. Like the sequencing output, the alignment output contains a PHRED-based quality score for each of the aligned reads, describing the probability of per-base false alignment. The combination of this quality score with other alignment parameters, such as mismatches, could and should be further assessed using specialized tools (Lassmann et al., 2011) in order to characterize mapped and unmapped reads for potential alignment improvement. These alignment quality scores can be re-assessed using currently available tools (McKenna et al., 2010; *Novoalign*), so that they better denote the probability of a mismatch between the aligned base and the reference sequence. This quality recalibration takes into account the given base and its quality score, the position within the read and the adjacent nucleotides to account for sequencing chemistry biases (Li, Li, et al., 2009), and was shown to reduce the effect of sequencing technology derived biases and improve overall variant detection fidelity (DePristo et al., 2011).
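To show how the SAM format lends itself to simple downstream processing, the sketch below reads the mandatory tab-separated columns of an alignment record and keeps only confidently mapped reads. In SAM, the fifth column (MAPQ) is a PHRED-scaled estimate of the probability that the mapping position is wrong, bit 0x4 of the FLAG column marks an unmapped read, and a MAPQ of 255 means the value is unavailable. The function names, the `min_mapq` threshold of 20 and the choice to skip records with MAPQ 255 are assumptions made for this example.

```python
def parse_sam_line(line):
    """Split one SAM alignment record into its mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
    }


def is_mapped(record):
    """FLAG bit 0x4 is set when the read is unmapped."""
    return not record["flag"] & 0x4


def mapq_error(mapq):
    """Probability that the mapping position is wrong, from a PHRED-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def confident_alignments(lines, min_mapq=20):
    """Yield mapped records whose MAPQ is available and at least min_mapq."""
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        record = parse_sam_line(line)
        if is_mapped(record) and record["mapq"] != 255 and record["mapq"] >= min_mapq:
            yield record
```

For production work an established SAM/BAM library is preferable to hand-written parsing. The point here is only that the format exposes mapping quality in a directly usable form.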

**Gapped alignment** – An important feature to be mindful of when choosing an alignment tool is whether the tool supports gapped alignment. Since gapped alignment only mildly increases alignment sensitivity, a supporting tool is not essential for many general purposes. However, it is especially crucial for variant calling, specifically for the detection of insertions and deletions (indels) (Krawitz et al., 2010), and it is highly recommended to choose a tool that implements gapped alignment (Li and Durbin, 2009; Li et al., 2008; *Novoalign*) when embarking on variant detection experiments or when targeting known indel-abundant areas.

**Mismatches and Gap penalties** – Most alignment tools allow the user to set the number of allowed mismatches between the read and a reference location, as well as the scoring scale for gap opening and extension. Allowing more mismatches results in a higher proportion of mapped reads, but at the cost of increased ambiguity and reduced confidence in these alignments. The mismatch allowance should be set with the specific experiment in mind. For example, when performing microRNA expression profiling, one will want an accurate estimate of the abundance of each microRNA, and should allow few mismatches, if any. In variant calling experiments, however, the user should consider the expected size range of the variants before setting the allowed mismatch and gap penalty parameters (e.g., if one aims to find a >5 nt long deletion, the mismatch limitation should allow it).

**Multiple mapping -** In theory, unique alignment, i.e. mapping a read to a single unique locus on the reference genome, is expected for most reads longer than 30 nts when aligning against a large, human-scale reference. In practice, a portion of the reads will remain unmapped due to contaminant origin or sequencing errors or, more commonly, will map ambiguously to several different locations (multiple mapping) due to sequence homology and repetitiveness. Different alignment tools flag these multiply mapped reads and give the user the option either to assign them randomly to one locus (Li and Durbin, 2009) or simply to output all of them (*Novoalign*). Researchers may choose to incorporate only uniquely mapped reads into their downstream analysis, or to set a maximal number of different mapping locations for incorporated reads. Discarding multiply mapped reads results in the loss of a substantial portion of the data, with potentially crucial effects on the subsequent analysis. Currently, there are several approaches to the allocation of these multiply mapped reads. One method is to count each read as if it originated from each of its mapped loci, potentially over-estimating the expression or coverage of some, since the same read cannot have originated from more than one locus. Another method is to divide each read count between all its mapped loci, adding a small equal portion to each. This could have the opposite effect of under-estimating expression and coverage, especially for low-complexity loci. Several methods utilize heuristics for dividing these reads amongst their mapped loci according to the uniquely mapped reads in those regions (Hashimoto et al., 2009; Mortazavi et al., 2008). A fairly novel approach utilizes probabilistic models, such as maximum likelihood, to compute the most likely origin of each read, greatly improving the results of quantitative deep sequencing experiments and differential expression analysis (Paşaniuc et al., 2011).
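The first three allocation approaches can be compared on a toy example. The sketch below is an illustration under simplifying assumptions, not a reimplementation of the cited methods: `read_hits` is a hypothetical mapping from read names to the loci they align to, and the "unique_weighted" strategy splits each multiply mapped read in proportion to the uniquely mapped read counts of its loci, falling back to an equal split when none of them has unique reads.

```python
from collections import defaultdict


def allocate(read_hits, strategy="unique_weighted"):
    """Return per-locus read counts under one multi-read allocation strategy."""
    # First pass: count uniquely mapped reads per locus.
    unique = defaultdict(float)
    for loci in read_hits.values():
        if len(loci) == 1:
            unique[loci[0]] += 1

    counts = defaultdict(float)
    for loci in read_hits.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
            continue
        if strategy == "count_all":
            # Each read counts fully at every locus (may over-estimate).
            weights = [1.0] * len(loci)
        elif strategy == "equal":
            # Each read is divided equally (may under-estimate).
            weights = [1 / len(loci)] * len(loci)
        elif strategy == "unique_weighted":
            total = sum(unique[locus] for locus in loci)
            if total:
                weights = [unique[locus] / total for locus in loci]
            else:
                weights = [1 / len(loci)] * len(loci)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for locus, weight in zip(loci, weights):
            counts[locus] += weight
    return dict(counts)


if __name__ == "__main__":
    hits = {"r1": ["A"], "r2": ["A"], "r3": ["A"], "r4": ["B"], "r5": ["A", "B"]}
    for name in ("count_all", "equal", "unique_weighted"):
        print(name, allocate(hits, name))
```

With three unique reads at locus A and one at locus B, the shared read adds a full count to both loci under "count_all", half a count to each under "equal", and 0.75 and 0.25 respectively under "unique_weighted".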

Since each parameter can greatly affect various performance attributes, considering the aforementioned features is crucial when initiating deep sequencing data alignment. The user should always keep the alignment tool's inherent limitations in mind and set parameters according to the experiment at hand and the expected downstream analysis, picking an appropriate tool and tuning the necessary features for optimal alignment results.

#### **4.2 Assembly**


Assembly refers to the process of piecing together short DNA/RNA sequences into longer ones (e.g. contigs), which are then grouped to form scaffolds in order to computationally reconstruct a sample's genetic component. When the assembly process is performed with the assistance of a reference genome, it is referred to as mapping assembly; if no reference is available, it is called *de novo* assembly. The original computational assembly tools were designed to use the 800 base pair long sequences of capillary-based sequencing to deduce the original full sequence through examination of overlapping segments. Deep sequencing data presents a more complex assembly problem due to the larger number of sequences, which are also significantly shorter. Though it adds complexity to the process, this significant increase in throughput enables the successful *de novo* assembly of whole mammalian genomes, as shown in (Li, Fan, et al., 2010; Li, Zhu, et al., 2010). Sequencing errors, uneven genome coverage and reads too short to be informative in repeated regions required a new breed of assembly tools designed specifically for short reads (Butler et al., 2008; Chaisson and Pevzner, 2008; Dohm et al., 2007; Jeck et al., 2007; Li, Zhu, et al., 2010; Margulies et al., 2005b; Simpson et al., 2009; Treangen et al., 2011; Zerbino and Birney, 2008). These tools mainly rely on two algorithms, and differ mostly in the way they deal with sequencing errors, inconsistencies and sequence repeats. Since the tools used today could be either deprecated or significantly changed in the near future, we will not address the advantages and disadvantages of each specific tool. We will, however, cover some of the more general inherent challenges of deep sequencing data assembly and recommended optimization methods.

**Assembly Algorithms** - Currently there are two main models for deep sequencing data assembly: Overlap-Layout-Consensus (OLC) (Myers, 1995), which calculates overlaps by (computationally expensive) pairwise alignments, and de Bruijn graph-based (DBG) assembly, which creates a shared k-mer dictionary for the assembly process. The value of k is often set by the user, and it is recommended that it be large enough that most overlaps are true and do not occur by chance, yet small enough to allow overlap between related sequences. Since comprehensive reviews are available on these algorithms (Miller et al., 2010), we will focus more on specific algorithm-related considerations for tool selection. A recent overview comparing the performance of a variety of assembly tools under different conditions (Zhang et al., 2011) recommended the use of OLC-based assemblers (Hernandez et al., 2008; Margulies et al., 2005b) for small-scale (e.g. microorganism) genome assemblies, while reserving the use of the less computationally demanding DBG-based tools for the assembly of large (eukaryote) genomes (Butler et al., 2008; Li, Zhu, et al., 2010; Simpson et al., 2009; Zerbino and Birney, 2008). Another consideration is the read size, with OLC being most appropriate for a limited number of fairly long reads (~100-800 bp) and DBG more suited to the assembly of millions of short reads (25-100 bp) (Miller et al., 2010). One should note that the specific heuristics implemented by DBG-based tools reduce CPU demand, but at the cost of higher sensitivity to sequencing errors, which can result in a much higher memory requirement. We therefore urge the user to apply stricter quality assessment and filtration when embarking on DBG-based assembly. We also note that some assembly challenges, such as identical repeat regions longer than the sequenced read length, cannot be overcome by computational and algorithmic improvements and must be alleviated by technical means such as longer reads or paired-end sequencing (Cahill et al., 2010).
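The k-mer dictionary underlying DBG-based assembly, and its sensitivity to sequencing errors, can be illustrated with a toy construction. This is a minimal sketch under simplifying assumptions (no reverse complements, no error correction, no graph traversal), and the function names are hypothetical. Nodes are (k-1)-mers and each k-mer observed in a read adds an edge from its prefix to its suffix.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to its successor (k-1)-mers with edge counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return {node: dict(successors) for node, successors in graph.items()}


def node_count(graph):
    """Number of distinct (k-1)-mers appearing as a source or a target."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    return len(nodes)


if __name__ == "__main__":
    clean = ["ACGTAC", "CGTACG"]
    noisy = clean + ["ACGAAC"]  # same read with one substituted base
    print(node_count(build_de_bruijn(clean, 3)))
    print(node_count(build_de_bruijn(noisy, 3)))
```

In this example the two error-free reads share all their k-mers, so the graph stays small, whereas the single erroneous read introduces k-mers seen nowhere else and therefore new nodes. On real data this effect is what drives the memory requirement up.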

**Quality Assessment** - An assembly's quality is measured by its contiguity, its cumulative size and its accuracy. Contiguity is assessed using length statistics such as the maximal and average contig and scaffold lengths, the combined total length and the N50 (the length of the smallest contig in the set that contains the fewest (largest) contigs whose combined length represents at least 50% of the assembly (Miller et al., 2010)). An assembly's accuracy is more difficult to assess, and external data is usually needed to reveal both misassembly (e.g. sequences that are inaccurately joined) and per-base inaccuracy (e.g. contigs with nucleotide mismatches). One way to estimate fidelity is to use paired-end reads, realigning them against the assembled contigs to reveal discrepancies in insert size, which probably indicate incorrect assembly. When reference sequences are available, they should be used for further validation of the assembled contigs, matching sequences and marking possible mismatches and chimeras (non-related sequences assembled into one contig). If no reference sequence is available, it has been shown that the sequence of an available closely related organism (e.g. comparative assembly (Pop et al., 2004)) can be used for the same purpose and for contig adjacency assessment (Gnerre, et al., 2009; Husemann and Stoye, 2010; Meader et al., 2010). A crucial aspect of assembly quality assurance is the sequence quality. Erroneous sequence reads result in higher computer memory requirements (especially in DBG-based tools (Miller et al., 2010)) and either no assembly output or inaccurate contigs. As part of the assembly-related quality assurance, it is recommended to discard all reads with ambiguous bases (e.g. N) and reads composed entirely of homopolymer sequences to alleviate this increase in computational demand (Paszkiewicz and Studholme, 2010). It is also good practice to trim low-quality bases from read edges and, of course, to remove adapters prior to assembly.
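The contiguity statistics mentioned above are straightforward to compute from a list of contig lengths. The following sketch follows the N50 definition quoted from Miller et al. (2010); the function names and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Length of the contig at which the running total of the largest
    contigs first reaches at least half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


def contiguity_stats(lengths):
    """Basic length statistics used to describe assembly contiguity."""
    return {
        "count": len(lengths),
        "total": sum(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths in kb
    print(contiguity_stats(contigs))
```

For these lengths the assembly totals 290, and the two largest contigs (80 and 70) already sum to 150, which is at least half of the total, so the N50 is 70. Note that N50 describes contiguity only and says nothing about whether the contigs are correct.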

Assembly represents one of the more challenging computational tasks at present, and it is further complicated when performed on deep sequencing data. The general considerations mentioned in this chapter will help the user both to better understand the challenges inherent in the sequence data and to match the algorithm underlying a selected tool with the data at hand and the assembly goals. Moreover, assembly quality can be better assessed using the aforementioned parameters, such as N50 and fidelity, in order to compare the performance of both existing and future assembly tools.

#### **4.3 Variant calling**

Variant calling refers to the identification of single nucleotide polymorphisms (SNPs), insertions and deletions (indels), copy number variations (CNVs) and other types of structural variations (e.g., inversions and translocations) in a sequenced sample (Durbin et al., 2010).

Detection of these variants from deep sequencing data requires, in most cases, both a reference genetic sequence to compare the sequence data against (Li, Li, et al., 2009) and specialized variant calling software that utilizes probabilistic methods for correctly inferring variants. The process is complicated by areas of low coverage, sequencing errors, misalignment caused by either low complexity and repeat regions or adjacent variants, and library preparation biases (e.g., PCR duplicates) (Chan, 2009). Variant calling depends on an efficient combination of an accurate alignment and a sophisticated inference of variants from it. Since alignment optimization was already discussed in a previous section, this section focuses on aspects of variant deduction. We will cover the basic common challenges and difficulties, both general and specific to each variant type, present leading bioinformatic tools and databases and their contributions to the field, and provide the user with critical considerations and solutions for some of the aforementioned challenges.

After initial alignment, certain factors can critically alter the results of variant detection. One should consider them prior to downstream analysis and implement the appropriate modifications if necessary.

**Depth of coverage –** Previous studies demonstrated a positive correlation between variant calling sensitivity and increased read depth (Krawitz et al., 2010). Depth can be increased by reducing the size of the selected or enriched target region, by performing a higher number of sequencing cycles to produce longer reads that cover the target region, or simply by assigning more sequencing lanes. Each method has its benefits and drawbacks. For example, assigning an additional lane to sequence the same sample requires a higher financial investment but allows better noise filtration and recognition of sequencing errors. Targeting a specific region increases the coverage and sensitivity at the selected segment, but at the cost of information loss in the areas outside it. After the sequencing process is complete, upper and lower depth thresholds should be applied to the sequencing data before variant calling is performed. Setting a lower coverage limit removes erroneous mismatches that are caused by sequencing errors and are thus supported by very few reads (Durbin et al., 2010; Li, Li, et al., 2009). Although it is recommended for most tools, setting a lower limit has been shown to reduce sensitivity without increasing specificity in some tools (Goya et al., 2010) and should therefore be considered in the context of the utilized tool. Setting an upper limit removes mismatches caused by copy number variations, PCR duplicates introduced by library preparation (Gomez-Alvarez et al., 2009) and reads mapping to paralogous sequences. The limit should be set according to the initial coverage, and we recommend setting it to ~10 times the average coverage. PCR duplicates should be further assessed and then marked or removed using specialized tools (Li et al., 2009; *PICARD*), as mentioned in the pre-analysis processing section.
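The thresholding described above can be sketched as follows. This is a minimal illustration rather than the procedure of any particular variant caller: the `depth` field, the function name and the lower limit of 4 reads are hypothetical, while the upper factor of 10 follows the recommendation in the text.

```python
def filter_by_depth(sites, mean_coverage, min_depth, max_factor=10.0):
    """Keep candidate sites whose read depth lies within the accepted window.

    sites         -- list of dicts, each with a 'depth' entry (hypothetical layout)
    mean_coverage -- average coverage of the experiment
    min_depth     -- lower limit; tool dependent, chosen by the user
    max_factor    -- upper limit expressed as a multiple of the average coverage
    """
    upper = max_factor * mean_coverage
    return [site for site in sites if min_depth <= site["depth"] <= upper]


# Hypothetical candidate sites, for illustration only.
candidates = [
    {"pos": 1200, "depth": 2},    # too few supporting reads
    {"pos": 5310, "depth": 35},   # within the window
    {"pos": 9044, "depth": 450},  # suspiciously deep (e.g., paralogs, duplicates)
]
kept = filter_by_depth(candidates, mean_coverage=30, min_depth=4)
print([site["pos"] for site in kept])
```

Since a lower limit can reduce sensitivity in some tools, `min_depth` is left as an explicit argument so that it can be relaxed or disabled (set to 0) for the tool in use.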

**Mapping quality and Quality recalibration –** Some reads mapping to underrepresented regions in the genome, especially low complexity and repetitive regions, will be inaccurately mapped with a low mapping quality. SNPs derived from these reads have a higher chance of being false positives (Durbin et al., 2010) and should be examined more carefully, setting stricter quality and coverage thresholds if possible. As mentioned in the prior section on alignment considerations, quality recalibration increases the validity of the alignment qualities so that they better denote the probability of a mismatch between the base and the reference. Naturally, these recalibrated qualities improve the efficiency of variant detection tools that incorporate alignment qualities into their calling algorithms (Koboldt et al., 2009; Li, Li, et al., 2009; McKenna et al., 2010; Qi et al., 2010).

**Cross-lane comparison –** It is good practice, when sequences of the same sample from different lanes are available, to compare the number of SNPs, insertions and deletions detected for each lane. If one of the lanes has a significantly higher number of detected variants, it is probable that it will introduce false positives to the analysis, and exclusion of that lane from downstream variant calling is recommended. Another possible data validation option is comparison against a SNP chip, if available (Koboldt et al., 2010). Going over each annotated SNP provides the user with more than a million checkpoints to ascertain both the validity and fidelity of the sequencing process and the chromosomal representation (e.g., haploid or diploid).
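A cross-lane check of this kind could be sketched as below. The text does not define what counts as "significantly higher", so the rule used here (more than 1.5 times the median count across lanes) is purely an assumption for illustration, as are the lane names and counts.

```python
from statistics import median


def flag_outlier_lanes(variant_counts, factor=1.5):
    """Return lanes whose variant count exceeds factor x the median count.

    variant_counts -- dict mapping lane name to number of detected variants
    factor         -- hypothetical cut-off; not a value taken from the literature
    """
    if not variant_counts:
        raise ValueError("no lanes given")
    typical = median(variant_counts.values())
    return sorted(lane for lane, count in variant_counts.items()
                  if count > factor * typical)


# Hypothetical per-lane variant counts for one sample.
counts = {"lane1": 1000, "lane2": 1050, "lane3": 2400}
print(flag_outlier_lanes(counts))
```

A flagged lane is a candidate for exclusion, not proof of a problem; the per-lane quality metrics should be inspected before the data are discarded.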

We will now address a few more variant specific considerations and applications.

#### **4.3.1 Single nucleotide polymorphisms**

After aligning deep sequencing reads against a reference genome, SNPs can be naively inferred from the results by simply denoting each base that is inconsistent between reference and read as a SNP. This straightforward inference of mismatches results in a massive number of alleged SNPs, many of which suffer from some sort of inaccuracy, such as calling a mismatch in the wrong location, homozygosity and heterozygosity discrepancies, and even calling a mismatch in the correct location but with the wrong base. Currently most SNP calling tools (Koboldt et al., 2009; Li et al., 2009; 2008; Li, Li, et al., 2009; McKenna et al., 2010; Qi et al., 2010) apply different probabilistic considerations and heuristics, such as quality assessment and recalibration, SNP filtration, local realignment, coverage assessment, prior probability based on known SNPs, genotype based likelihood and even cancer genomics (Goya et al., 2010), to elucidate SNPs from alignment results. The user should be familiar with these considerations and be aware of the tools that apply each of them when performing SNP calling. We will go over some of them and discuss their effects and benefits.

**Local realignment –** Current mapping tools align reads independently of the alignment region context. If a read's beginning or end maps to a region containing an indel, a mismatch will be called instead of an indel due to alignment scoring considerations. Adding a secondary, local alignment that considers reads supporting the presence of an indel in the vicinity of either detected SNPs or known SNP sites retrieved from dbSNP (Day, 2010) results in a significant reduction in false positive SNPs (Durbin et al., 2010; McKenna et al., 2010). This local realignment is highly recommended prior to SNP analysis and is either performed inherently in some tools (Qi et al., 2010) or can be specifically performed using other available tools (McKenna et al., 2010).

**Base Alignment Quality –** Since local realignment is a computationally intensive process that depends on correctly denoting insertions and deletions, another method for increasing SNP detection accuracy has been proposed (Li, 2011): a per-base alignment quality recalibration that re-evaluates the misalignment probability using profile hidden Markov models. This quality recalibration can be performed using SAMtools (Li et al., 2009).

**Transition / Transversion Ratio (Ti/Tv) –** The expected ratio between transitions (i.e., purine-purine or pyrimidine-pyrimidine substitutions) and transversions (i.e., purine-pyrimidine substitutions) can be elucidated from empirical data retrieved from the 1000 Genomes project (Durbin et al., 2010). This ratio could be utilized as an initial quality assessment standard. Currently the expected Ti/Tv ratio is ~2.3 for whole-genome sequencing and around 3.3 for whole-exome sequencing (coding regions only) (DePristo et al., 2011). When detected SNPs demonstrate a ratio closer to that expected for random substitutions, with transversions twice as common as transitions (i.e., ~0.5), low quality variant calling or data is implied and quality thresholds should be reassessed.
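The Ti/Tv check can be sketched in a few lines. The representation of a SNP as a (reference base, alternative base) tuple and the function names are assumptions made for this illustration; the example SNP list is invented.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Return the transition/transversion ratio of (ref, alt) base pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
            raise ValueError(f"not a valid SNP: {ref}>{alt}")
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return transitions / transversions


# Hypothetical call set: four transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(titv_ratio(calls))

# All 12 possible substitutions: 4 transitions and 8 transversions,
# which is the ratio of ~0.5 expected for random errors.
bases = "ACGT"
every_substitution = [(r, a) for r in bases for a in bases if r != a]
print(titv_ratio(every_substitution))
```

The second call shows where the value of ~0.5 comes from: each base has one possible transition and two possible transversions, so uniformly random mismatches yield twice as many transversions as transitions.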

**dbSNP validation –** After producing a list of detected SNPs, it is highly recommended to compare it against dbSNP, the largest repository of SNP data, found within the National Center for Biotechnology Information database. Detected SNPs present in the database are considered known, and the ones not found are considered novel (Li and Stockwell, 2010). The portion of novel SNPs detected in a deep sequencing experiment should range between 1 and 10 percent (DePristo et al., 2011). If this proportion is higher, a high rate of false positive variants is suggested, and we recommend reevaluating the detection process and possibly implementing stricter variant inclusion criteria.
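A minimal sketch of this comparison is given below. It assumes that both the detected SNPs and the known dbSNP sites have already been loaded into memory and that a SNP is identified by a (chromosome, position, alternative allele) tuple; this key, the function name and the example data are hypothetical, and only the 10 percent ceiling comes from the text.

```python
def novel_fraction(detected, known):
    """Return the fraction of detected SNPs that are absent from the known set."""
    detected = set(detected)
    if not detected:
        raise ValueError("no detected SNPs given")
    novel = detected - set(known)
    return len(novel) / len(detected)


# Hypothetical data: 20 detected SNPs, 18 of which are already catalogued.
detected = {("chr1", pos, "A") for pos in range(1, 21)}
known = {("chr1", pos, "A") for pos in range(1, 19)} | {("chr2", 7, "T")}

fraction = novel_fraction(detected, known)
print(f"novel fraction: {fraction:.2f}")
if fraction > 0.10:
    print("more novel SNPs than expected; consider stricter filtering")
```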

#### **4.3.2 Insertions and deletions (Indels)**

Indels are the second most common type of polymorphism and the most common structural variant. In this sub-section we will address only short indels, as the next section will deal with the larger (>1 kb) structural variants. Most indels range between 2 and 16 bases in length (Mullaney et al., 2010) (also referred to as micro-indels), and their frequency has been shown to vary across the genome, with a lower rate in conserved and functional regions and an increased rate in hot spots for genetic variation. The average indel rate is approximately one indel per 5.1 to 13.2 kb of DNA (Mills et al., 2006). Indels are implicated in the pathogenesis of disease, in gene expression and functionality, and in the identification of viral disease forms, and they can be used as genetic markers in natural populations. Indels occur at an estimated rate that is eight-fold lower than that of SNPs (Durbin et al., 2010). The reported rate varies extensively between sequenced individuals, usually due to variability between mapping and detection tools. Reads covering an indel are generally more difficult to map, since their correct alignment involves either complex gapped alignment or paired-end sequencing inference. Optimal indel detection is achieved by combining an appropriate alignment software with a variant detection tool (Albers et al., 2010; Koboldt et al., 2009; Li et al., 2009; McKenna et al., 2010; Qi et al., 2010; Ye et al., 2009) and carefully adjusting their parameters according to the suspected variants. As mentioned before in the alignment section, it is highly recommended to perform indel calling with alignment tools that implement gapped alignment (Krawitz et al., 2010; Li and Durbin, 2009; Li et al., 2008; *Novoalign*). A few considerations when addressing insertion-deletion detection:

**Read length –** Increasing the read length has been shown to improve the ability to map and detect insertion related reads. Sequence reads 36 bases long, such as the ones produced by the Illumina GAIIx, have been shown to be inefficient for the detection of insertions longer than 3 bases, with a complete inability to detect insertions longer than 7 bases. Hence the length of the sequenced reads should be considered according to the expected insertion size range and adjusted appropriately. Naturally, when the insertion size is expected to surpass the read length, it is impossible to detect such insertions using single-end sequencing. Increasing the read length has also been shown to improve micro-indel (<10 bases) detection sensitivity without significantly affecting specificity, demonstrating a more efficient method for increasing coverage than simply producing more reads.

**Paired-end reads** – Indel detection greatly improves when based on paired-end deep sequencing data (Mullaney et al., 2010). Both alignment tools (Li and Durbin, 2009; Li et al., 2008) and variant detection tools (Ye et al., 2009) utilize paired-end reads so that one of the reads is used to pinpoint the pair's locus in the reference while the other read can be subjected to gapped alignment and indel inference. Furthermore, the insert (i.e., the unsequenced gap between a read pair) can also be used to deduce the presence of an indel (discussed in the next section).

#### **4.3.3 Structural variants**

Structural variants (Feuk et al., 2006) are defined as genomic alterations that involve segments of DNA larger than 1 kb. They include: (1) Copy number variations (CNVs), which are sections of the DNA with a variable copy number when compared to a reference genome. Insertions, deletions and duplications are types of CNVs. (2) Segmental duplications, DNA segments that are almost identical (>90%) and can appear in a variable number of copies, also considered a type of CNV. (3) Inversions, segments of the DNA that are reversed in orientation. (4) Translocations, an intra- or inter-chromosomal shift in the location of a DNA segment without a change in the total DNA content. (5) Segmental uniparental disomy, where a diploid individual's pair of homologous chromosomes originated from a single parent. Since current deep sequencing platforms do not produce reads that span the length of structural variants, paired-end mapping is necessary for their exact elucidation. The quality of structural variation detection using deep sequencing can be assessed by the accuracy of breakpoint localization, copy number count and variation size estimation (Medvedev et al., 2009).

**Paired-end mapping (PEM)** - Paired-end sequencing refers to the process of sequencing a cloned DNA fragment on both ends, resulting in two associated sequence reads with an unsequenced insert between them. The insert length varies from several bases to several thousands of bases and is thus appropriate for the detection of the aforementioned large scale structural variants. Structural variants are often detected indirectly through associated patterns in paired-end deep sequencing data (Bashir et al., 2008; Korbel et al., 2009; Medvedev et al., 2009). Some of these patterns approximate the location of the structural breakpoints, and some provide an exact localization. For example, the signature of an insertion or deletion can be easily inferred by comparing the distance between the mapped read pair on the reference with the expected insert size: if the reference distance is longer or shorter than the insert size, the presence of a deletion or insertion, respectively, can be inferred, but deducing the exact location of the indel from these signatures is more difficult. However, in an "anchored split mapping" signature, when one read from a pair is perfectly aligned against the reference and its mate cannot be aligned against its designated reference location, the unaligned read can be utilized to pinpoint the exact location of existing large deletions or small insertions (Medvedev et al., 2009). SV PEM signatures improve all aspects of SV detection quality, and so PEM is highly recommended for this purpose.

**Insert size –** The insert size, set by the size of the DNA fragments introduced by library preparation, can affect the outcome of SV detection. If the experimental goal is to detect as many structural variants as possible, a larger insert length is suggested. If, however, a more precise localization of the variants is necessary, a shorter insert length is recommended, though it results in an overall lower variant discovery sensitivity (i.e., if a variant is found, there is a greater probability of precise localization) (Bashir et al., 2008).

**Depth of coverage –** Coverage can also be utilized for SV elucidation, specifically for large scale deletions and duplications (Yoon et al., 2009). As we expect the reads mapping to each region to follow a Poisson distribution, deviations from the expected coverage suggest the presence of a duplication or deletion. SV detection benefits from combining increased coverage with an abundance of paired-end reads, with a significant increase in specificity (Bashir et al., 2008). Coverage cannot, however, be utilized to elucidate the exact location of these SVs, only to suggest the expected region in some cases. Coverage biases, as mentioned in previous sections, can also confound the SV detection process.

**Clustering methods –** After SV signatures have been detected, a calculated inference process is crucial. Most methods utilize some sort of clustering of all the pairs supporting a variant and deduce variant information from that cluster. Since most methods rely on PEM and coverage for SV detection, they differ mainly in their clustering methods. It is important to be familiar with these methods, since some are more suited for certain variation types. Since describing each clustering method is beyond the scope of this chapter, we will only point out certain aspects that are both easy to implement and have significant effects on detection quality. The standard clustering method (Korbel et al., 2009) utilizes only uniquely mapped read pairs and discards the ones mapped to multiple loci. It also utilizes a set standard deviation limit for the difference between the known insert size range and the observed mapped reference distance. These "hard" filters reduce the effectiveness of both homozygosity/heterozygosity inference and small scale variation detection (Medvedev et al., 2009). Soft (Hormozdiari et al., 2010) and distribution (K. Chen et al., 2009; McKernan et al., 2009) based clustering methods consider multiply mapped reads and assign them according to their supporting context, thus increasing sensitivity for the presence of small indels and heterozygosity, and should be considered when experimentally relevant.

**Validation by assembly –** It is recommended to combine *de novo* assembly with structural variant detection in order to validate detected variants. Once the assembly process is complete, a search for supporting and conflicting sequence contigs should be performed (Koboldt et al., 2010).

We note that the field of structural variation deduction from deep sequencing data is still in its infancy and that both false-positive and false-negative rates are far from satisfactory (Hormozdiari et al., 2009). As sequencing technology improves, raising coverage and read length, and as the algorithmic utilization of such improvements continues, we expect greater utilization of deep sequencing for SV detection.

#### **4.3.4 Variant classification**

Calling variants using deep sequencing data often results in a multitude of detected variations. Even after strict and effective quality filtration, as described earlier, deep sequencing data reveals thousands to millions of different variations (Imelfort et al., 2009). These variations can result in biological effects through the introduction of different amino acids into protein sequences, early termination of coding sequences and alteration of regulatory elements and splice sites. A natural step following the variant calling process is annotating the detected variants and elucidating their effect and biological significance, separating clinically, scientifically and medically relevant variations from neutral, non-functional ones. In a list spanning this many variants, manual annotation of each variant's effect is neither feasible nor accurate. We will therefore cover the currently leading principles for computational classification and prioritization of detected variants.

**Initial prioritization -** The first step in variation characterization is the deduction of basic variant properties, such as the variant's location: whether it lies in a known coding sequence, non-coding transcript, promoter, splice site etc. Once a variation is localized in a coding sequence, a subsequent analysis of its frame effect and of whether it is synonymous (i.e., not changing the encoded amino acid) or non-synonymous should be performed. These basic properties allow initial prioritization of the variation list, considering that coding sequence non-sense

aspects of SV detection quality and so PEM is highly recommended for this purpose.

there is a greater probability of precise localization) (Bashir et al., 2008).

**Insert size –** insert size, set by the size of the DNA fragments introduced by library preparation can affect the outcome of SV detection. If the experimental goal is to detect as many structural variants as possible, a larger insert length is suggested. If however, a more precise localization of the variants is necessary, a shorter insert length is recommended, though resulting in an overall lower variant discovery sensitivity (e.g if you find the variant,

**Depth of coverage –** Coverage can also be utilized for SV elucidation, specifically large scale deletions and duplications (Yoon et al., 2009). As we expect reads mapping to each region to follow a Poisson distribution, deviations from the expected coverage suggest the presence of a duplication or deletion. SV detection benefits from combining increased coverage with abundance of paired-end reads with a significant increase in specificity (Bashir et al., 2008). Coverage cannot be utilized however to elucidate the exact location of these SVs, only to

number count and variation size estimation (Medvedev et al., 2009).

next section).

**4.3.3 Structural variants** 

suggest the expected region in some cases. Coverage biases, as mentioned in previous sections, can also misconstrue the SV detection process.
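The depth-of-coverage reasoning can be illustrated with a minimal sketch. It assumes reads have already been counted in fixed-size windows and that the expected count per window is known; the function names, the example counts and the significance threshold `alpha` are all illustrative assumptions, not part of any published tool. Each window is compared against a Poisson model, and windows with improbably low or high counts are flagged.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), with each term computed in log space."""
    if k < 0:
        return 0.0
    total = sum(
        math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        for i in range(k + 1)
    )
    return min(1.0, total)


def flag_windows(counts, expected, alpha=1e-3):
    """Label each window by comparing its read count with a Poisson(expected) model.

    `alpha` is a hypothetical per-window significance threshold.
    """
    labels = []
    for c in counts:
        p_low = poisson_cdf(c, expected)               # P(X <= c): unusually few reads
        p_high = 1.0 - poisson_cdf(c - 1, expected)    # P(X >= c): unusually many reads
        if p_low < alpha:
            labels.append("possible deletion")
        elif p_high < alpha:
            labels.append("possible duplication")
        else:
            labels.append("normal")
    return labels


# Hypothetical read counts for six consecutive windows, expected count 100.
window_counts = [98, 104, 41, 100, 162, 95]
labels = flag_windows(window_counts, expected=100.0)
for count, label in zip(window_counts, labels):
    print(count, label)
```

Such a test only indicates the affected window, in line with the point that coverage suggests a region rather than an exact breakpoint. Real data would also need correction for coverage biases before a plain Poisson model is reasonable.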

**Clustering methods –** After SV signatures have been detected, a careful inference process is crucial. Most methods cluster all the read pairs supporting a variant and deduce variant information from that cluster. Since most methods rely on PEM and coverage for SV detection, they differ mainly in their clustering methods. It is important to become familiar with these methods, since some are better suited to certain variation types. Describing each clustering method is beyond the scope of this chapter, so we will only point out certain aspects that are both easy to implement and have significant effects on detection quality. The standard clustering method (Korbel et al., 2009) utilizes only uniquely mapped read pairs and discards those mapped to multiple loci. It also applies a fixed standard deviation limit to the difference between the known insert size range and the observed mapped reference distance. These "hard" filters reduce the effectiveness of both homozygosity/heterozygosity inference and small-scale variation detection (Medvedev et al., 2009). Soft (Hormozdiari et al., 2010) and distribution-based (K. Chen et al., 2009; McKernan et al., 2009) clustering methods consider multiply mapped reads and assign them according to their supporting context, thus increasing sensitivity to small indels and heterozygosity; they should be considered when experimentally relevant.
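The "hard" filtering and clustering idea can be sketched as follows. This is an illustrative simplification, not the implementation of any cited tool: the `ReadPair` type, the three-standard-deviation cutoff, the `max_gap` distance and the `min_support` threshold are all assumptions chosen for the example. Multiply mapped pairs are discarded, the remaining pairs are classified by comparing their mapped span with the expected insert size, and nearby discordant pairs of the same kind are merged into one supporting cluster.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom: str
    left: int      # leftmost mapped coordinate of the pair
    right: int     # rightmost mapped coordinate of the pair
    unique: bool   # True if the pair maps to a single locus


def classify(pair, mean_insert, sd_insert, n_sd=3.0):
    """Compare the mapped span with the expected insert size (hard SD cutoff)."""
    span = pair.right - pair.left
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"      # reads map further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"     # reads map closer together than expected
    return "concordant"


def cluster_pairs(pairs, mean_insert, sd_insert, n_sd=3.0, max_gap=500, min_support=2):
    """Discard multiply mapped pairs, then greedily cluster nearby discordant pairs."""
    discordant = []
    for p in pairs:
        if not p.unique:
            continue  # hard filter: multiply mapped pairs are ignored
        kind = classify(p, mean_insert, sd_insert, n_sd)
        if kind != "concordant":
            discordant.append((p.chrom, kind, p.left, p.right))
    discordant.sort()

    clusters = []
    for chrom, kind, left, right in discordant:
        cur = clusters[-1] if clusters else None
        same_group = cur is not None and (cur["chrom"], cur["kind"]) == (chrom, kind)
        if same_group and left - cur["last_left"] <= max_gap:
            cur["end"] = max(cur["end"], right)
            cur["support"] += 1
            cur["last_left"] = left
        else:
            clusters.append({"chrom": chrom, "kind": kind, "start": left,
                             "end": right, "support": 1, "last_left": left})
    return [
        {k: v for k, v in c.items() if k != "last_left"}
        for c in clusters
        if c["support"] >= min_support
    ]


# Hypothetical pairs from a library with mean insert size 300 and SD 30.
pairs = [
    ReadPair("chr1", 1000, 1300, True),    # concordant
    ReadPair("chr1", 5000, 6200, True),    # span 1200: deletion signature
    ReadPair("chr1", 5050, 6260, True),    # span 1210: deletion signature
    ReadPair("chr1", 5100, 6290, False),   # multiply mapped: discarded
    ReadPair("chr1", 9000, 9100, True),    # span 100: lone insertion signature
]
calls = cluster_pairs(pairs, mean_insert=300, sd_insert=30)
print(calls)
```

In this toy input the multiply mapped pair would have added support to the deletion cluster, and the lone insertion signature is dropped for lack of support. This illustrates why hard filters can cost sensitivity compared with soft or distribution-based approaches.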

**Validation by assembly –** It is recommended to combine *de novo* assembly with structural variant detection in order to validate detected variants. Once the assembly process is complete, a search for supporting and conflicting sequence contigs should be performed (Koboldt et al., 2010).

We note that the field of structural variation deduction from deep sequencing data is still in its infancy, and both false-positive and false-negative rates are far from satisfactory (Hormozdiari et al., 2009). As sequencing technology improves, raising coverage and read length, and as algorithms continue to exploit these improvements, we expect greater utilization of deep sequencing for SV detection.

#### **4.3.4 Variant classification**

Calling variants using deep sequencing data often results in a multitude of detected variations: even after strict and effective quality filtration, as described earlier, deep sequencing data reveals thousands to millions of different variations (Imelfort et al., 2009). These variations can result in biological effects through the introduction of different amino acids into protein sequences, early termination of coding sequences and alteration of regulatory elements and splice sites. A natural step following the variant calling process is annotating the detected variants and elucidating their effect and biological significance, separating clinically, scientifically and medically relevant variations from neutral, non-functional ones. In a list spanning this many variants, manual annotation of each variant's effect is neither feasible nor accurate. We will therefore cover the currently leading principles for computational classification and prioritization of detected variants.

**Initial prioritization -** The first step in variation characterization is the deduction of basic variant properties, such as the variant's location: whether it lies in a known coding sequence, a non-coding transcript, a promoter, a splice site etc. Once a variation is localized to a coding sequence, its effect on the reading frame and whether it is synonymous (i.e. leaving the encoded amino acid unchanged) or non-synonymous (changing it) should be analyzed. These basic properties allow initial prioritization of the variation list, considering that coding sequence nonsense mutations are more likely to be functionally relevant than mutations in an unexpressed genomic sequence. When dealing with an annotated genome, computational tools should be utilized for this purpose (Conde et al., 2006; Li and Stockwell, 2010; McLaren et al., 2010; Yuan et al., 2006). We recommend checking the dbSNP version used by the chosen annotation tools and striving to employ the most up-to-date version available, so as to increase the availability of variant annotations.
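As a minimal illustration of the synonymous versus non-synonymous distinction used in initial prioritization, the sketch below classifies a single-base substitution within one codon using the standard genetic code. The function name `classify_snv` and the effect labels are assumptions made for this example; established annotation tools additionally handle transcripts, strand, splice sites and frame shifts, none of which is modelled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snv(ref_codon, pos_in_codon, alt_base):
    """Classify a single-base substitution at position 0, 1 or 2 of a codon."""
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # encoded amino acid unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop codon introduced
    if ref_aa == "*":
        return "stop-loss"    # stop codon replaced by an amino acid
    return "missense"         # a different amino acid is encoded


print(classify_snv("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_snv("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_snv("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```

A prioritization step could then rank nonsense and missense calls above synonymous ones before any more detailed functional analysis.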

**Coding sequence variants -** In order to ascertain which coding sequence variation in a given list is most likely to be associated with a phenotype, current variation profiling methods utilize biochemical and physical properties of both amino acids and proteins, considering both structure (Ramensky et al., 2002) and function (Bromberg and Rost, 2007; Calabrese et al., 2009) and applying various probabilistic algorithms (Mi et al., 2007). Characteristics that may be incorporated include the molecular mass, polarity, acidity, basicity, aromaticity, conformational flexibility and hydrophobicity of amino acids (Ng and Henikoff, 2006), as well as hydrogen bond breaks, the introduction of a buried polar residue, the loss of a salt bridge, the insertion of proline into an α-helix and the breaking of disulfide bonds in proteins (Wang and Moult, 2001). Some available tools (Ashkenazy et al., 2010; Kumar et al., 2009; Li et al., 2009) exploit the fact that functionally crucial amino acids are evolutionarily conserved, employing conservation scores based on multiple sequence alignment in order to prioritize given variations. Utilizing orthologous sequences for this purpose is more effective than incorporating paralogous sequences, since the latter represent proteins with slight differences in both sequence and function and are less informative for conservation analysis. It was shown that the degree of conservation is in fact the most reliable predictor of possible pathogenicity of a missense variant (Flanagan et al., 2010).
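The idea of an alignment-based conservation score can be illustrated with a simple per-column measure. The sketch below uses a normalized Shannon entropy, which is one common choice and is not the scoring scheme of any of the tools cited above; the toy alignment and the function name `column_conservation` are hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment, column):
    """Return a conservation score in [0, 1] for one alignment column.

    The score is 1 minus the Shannon entropy of the residues in the column,
    normalized by log2(20), the maximum entropy over 20 amino acids.
    Gap characters ('-') are ignored.
    """
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
    return 1.0 - entropy / math.log2(20)


# Hypothetical alignment of four orthologous protein fragments.
alignment = ["MKTAY", "MKSAY", "MKTAF", "MRTAY"]
scores = [column_conservation(alignment, i) for i in range(len(alignment[0]))]
for position, score in enumerate(scores):
    print(position, round(score, 3))
```

Under this measure a variant falling in a fully conserved column (score 1.0) would be prioritized over one in a variable column. Real tools typically weight sequences by evolutionary distance and account for residue similarity, which this sketch does not.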

To optimize both prioritization and functional analysis, we recommend combining available annotation tools that employ a variety of prioritization features (George et al., 2008; Lee and Shatkay, 2008). A recent study applied some of these variation classification methods to recorded SNPs in a target gene in order to elucidate possible cancer-causing mutations, reducing the initial number of suspected SNPs from thousands to fewer than 30 (Choura and Rebai, 2009). Another study utilized bioinformatic tools to classify known non-synonymous mutations in colon cancer and was able to pinpoint four SNPs already known to be related to increased cancer risk (Doss and Sethumadhavan, 2009). However, a recent comprehensive review (Karchin, 2009) implemented and compared leading variant classification tools on three different studies (Doecke et al., 2008; Fatemi et al., 2008; Van Deerlin et al., 2008) associating both exonic and intronic, novel and known SNPs with a variety of diseases, and demonstrated that combining several tools can result in conflicting annotations and conflicting deductions of functional effect. Another comparison (Thusberg et al., 2011), which tested the performance of several of the aforementioned tools in predicting pathogenicity using test data retrieved from dbSNP, found that the sensitivity of these tools ranged from 0.59 to 0.9, with SNPs&GO and MutPred (Calabrese et al., 2009; Li et al., 2009) being the preferred tools in that analysis. Both studies agree that inference of functionality and pathogenicity is not a fully automatic pathway and that educated interpretation of the results must be conducted.

#### **5. Conclusion**

Deep-sequencing data analysis is a growing field with many computational challenges. A typical deep sequencing run outputs a massive amount of data that requires complex computational processing and interpretation. The abundance of available bioinformatic tools and software for each of the optional analysis steps presents a challenge for the researcher aiming to evaluate and interpret deep sequencing data. In this chapter we familiarized the reader with crucial concepts and considerations for preparation, refinement, analysis and elucidation of valid and accurate conclusions. The field is rapidly evolving both in hardware and sequencing platform technology and in computational techniques, algorithms, software and tools. It is crucial to understand the various challenges involved in deep sequencing experiments, and the currently available solutions, both in concept and in practice. The concepts presented in this chapter are aimed at optimizing deep sequencing experiments, concentrating on the initial steps of data preparation and quality refinement and covering several possible analysis pathways, while noting some of the currently available and leading tools and some of their underlying methods.

The first section of this chapter introduced the available deep sequencing platforms with regard to their advantages and limitations, emphasizing that although they are all considered high-throughput sequencing platforms, they present different capabilities and proficiencies. When a choice between platforms is available, one can improve data retrieval and validity simply by matching the most appropriate platform with the specific experimental needs.

The second section covered the concept of deep sequencing data quality control. Using bioinformatic tools based on both empirical and probabilistic deduction, sequencing-derived errors that would otherwise be incorporated into downstream analysis can be reduced. We described current quality scales, with methods for their assessment and their relevance for improved data retrieval. Employing these quality control and assurance methods can assist in uncovering biased sequencing lanes and recurring errors and contaminants that could significantly alter deep sequencing results. We therefore strongly urge users to apply them prior to any subsequent experimental evaluation, making their incorporation a standard in deep sequencing experiments.

The third and subsequent sections covered specific and very common analysis pathways: alignment, assembly and variant calling. The chapter introduced the basic challenges faced in each type of analysis, their current limitations and the considerations pivotal for experimental planning. A description of each challenge was accompanied by a delineation of current methods, tools and solutions when available. Familiarized with these challenges, the user can now make better analytic decisions and employ the most appropriate tools and techniques. Understanding the exact edge of each analytic pathway can help users perform their deep sequencing experiments in the most effective manner, employing both current and future software for optimal variant calling.

#### **6. References**

1000 Genomes Project, http://www.1000genomes.org/

Albers, C. A., Lunter, G., Macarthur, D. G., McVean, G., Ouwehand, W. H., & Durbin, R. (2010). Dindel: Accurate indel calls from short-read data. *Genome Research*. doi:10.1101/gr.112326.110

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. *Journal of Molecular Biology*, *215*(3), 403-410. doi:10.1006/jmbi.1990.9999

Ashkenazy, H., Erez, E., Martz, E., Pupko, T., & Ben-Tal, N. (2010). ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. *Nucleic Acids Research*, *38*, W529-533.

Bashir, A., Volik, S., Collins, C., Bafna, V., & Raphael, B. J. (2008). Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. *PLoS Computational Biology*, *4*(4), e1000051. doi:10.1371/journal.pcbi.1000051

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., Hall, K. P., et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. *Nature*, *456*(7218), 53-59. doi:10.1038/nature07517

Bromberg, Y., & Rost, B. (2007). SNAP: predict effect of non-synonymous polymorphisms on function. *Nucleic Acids Research*, *35*(11), 3823-3835. doi:10.1093/nar/gkm238

Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I. A., Belmonte, M. K., Lander, E. S., Nusbaum, C., et al. (2008). ALLPATHS: de novo assembly of whole-genome shotgun microreads. *Genome Research*, *18*(5), 810-820. doi:10.1101/gr.7337908

Cahill, M. J., Köser, C. U., Ross, N. E., & Archer, J. A. C. (2010). Read length and repeat resolution: exploring prokaryote genomes using next-generation sequencing technologies. *PloS One*, *5*(7), e11518. doi:10.1371/journal.pone.0011518

Calabrese, R., Capriotti, E., Fariselli, P., Martelli, P. L., & Casadio, R. (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. *Human Mutation*, *30*(8), 1237-1244. doi:10.1002/humu.21047

Chaisson, M. J., & Pevzner, P. A. (2008). Short read fragment assembly of bacterial genomes. *Genome Research*, *18*(2), 324-330. doi:10.1101/gr.7088808

Chan, E. Y. (2009). Next-generation sequencing methods: impact of sequencing accuracy on SNP discovery. *Methods in Molecular Biology (Clifton, N.J.)*, *578*, 95-111. doi:10.1007/978-1-60327-411-1\_5

Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., McGrath, S. D., et al. (2009). BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. *Nat Meth*, *6*(9), 677-681. doi:10.1038/nmeth.1363

Chiu, R. W. K., Sun, H., Akolekar, R., Clouser, C., Lee, C., McKernan, K., Zhou, D., et al. (2010). Maternal Plasma DNA Analysis with Massively Parallel Sequencing by Ligation for Noninvasive Prenatal Diagnosis of Trisomy 21. *Clin Chem*, *56*(3), 459-463. doi:10.1373/clinchem.2009.136507

Choura, M., & Rebaï, A. (2009). Applications of computational tools to predict functional SNPs effects in human ErbB genes. *Journal of Receptor and Signal Transduction Research*, *29*(5), 286-291. doi:10.1080/10799890902911948

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. *Bioinformatics (Oxford, England)*, *25*(11), 1422-1423. doi:10.1093/bioinformatics/btp163

Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. *Nucleic Acids Research*, *38*(6), 1767-1771. doi:10.1093/nar/gkp1137

Conde, L., Vaquerizas, J. M., Dopazo, H., Arbiza, L., Reumers, J., Rousseau, F., Schymkowitz, J., et al. (2006). PupaSuite: finding functional single nucleotide polymorphisms for large-scale genotyping purposes. *Nucleic Acids Research*, *34*(Web Server issue), W621-625. doi:10.1093/nar/gkl071

Cox, M. P., Peterson, D. A., & Biggs, P. J. (2010). SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. *BMC Bioinformatics*, *11*, 485. doi:10.1186/1471-2105-11-485

Dai, M., Thompson, R. C., Maher, C., Contreras-Galindo, R., Kaplan, M. H., Markovitz, D. M., Omenn, G., et al. (2010). NGSQC: cross-platform quality analysis pipeline for deep sequencing data. *BMC Genomics*, *11 Suppl 4*, S7. doi:10.1186/1471-2164-11-S4-S7

Dalloul, R. A., Long, J. A., Zimin, A. V., Aslam, L., Beal, K., Blomberg, L. A., Bouffard, P., et al. (2010). Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. *PLoS Biology*, *8*(9). doi:10.1371/journal.pbio.1000475

Day, I. N. M. (2010). dbSNP in the detail and copy number complexities. *Human Mutation*, *31*(1), 2-4. doi:10.1002/humu.21149

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philippakis, A. A., et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. *Nat Genet*, *advance online publication*. doi:10.1038/ng.806

Doecke, J., Zhao, Z. Z., Pandeya, N., Sadeghi, S., Stark, M., Green, A. C., Hayward, N. K., et al. (2008). Polymorphisms in MGMT and DNA repair genes and the risk of esophageal adenocarcinoma. *International Journal of Cancer. Journal International Du Cancer*, *123*(1), 174-180. doi:10.1002/ijc.23410

Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2007). SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. *Genome Research*, *17*(11), 1697-1706. doi:10.1101/gr.6435207

Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. *Nucleic Acids Research*, *36*(16), e105. doi:10.1093/nar/gkn425

Dolan, P. C., & Denver, D. R. (2008). TileQC: a system for tile-based quality control of Solexa data. *BMC Bioinformatics*, *9*, 250. doi:10.1186/1471-2105-9-250

Doss, C. G. P., & Sethumadhavan, R. (2009). Investigation on the role of nsSNPs in HNPCC genes--a bioinformatics approach. *Journal of Biomedical Science*, *16*, 42. doi:10.1186/1423-0127-16-42

Durbin, R. M., Abecasis, G. R., Altshuler, D. L., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., et al. (2010). A map of human genome variation from population-scale sequencing. *Nature*, *467*(7319), 1061-1073. doi:10.1038/nature09534

Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133

Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. *Genome Research*, *8*(3), 186-194.

Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. *Genome Research*, *8*(3), 175-185.

Fatemi, S. H., King, D. P., Reutiman, T. J., Folsom, T. D., Laurence, J. A., Lee, S., Fan, Y.-T., et al. (2008). PDE4B polymorphisms and decreased PDE4B expression are associated with schizophrenia. *Schizophrenia Research*, *101*(1-3), 36-49. doi:10.1016/j.schres.2008.01.029

*FASTX - toolkit*. (n.d.). Retrieved from http://hannonlab.cshl.edu/fastx\_toolkit/

Feuk, L., Carson, A. R., & Scherer, S. W. (2006). Structural variation in the human genome. *Nature Reviews. Genetics*, *7*(2), 85-97. doi:10.1038/nrg1767

Flanagan, S. E., Patch, A.-M., & Ellard, S. (2010). Using SIFT and PolyPhen to predict loss-of-function and gain-of-function mutations. *Genetic Testing and Molecular Biomarkers*, *14*(4), 533-537. doi:10.1089/gtmb.2010.0036

Flicek, P., & Birney, E. (2009). Sense from sequence reads: methods for alignment and assembly. *Nature Methods*, *6*(11 Suppl), S6-S12. doi:10.1038/nmeth.1376

Frith, M. C., Wan, R., & Horton, P. (2010). Incorporating sequence quality data into alignment improves DNA read mapping. *Nucleic Acids Research*, *38*(7), e100. doi:10.1093/nar/gkq010

Galan, M., Guivier, E., Caraux, G., Charbonnel, N., & Cosson, J.-F. (2010). A 454 multiplex sequencing method for rapid and reliable genotyping of highly polymorphic genes in large-scale studies. *BMC Genomics*, *11*, 296. doi:10.1186/1471-2164-11-296

George Priya Doss, C., Sudandiradoss, C., Rajasekaran, R., Choudhury, P., Sinha, P., Hota, P., Batra, U. P., et al. (2008). Applications of computational algorithm tools to identify functional SNPs. *Functional & Integrative Genomics*, *8*(4), 309-316. doi:10.1007/s10142-008-0086-7

Gnerre, S., Lander, E. S., Lindblad-Toh, K., & Jaffe, D. B. (2009). Assisted assembly: how to improve a de novo genome assembly by using related species. *Genome Biology*, *10*(8), R88. doi:10.1186/gb-2009-10-8-r88

Gomez-Alvarez, V., Teal, T. K., & Schmidt, T. M. (2009). Systematic artifacts in metagenomes from complex microbial communities. *The ISME Journal*, *3*(11), 1314-1317. doi:10.1038/ismej.2009.72

Goto, N., Prins, P., Nakao, M., Bonnal, R., Aerts, J., & Katayama, T. (2010). BioRuby: bioinformatics software for the Ruby programming language. *Bioinformatics (Oxford, England)*, *26*(20), 2617-2619. doi:10.1093/bioinformatics/btq475

Goya, R., Sun, M. G. F., Morin, R. D., Leung, G., Ha, G., Wiegand, K. C., Senz, J., et al. (2010). SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. *Bioinformatics (Oxford, England)*, *26*(6), 730-736. doi:10.1093/bioinformatics/btq040

Harris, T. D., Buzby, P. R., Babcock, H., Beer, E., Bowers, J., Braslavsky, I., Causey, M., et al. (2008). Single-molecule DNA sequencing of a viral genome. *Science (New York, N.Y.)*, *320*(5872), 106-109. doi:10.1126/science.1150427

Hashimoto, T., de Hoon, M. J. L., Grimmond, S. M., Daub, C. O., Hayashizaki, Y., & Faulkner, G. J. (2009). Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. *Bioinformatics (Oxford, England)*, *25*(19), 2613-2614. doi:10.1093/bioinformatics/btp438

Hernandez, D., François, P., Farinelli, L., Osterås, M., & Schrenzel, J. (2008). De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. *Genome Research*, *18*(5), 802-809. doi:10.1101/gr.072033.107

Holland, R. C. G., Down, T. A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., et al. (2008). BioJava: an open-source framework for bioinformatics. *Bioinformatics (Oxford, England)*, *24*(18), 2096-2097. doi:10.1093/bioinformatics/btn397

Hormozdiari, F., Alkan, C., Eichler, E. E., & Sahinalp, S. C. (2009). Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. *Genome Research*, *19*(7), 1270-1278. doi:10.1101/gr.088633.108

Hormozdiari, F., Hajirasouliha, I., Dao, P., Hach, F., Yorukoglu, D., Alkan, C., Eichler, E. E., et al. (2010). Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. *Bioinformatics (Oxford, England)*, *26*(12), i350-357. doi:10.1093/bioinformatics/btq216


Deep Sequencing Data Analysis: Challenges and Solutions 675

Lee, P. H., & Shatkay, H. (2008). F-SNP: computationally predicted functional SNPs for

Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, et al., DNA sequencing of a

Li, Biao, Krishnan, V. G., Mort, M. E., Xin, F., Kamati, K. K., Cooper, D. N., Mooney, S. D., et

Li, H. (2011). Improving SNP discovery by base alignment quality. *Bioinformatics*, *27*(8), 1157

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., et al. (2009).

Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. *Briefings in Bioinformatics*, *11*(5), 473-483. doi:10.1093/bib/bbq015 Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling

Li, K., & Stockwell, T. (2010). VariantClassifier: A hierarchical variant classifier for annotated genomes. *BMC Research Notes*, *3*(1), 191. doi:10.1186/1756-0500-3-191 Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., et al. (2010). The sequence and de

Li, R., Li, Y., Fang, X., Yang, H., Wang, Jian, Kristiansen, K., & Wang, Jun. (2009). SNP

Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., & Wang, Jun. (2009). SOAP2: an

Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., et al. (2010). De novo assembly of

Lin, H., Zhang, Zefeng, Zhang, M. Q., Ma, B., & Li, M. (2008). ZOOM! Zillions of oligos

Mardis, E. R. (2008). The impact of next-generation sequencing technology on genetics.

Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J.,

*Trends in Genetics: TIG*, *24*(3), 133-141. doi:10.1016/j.tig.2007.12.007

*England)*, *25*(16), 2078-2079. doi:10.1093/bioinformatics/btp352

doi:10.1093/nar/gkm904

doi:10.1093/bioinformatics/btp528

doi:10.1093/bioinformatics/btp324

doi:10.1101/gr.078212.108

doi:10.1038/nature08696

*19*(6), 1124-1132. doi:10.1101/gr.088013.108

*20*(2), 265-272. doi:10.1101/gr.097261.109

doi:10.1093/bioinformatics/btn416

doi:10.1038/nature03959

*25*(15), 1966-1967. doi:10.1093/bioinformatics/btp336


6;456(7218):66-72.

disease association studies. *Nucleic Acids Research*, *36*(Database issue), D820-824.

cytogenetically normal acute myeloid leukaemia genome. *Nature*. 2008 Nov

al. (2009). Automated inference of molecular mechanisms of disease from amino acid substitutions. *Bioinformatics (Oxford, England)*, *25*(21), 2744-2750.

transform. *Bioinformatics (Oxford, England)*, *25*(14), 1754-1760.

The Sequence Alignment/Map format and SAMtools. *Bioinformatics (Oxford,* 

variants using mapping quality scores. *Genome Research*, *18*(11), 1851-1858.

novo assembly of the giant panda genome. *Nature*, *463*(7279), 311-317.

detection for massively parallel whole-genome resequencing. *Genome Research*,

improved ultrafast tool for short read alignment. *Bioinformatics (Oxford, England)*,

human genomes with massively parallel short read sequencing. *Genome Research*,

mapped. *Bioinformatics (Oxford, England)*, *24*(21), 2431-2437.

Braverman, M. S., Chen, Y.-J., & Chen, Z. (2005a). Genome sequencing in microfabricated high-density picolitre reactors. *Nature*, *437*(7057), 376-380.


Husemann, P., & Stoye, J. (2010). Phylogenetic comparative assembly. *Algorithms for* 

Imelfort, M., Duran, C., Batley, J., & Edwards, D. (2009). Discovering genetic polymorphisms

Isakov O, Modai S, Shomron N. Pathogen detection using short-RNA deep sequencing subtraction and assembly. *Bioinformatics*. 2011 Aug 1;27(15):2027-30. Jeck, W. R., Reinhardt, J. A., Baltrus, D. A., Hickenbotham, M. T., Magrini, V., Mardis, E. R.,

Karchin, R. (2009). Next generation tools for the annotation of human SNPs. *Briefings in* 

Kelley, D. R., Schatz, M. C., & Salzberg, S. L. (2010). Quake: quality-aware detection and

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. *Genome Research*, *12*(4), 656-664.

Kircher, M., & Kelso, J. (2010). High-throughput DNA sequencing--concepts and limitations.

Koboldt, D. C., Chen, K., Wylie, T., Larson, D. E., McLellan, M. D., Mardis, E. R., Weinstock,

Koboldt, D. C., Ding, L., Mardis, E. R., & Wilson, R. K. (2010). Challenges of sequencing

Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Zhengdong, Snyder,

sequencing data. *Genome Biology*, *10*(2), R23. doi:10.1186/gb-2009-10-2-r23 Krawitz, P., Rödelsperger, C., Jäger, M., Jostins, L., Bauer, S., & Robinson, P. N. (2010).

Kumar, P., Henikoff, S., & Ng, P. C. (2009). Predicting the effects of coding non-synonymous

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient

Lassmann, T., Hayashizaki, Y., & Daub, C. O. (2009). TagDust--a program to eliminate

Lassmann, T., Hayashizaki, Y., & Daub, C. O. (2011). SAMStat: monitoring biases in next

doi:10.1101/gr.229202. Article published online before March 2002

in next‐generation sequencing data. *Plant Biotechnology Journal*, *7*(4), 312-317.

Dangl, J. L., et al. (2007). Extending assembly of short DNA sequences to handle error. *Bioinformatics (Oxford, England)*, *23*(21), 2942-2944.

correction of sequencing errors. *Genome Biology*, *11*(11), R116. doi:10.1186/gb-2010-

*BioEssays: News and Reviews in Molecular, Cellular and Developmental Biology*, *32*(6),

G. M., et al. (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. *Bioinformatics (Oxford, England)*, *25*(17), 2283-2285.

human genomes. *Briefings in Bioinformatics*, *11*(5), 484 -498. doi:10.1093/bib/bbq016

M., et al. (2009). PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end

Microindel detection in short-read sequence data. *Bioinformatics (Oxford, England)*,

variants on protein function using the SIFT algorithm. *Nature Protocols*, *4*(7), 1073-

alignment of short DNA sequences to the human genome. *Genome Biology*, *10*(3),

artifacts from next generation sequencing data. *Bioinformatics (Oxford, England)*,

generation sequencing data. *Bioinformatics (Oxford, England)*, *27*(1), 130-131.

*Molecular Biology: AMB*, *5*, 3. doi:10.1186/1748-7188-5-3

*Bioinformatics*, *10*(1), 35-52. doi:10.1093/bib/bbn047

doi:10.1111/j.1467-7652.2009.00406.x

doi:10.1093/bioinformatics/btm451

524-536. doi:10.1002/bies.200900181

doi:10.1093/bioinformatics/btp373

1081. doi:10.1038/nprot.2009.86

R25. doi:10.1186/gb-2009-10-3-r25

doi:10.1093/bioinformatics/btq614

*26*(6), 722-729. doi:10.1093/bioinformatics/btq027

*25*(21), 2839-2840. doi:10.1093/bioinformatics/btp527

11-11-r116


Deep Sequencing Data Analysis: Challenges and Solutions 677

Myers, E. W. (1995). Toward simplifying and accurately formulating fragment assembly.

Ng, P. C., & Henikoff, S. (2006). Predicting the effects of amino acid substitutions on protein

Nothnagel, M., Herrmann, A., Wolf, A., Schreiber, S., Platzer, M., Siebert, R., Krawczak, M.,

*Novoalign*. (n.d.).. Retrieved from http://www.novocraft.com/main/page.php?s=novoalign Paşaniuc, B., Zaitlen, N., & Halperin, E. (2011). Accurate Estimation of Expression Levels of

Paszkiewicz, K., & Studholme, D. J. (2010). De novo assembly of short sequence reads.

Pop, M., Phillippy, A., Delcher, A. L., & Salzberg, S. L. (2004). Comparative genome

Qi, J., Zhao, F., Buboltz, A., & Schuster, S. C. (2010). inGAP: an integrated next-generation

Ramensky, V., Bork, P., & Sunyaev, S. (2002). Human non-synonymous SNPs: server and

Schadt, E. E., Turner, S., & Kasarskis, A. (2010). A window into third-generation sequencing. *Human Molecular Genetics*, *19*(R2), R227-240. doi:10.1093/hmg/ddq416 Schmieder, R., & Edwards, R. (2011). Quality control and preprocessing of metagenomic

Schmieder, R., Lim, Y. W., Rohwer, F., & Edwards, R. (2010). TagCleaner: Identification and

Schwartz, S., Oren, R., & Ast, G. (2011). Detection and removal of biases in the analysis of

Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. M., & Birol, I. (2009).

Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., Fuellen, G.,

Thusberg, J., Olatubosun, A., & Vihinen, M. (2011). Performance of mutation pathogenicity

Trapnell, C., & Salzberg, S. L. (2009). How to map billions of short reads onto genomes.

*Nature Biotechnology*, *27*(5), 455-457. doi:10.1038/nbt0509-455

genome analysis pipeline. *Bioinformatics (Oxford, England)*, *26*(1), 127-129.

datasets. *Bioinformatics (Oxford, England)*, *27*(6), 863-864.

removal of tag sequences from genomic and metagenomic datasets. *BMC* 

next-generation sequencing reads. *PloS One*, *6*(1), e16685.

ABySS: a parallel assembler for short read sequence data. *Genome Research*, *19*(6),

et al. (2002). The Bioperl toolkit: Perl modules for the life sciences. *Genome Research*,

prediction methods on missense variants. *Human Mutation*, *32*(4), 358-368.

*Briefings in Bioinformatics*, *11*(5), 457-472. doi:10.1093/bib/bbq020 *PICARD*. (n.d.).. Retrieved from http://picard.sourceforge.net/index.shtml

doi:10.1146/annurev.genom.7.080505.115630

*Human Genetics*. doi:10.1007/s00439-011-0971-3

assembly. *Briefings in Bioinformatics*, *5*(3), 237-248.

survey. *Nucleic Acids Research*, *30*(17), 3894-3900.

*Bioinformatics*, *11*, 341. doi:10.1186/1471-2105-11-341

doi:10.1089/cmb.2010.0259

doi:10.1093/bioinformatics/btp615

doi:10.1093/bioinformatics/btr026

doi:10.1371/journal.pone.0016685

1117-1123. doi:10.1101/gr.089532.108

*12*(10), 1611-1618. doi:10.1101/gr.361602

doi:10.1002/humu.21445

*2*(2), 275-290.

*Journal of Computational Biology: A Journal of Computational Molecular Cell Biology*,

function. *Annual Review of Genomics and Human Genetics*, *7*, 61-80.

et al. (2011). Technology-specific error signatures in the 1000 Genomes Project data.

Homologous Genes in RNA-seq Experiments. *Journal of Computational Biology: A Journal of Computational Molecular Cell Biology*, *18*(3), 459-468.


Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J.,

Martínez-Alcántara, A., Ballesteros, E., Feng, C., Rojas, M., Koshinsky, H., Fofanov, V. Y.,

McKenna, A., Hanna, Matthew, Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A.,

McKernan, K. J., Peckham, H. E., Costa, G. L., McLaughlin, S. F., Fu, Y., Tsung, E. F.,

encoding. *Genome Research*, *19*(9), 1527-1541. doi:10.1101/gr.091868.109 McKernan, K. J., Peckham, H. E., Costa, G. L., McLaughlin, S. F., Fu, Y., Tsung, E. F.,

encoding. *Genome Research*, *19*(9), 1527-1541. doi:10.1101/gr.091868.109 McLaren, W., Pritchard, B., Rios, D., Chen, Y., Flicek, P., & Cunningham, F. (2010). Deriving

*Research*, *20*(5), 675-684. doi:10.1101/gr.096966.109

doi:10.1038/nature03959

doi:10.1093/bioinformatics/btp429

1303. doi:10.1101/gr.107524.110

S13-20. doi:10.1038/nmeth.1374

doi:10.1093/nar/gkl869

doi:10.1038/nmeth.1226

R136. doi:10.1093/hmg/ddq400

*Genetics*, *11*(1), 31-46. doi:10.1038/nrg2626

Braverman, M. S., Chen, Y.-J., & Chen, Z. (2005b). Genome sequencing in microfabricated high-density picolitre reactors. *Nature*, *437*(7057), 376-380.

Havlak, P., et al. (2009). PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. *Bioinformatics (Oxford, England)*, *25*(18), 2438-2439.

Garimella, K., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. *Genome Research*, *20*(9), 1297-

Clouser, C. R., et al. (2009). Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base

Clouser, C. R., et al. (2009). Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base

the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. *Bioinformatics*, *26*(16), 2069 -2070. doi:10.1093/bioinformatics/btq330 Meader, S., Hillier, L. W., Locke, D., Ponting, C. P., & Lunter, G. (2010). Genome assembly

quality: assessment and improvement using the neutral indel model. *Genome* 

structural variation with next-generation sequencing. *Nature Methods*, *6*(11 Suppl),

sequence and function evolution data with expanded representation of biological pathways. *Nucleic Acids Research*, *35*(Database issue), D247-252.

Devine, Scott E. (2006). An initial map of insertion and deletion (INDEL) variation in the human genome. *Genome Research*, *16*(9), 1182-1190. doi:10.1101/gr.4565806 Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and

quantifying mammalian transcriptomes by RNA-Seq. *Nature Methods*, *5*(7), 621-628.

deletions (INDELs) in human genomes. *Human Molecular Genetics*, *19*(R2), R131-

Medvedev, P., Stanciu, M., & Brudno, M. (2009). Computational methods for discovering

Metzker, M. L. (2010). Sequencing technologies - the next generation. *Nature Reviews.* 

Mi, H., Guo, N., Kejariwal, A., & Thomas, P. D. (2007). PANTHER version 6: protein

Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. *Genomics*, *95*(6), 315-327. doi:10.1016/j.ygeno.2010.03.001 Mills, Ryan E, Luttig, C. T., Larkins, C. E., Beauchamp, A., Tsui, C., Pittard, W Stephen, &

Mullaney, J. M., Mills, R. E., Pittard, W. S., & Devine, S. E. (2010). Small insertions and


**30** 


### **Whole Genome Annotation: In Silico Analysis**

Vasco Azevedo et al.\*

*Federal University of Minas Gerais (UFMG) and Federal University of Pará (UFPA), Brazil* 

#### **1. Introduction**


After a genome is assembled, the next step is genome annotation, which generates the data that make various kinds of research on the organism possible. The complete DNA sequence of the organism is mapped in the regions pertinent to the research objectives. In this chapter, we explore relevant ongoing research on genes and treat the gene as the basic mapping unit. Gene prediction is the first hurdle that follows the extensive and intensive work of the first item, genome assembly. Gene prediction relies on computational techniques for recognizing gene sequences, including stop codons and the initial portions of coding sequences; it applies empirical rules about the minimum length of a coding sequence (CDS) and is limited by coding sequences that overlap on the forward and reverse strands.

Completion of the computational gene prediction step initiates the functional annotation stage. Functional annotation, item 3, can initially be done by computer, using similarity from sequence alignment. However, no software can generate a functional annotation without many false-positive results, because conserved protein domains with varied functions make alignment-based assignment difficult. Therefore, after automatic annotation, the predicted genes need to be revised manually. In manual curation, item 4, an expert can locate frameshifts in the DNA strand more accurately. Depending on the number of errors found, genome annotation may be postponed, requiring a return to the previous stage, genome assembly. The principal contributions of manual curation are usually correction of the start codon position, gene name and gene product and, finally, identification of frameshifts.

When functional annotation is completed, the genome should be submitted. Submission follows the assembly and annotation steps and makes the generated data available in public-access databanks; it is a prerequisite for publication in scientific journals. Another advantage of depositing a genome in public-access sites is that it permits the use of various genome analysis tools, for example searches for genomic plasticity, pangenomic studies, identification of exported antigens and evaluation of innate and adaptive immune responses. The pangenome concept of a species, item 5, can be used as a filter for targeting candidates for vaccines, diagnostic kits and drug development. For drug development, the

<sup>\*</sup> Vinicius Abreu, Sintia Almeida, Anderson Santos, Siomar Soares, Amjad Ali, Anne Pinto, Aryane Magalhães, Eudes Barbosa, Rommel Ramos, Louise Cerdeira, Adriana Carneiro, Paula Schneider, Artur Silva and Anderson Miyoshi

*Federal University of Minas Gerais (UFMG) and Federal University of Pará (UFPA), Brazil* 


Gene prediction methodology for eukaryotes involves two distinct aspects: the first concerns the information used for gene recognition, essentially the recognition of functional signals in the DNA strand; the second concerns the algorithms implemented by prediction programs to predict gene structure and organization accurately. The search for signals can be divided into two mechanisms for locating genes. One classifies the content of the DNA strand and the other searches for functional signals in the genome:

(i) The content sensor classifies DNA regions into coding and non-coding segments (introns, intergenic regions and untranslated regions). This mechanism involves two approaches, intrinsic and extrinsic. The extrinsic approach relies on the assumption that coding regions are evolutionarily more conserved than non-coding regions. Consequently, this methodology employs local alignment tools such as BLAST (Johnson et al., 2008), which make comparisons possible within the genome and between closely related species. However, one important flaw of this approach is that homologous sequences must already exist in the database for results to be extracted. If none is found, this methodology cannot determine whether a region codes for a protein (Sleator, 2010).

(ii) The functional sensor approach searches the genome for consensus sequences. Consensus sequences are extracted from multiple alignments of functionally related, documented sequences. The functional signals involve transcription, translation and splice sites. Transcriptional signals include the CAP signal at the transcriptional start site and the polyadenylation signal located 20 to 30 bp downstream of the coding region. Another important signal to identify is the translation initiation site, although this feature has limitations due to a lack of knowledge concerning initiation sites in eukaryotes (Mathé et al., 2002).

#### **2.3 Prokaryotes**

Unlike eukaryotic genomes, archaeal, bacterial and viral genomes are highly gene-dense. The protein-coding regions usually represent more than 90% of the genome. Therefore, the accuracy of gene predictors depends primarily on determining which of the six reading frames contains the real gene. The simplest approach to gene prediction is to look for Open Reading Frames (ORFs). An ORF is a DNA sequence that begins at a start codon and ends at a stop codon, with no other intervening stop codon. One way to locate genes is to look for ORFs with the mean size of proteins (roughly 900 base pairs) (Allen et al., 2004). Therefore, long ORFs indicate possible genes, although this methodology fails to predict small genes.

The major problem in simply applying this technique is the possibility of ORF overlap on the different DNA strands. This approach must be used along with guidelines to avoid
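The ORF search described for prokaryotes can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function names (`reverse_complement`, `orfs_in_frame`, `find_orfs`) and the `min_length` parameter are assumptions made for this example, and only ATG is treated as a start codon, which is a simplification (bacteria also use alternative start codons such as GTG and TTG). The scan covers the three frames of the given strand and the three frames of its reverse complement.

```python
START_CODON = "ATG"                     # simplification: alternative starts are ignored
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for each ORF in one reading frame.

    'end' is exclusive and includes the stop codon. The first in-frame ATG
    after the previous stop is used, so each stop yields its longest ORF.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(seq, min_length=0):
    """Return ORFs from all six frames as (strand, frame, start, end, sequence).

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(strand_seq, frame):
                if end - start >= min_length:
                    results.append((strand, frame, start, end, strand_seq[start:end]))
    return results


if __name__ == "__main__":
    for orf in find_orfs("AAATGGCCTAAGG"):
        print(orf)
```

Raising `min_length` towards the roughly 900 bp mentioned in the text keeps only long ORFs, which shows directly why small genes are missed by this approach. Overlapping ORFs on opposite strands are all reported here; choosing between them requires additional rules that this sketch does not attempt.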

core set of proteins is a more likely source of useful information for developing both vaccines and diagnostic materials for a unique pangenome set of a species of interest.

Genomic plasticity, item 6, is the dynamic property of genomes that involves DNA gain, loss and rearrangement; it allows bacteria to adapt to new hosts and environments. Several mechanisms can drive these changes, including point mutations, gene conversions, rearrangements (inversions or translocations), deletions and insertions of DNA from other organisms (through plasmids, bacteriophages, transposons, insertion elements and genomic islands). Gene acquisition and loss by all these mechanisms influence bacterial lifestyles and physiological versatility. In silico analysis of horizontal gene transfer (HGT) regions has become feasible with the introduction of next-generation sequencing technologies, which allow prokaryotic genomes to be sequenced faster than with the earlier Sanger method and at a considerably lower operational cost. Consequently, the number of complete genome sequences available for analysis has grown and continues to grow rapidly.

In the post-genomic stage, Reverse Vaccinology (RV), item 7, can provide predictions of the subcellular locations of an entire predicted proteome. In addition to these annotations, prediction of peptides with high affinity for class I and II MHC proteins is another in silico analysis that increases the probability of selecting antigens that can promote immune responses in organisms infected by a pathogen. The field of research known as immunoinformatics, item 8, gives us the opportunity to analyze antigens with greater selectivity and increases the likelihood of developing a successful vaccine.

#### **2. Gene prediction**

The development of modern sequencing technology has resulted in an exponential increase in the number of available genome sequences. To illustrate, in 1997 there were 10 complete bacterial genome sequences available at the NCBI (Lukashin & Borodovsky, 1998); by 2011, this number had risen sharply to 1,538 (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). This enormous increase in the quantity of available information stimulated the development of tools for gene prediction. Developing these tools is a tremendous challenge and a major contribution of bioinformatics to the field of genomics.

#### **2.1 Gene prediction strategies**

Gene prediction programs can be divided into two categories: empirical predictors, which rely on sequence similarity, and ab initio predictors, which use signal and content sensors. Empirical gene predictors search for similarity in the genome; they predict genes based on homology with sequences in known databases, such as genomic DNA, cDNA, dbEST and protein databases. This approach facilitates the identification of well-conserved exons. Ab initio gene finders use the sequence information captured by signal and content sensors. Usually, these programs are based on Hidden Markov Models. Ab initio predictors can be further organized by the number of genome sequences used in the analysis: single-, dual- and multiple-genome predictors. Integrated approaches couple the extrinsic methodology of empirical gene finders with intrinsic ab initio prediction. This combination significantly improves gene prediction protocols (Allen et al., 2004).
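The remark that ab initio gene finders are usually based on Hidden Markov Models can be made concrete with a toy example. The sketch below decodes a DNA string with a two-state model, a GC-rich "coding-like" state and an AT-rich "non-coding-like" state, using the Viterbi algorithm. The state names, all probabilities and the function name `viterbi` are assumptions chosen for illustration; real gene finders use far richer models (codon structure, signal states, trained parameters).

```python
import math

# Toy two-state HMM. "C" = coding-like (GC-rich), "N" = non-coding-like (AT-rich).
# All probabilities are made-up illustrative values, not trained parameters.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    if not seq:
        return ""
    # score[s]: best log-probability of any path that ends in state s at the current base
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    dna = "ATATATTAAT" + "GCGCGGCCGC" + "TATAATATTA"
    print(dna)
    print(viterbi(dna))
```

The content sensor here is nothing more than base composition, which is enough to show how an HMM segments a sequence into labelled regions; the same decoding step applies when the states represent exons, introns and intergenic regions.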

#### **2.2 Eukaryotes**


The complexity of the challenge faced by Bioinformatics is only completely understood when we look at the complexity of the eukaryotic genome. Within genomes, genes are not organized in a continuous cluster. Instead, the coding regions (exons) are often widely interspersed with non-coding intervening sequences (introns). Furthermore, in many cases the intronic region is much larger than the exonic region. This low density of coding sequence is evident in the human genome, in which only approximately 3% of the DNA generates proteins. The exon and intron issue can be compared to trying to read a non-continuous article in a journal. One must first identify in which part of the journal (genome) the article (gene) of interest is; then, as the DNA sequences are read, it is necessary to identify which part is informative (exon) and which part contains random information (intron). Also, genes can be altered by alternative splicing, which is a process that generates multiple protein sequences from the same gene sequence template (Schellenberg et al., 2008).

Gene prediction methodology for eukaryotes involves two distinct aspects; the first focuses on the information utilized for gene recognition, basically recognizing signal functions in the DNA strand; the second uses algorithms implemented by prediction programs for accurate prediction of gene structure and organization. The signal function search can be divided into two mechanisms utilized for locating genes. One classifies the content of the DNA strand and the other searches for functional signals in the genome:

(i) The content sensor classifies the DNA regions into coding and non-coding segments (introns, intergenic regions and untranslated regions). This mechanism involves two approaches, intrinsic and extrinsic. The extrinsic approach relies on the assumption that coding regions are evolutionarily more conserved than non-coding regions. Consequently, this methodology employs local alignment tools, like BLAST (Johnson et al., 2008); this makes it possible to make comparisons within the genome and between closely-related species. However, one important flaw in this approach is that homologs must be present in the database in order to obtain results. If none is found, this methodology is unable to determine whether a region "codes" for a protein (Sleator, 2010).

(ii) The functional sensor approach searches the genome for consensus sequences. Consensus sequences are extracted from multiple alignments of functionally-related documented sequences. The functional signals involve transcription, translation and splice sites. Transcriptional signals include the CAP signal at the transcriptional start site and the polyadenylation signal located 20 to 30 bp downstream of the coding region. Another important signal to identify is the translation initiation site, although this feature has limitations due to a lack of knowledge concerning initiation sites in eukaryotes (Mathé et al., 2002).
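The idea behind the functional sensor can be illustrated with a minimal Python sketch: a consensus is derived column by column from a set of aligned signal sites and then searched for in a sequence, allowing a few mismatches. The aligned sites and the test sequence below are invented for illustration, and the function names `consensus` and `find_matches` are hypothetical; real gene finders typically use weighted or probabilistic signal models rather than a plain consensus string.

```python
from collections import Counter


def consensus(aligned_sites):
    """Return the most frequent base in each column of equal-length aligned sites."""
    if not aligned_sites:
        raise ValueError("need at least one aligned site")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    result = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        # Ties are broken alphabetically so the output is deterministic.
        result.append(min(counts, key=lambda base: (-counts[base], base)))
    return "".join(result)


def find_matches(sequence, motif, max_mismatches=1):
    """Return 0-based positions where `sequence` matches `motif` within the mismatch budget."""
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


# Toy aligned sites (made-up data, not taken from any database).
sites = ["AATAAA", "AATAAA", "ATTAAA", "AATAAA", "AGTAAA"]
motif = consensus(sites)
positions = find_matches("CCGAATAAAGG", motif, max_mismatches=0)
```

Raising `max_mismatches` makes the sensor more sensitive at the cost of more false positives, which is the same trade-off the text describes for signal-based prediction.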

#### **2.3 Prokaryotes**

Unlike eukaryotic genomes, archaeal, bacterial and viral genomes are highly gene-dense; the protein-coding regions usually represent more than 90% of the genome. The accuracy of gene predictors therefore depends primarily on determining which of the six reading frames contains the real gene. The simplest approach to gene prediction is to look for Open Reading Frames (ORFs). An ORF is a DNA sequence that begins at a start codon and ends at a stop codon, with no other intervening in-frame stop codon. One way to locate genes is to look for ORFs close to the mean size of protein-coding genes (roughly 900 base pairs) (Allen et al., 2004). Long ORFs thus indicate possible genes, although this methodology fails to predict small genes.
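The six-frame ORF search described above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular gene finder: it assumes ATG as the only start codon (prokaryotes also use alternatives such as GTG and TTG), reports only the first start codon of each ORF, and the function names and the `min_len` default are chosen here for illustration.

```python
START_CODONS = {"ATG"}  # simplifying assumption: ATG only
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=90):
    """Scan all six reading frames and return ORFs of at least `min_len` nucleotides.

    Each result is (strand, frame, start, end, orf_sequence). Coordinates are
    0-based on the scanned strand, `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first start codon since the last stop
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


# Tiny made-up sequence with a single short ORF in frame 2 of the forward strand.
example = find_orfs("CCATGAAATTTGGGTAACC", min_len=15)
```

With a realistic `min_len` (a few hundred nucleotides), short genes are missed, which is exactly the limitation noted in the text.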

The major problem in simply applying this technique is the possibility of ORF overlap on the different DNA strands. This approach must be used along with guidelines to avoid overlapping, choosing the more likely candidates. Also, numerous false positives are found in non-coding regions. Due to the high gene density, it is difficult to confidently state that any gene predicted in a non-coding region is false. This problem can be minimized by searching for homologies in closely-related organisms. If we do not find a conserved sequence in related species, it is assumed that the prediction (of a gene) is false.

Another problem faced by prediction programs in prokaryotes is how to determine the start codon of a sequence. The first initiation site in a sequence is not necessarily the true one. To solve this problem, programs can employ ribosome binding sites (RBS), which provide a strong signal indicating the position of the true start site. Finally, prediction accuracy drops in high-GC-content genomes. GC-rich genomes contain fewer stop codons and therefore more spurious ORFs. These false ORFs are often chosen by prediction programs instead of the real ones in the same DNA region. Additionally, the longer ORFs in GC-rich genomes contain more potential start codons, leading to a drop in the accuracy of translation initiation site prediction (Hyatt et al., 2010).
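Why GC-rich genomes contain fewer stop codons can be seen from a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that bases are drawn independently with P(G) = P(C) and P(A) = P(T); real genomes deviate from this model, so the numbers indicate a trend rather than a prediction. Since the three stop codons (TAA, TAG, TGA) are AT-rich, their probability falls as GC content rises, and random ORFs become longer.

```python
def stop_codon_probability(gc):
    """Probability that a random codon is TAA, TAG or TGA for a given GC fraction.

    Assumes independent bases with P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    at = (1.0 - gc) / 2.0  # probability of A (or of T)
    g = gc / 2.0           # probability of G (or of C)
    # TAA contributes at**3; TAG and TGA each contribute at**2 * g.
    return at ** 3 + 2 * at ** 2 * g


def expected_codons_until_stop(gc):
    """Mean number of codons read up to and including the first stop (geometric model)."""
    p = stop_codon_probability(gc)
    return float("inf") if p == 0 else 1.0 / p


for gc in (0.3, 0.5, 0.7):
    print(f"GC = {gc:.0%}: mean random ORF length = {expected_codons_until_stop(gc):.1f} codons")
```

Under this model the mean length of a random open frame more than doubles between 50% and 70% GC, which is consistent with the increase in spurious ORFs described above.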

#### **2.4 Tools**

#### **2.4.1 Glimmer**

The first version of Glimmer (Gene Locator and Interpolated Markov ModelER) was released in 1998; version 3.02 was released in 2006. Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. Glimmer uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from noncoding DNA. Glimmer was the primary microbial gene finder used at The Institute for Genomic Research (TIGR), where it was first developed, and it has been used to annotate the complete genomes of over 100 bacterial species from TIGR and other labs. Like other gene prediction programs, Glimmer can be installed and run locally and also has a web-based platform (Salzberg et al., 1998). All one needs for online gene prediction of a genome is the FASTA version of the sequence and access to the site:

http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer\_3.cgi.

#### **2.4.2 FgenesB**

FgenesB is a package developed by Softberry Inc. for automatic annotation of bacterial genomes. The gene prediction algorithm is based on Markov chain models of coding regions and translation and termination sites. The package includes options to work on sets of sequences, such as scaffolds of bacterial genomes or short sequencing reads extracted from bacterial communities. For community sequence annotation, it includes the ABsplit program, which separates archaebacterial and eubacterial sequences. FgenesB was used in the first published bacterial community annotation project (Tyson et al., 2004).

#### **2.4.3 Prodigal**

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Prodigal focuses specifically on three goals: improved gene structure prediction, improved translation initiation site recognition, and reduced false positives (Hyatt et al., 2010). The source code is freely available under the General Public License and the program can be accessed at http://compbio.ornl.gov/prodigal/.

#### **2.4.4 GeneMarkTM**

GeneMark is a public access program for gene prediction in eukaryotes. It is a family of gene prediction programs developed at Georgia Institute of Technology, Atlanta, Georgia, USA. GeneMark can be used in two ways: online, where predictions are made using one of the many available pre-built models; or, for novel genomes, by installing and running the program locally. The web-based version of GeneMark is available at http://exon.biology.gatech.edu/.

For gene prediction in eukaryotes, GeneMark combines two programs, GeneMark-E\* and GeneMark.hmm-E. The GeneMark-E program determines the protein-coding potential of a DNA sequence (within a sliding window) by using species-specific parameters of the Markov models of coding and non-coding regions. This approach makes it possible to delineate local variations in coding potential. The GeneMark graph shows details of the protein-coding potential distribution along a sequence, while the GeneMark.hmm-E program predicts genes and intergenic regions in a sequence as a whole. The Hidden Markov models take advantage of the "grammar" of gene organization. The GeneMark.hmm programs identify the most likely parse of the whole DNA sequence into protein-coding genes (with possible introns) and intergenic regions.

The statistical model employed in the GeneMark.hmm algorithm is a hidden Markov model. It includes hidden states for initial, internal and terminal exons, introns, intergenic regions and single exon genes. It also includes hidden states for start site (initiation site), stop site (termination site), and donor and acceptor splice sites. The protein-coding states (initial, internal, terminal exons and single exon genes) emit nucleotide sequences modeled by inhomogeneous 3–periodic fifth–order Markov chains. The non-coding states (intron and intergenic regions) emit sequences modeled by homogeneous Markov chains (Lukashin & Borodovsky, 1998).
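To make the distinction between 3-periodic (coding) and homogeneous (non-coding) Markov chains concrete, the following Python sketch scores a sequence window by the log-odds of a coding model against a non-coding model. It is a toy illustration under strong assumptions and not the GeneMark implementation: it uses second-order rather than fifth-order chains so that tiny training sets suffice, add-one pseudocounts, made-up training sequences, and hypothetical function names.

```python
import math
from collections import defaultdict


def train_periodic(seqs, order=2):
    """Count (context -> next base) separately for each of the three codon positions.

    Training sequences are assumed to be in frame, i.e. index 0 is codon position 0.
    """
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for s in seqs:
        for i in range(order, len(s)):
            counts[i % 3][s[i - order:i]][s[i]] += 1
    return counts


def train_homogeneous(seqs, order=2):
    """Count (context -> next base) with one table shared by all positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def log_prob(table, context, base, pseudo=1.0):
    """Smoothed log P(base | context); unseen contexts fall back to uniform."""
    row = table.get(context, {})
    total = sum(row.values()) + 4 * pseudo
    return math.log((row.get(base, 0) + pseudo) / total)


def coding_log_odds(window, coding, noncoding, order=2, frame=0):
    """Return log P(window | coding, frame) - log P(window | non-coding)."""
    score = 0.0
    for i in range(order, len(window)):
        context, base = window[i - order:i], window[i]
        phase = (i - frame) % 3  # codon position of base i
        score += log_prob(coding[phase], context, base)
        score -= log_prob(noncoding, context, base)
    return score


# Made-up training data: one "gene" and one "intergenic" sequence.
coding_model = train_periodic(["ATG" + "GCT" * 6])
noncoding_model = train_homogeneous(["AT" * 10])

gene_like = coding_log_odds("GCTGCTGCTGCT", coding_model, noncoding_model)
intergenic_like = coding_log_odds("ATATATATATAT", coding_model, noncoding_model)
```

Sliding such a window along a sequence and plotting the score for each frame gives a coding-potential profile of the kind described for the GeneMark graph; a full HMM additionally models the transitions between exon, intron and intergenic states.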

#### **3. Automated functional annotation**

Automated functional annotation of genomes can be quite efficient because it is a computational process based on the alignment of ORF sequences of the organism with sequences from various other organisms (Kislyuk et al., 2010). Public domain databases contain full annotations of thousands of prokaryotic organisms (Benson et al., 2008). Automatic functional annotation takes advantage of knowledge concerning ORFs of homologous organisms, saving considerable time in manual curation (Li et al., 2010). However, care must be taken with fully automated functional annotation, since similarity of sequences can easily incur false positives (Lorenzi et al., 2010). In this section we discuss the advantages and dangers of using fully-automated functional annotation, and we explore some features of tools and services for this purpose.

#### **3.1 Massive sequence alignments must be planned**

Algorithms for alignment of biological sequences are intensively used in automatic functional annotation (Aparicio et al., 2006; Meyer et al., 2003). Alignments of ORFs from a newly assembled genome with counterpart ORFs can provide the first hints about the new genome. For an organism with about 2,000 ORFs, a similarity search against the NCBI database of non-redundant (NR) proteins can consume several processing hours. For example, assuming that this analysis is done on a computer isolated from the internet, with 24 GB of RAM and eight processors totaling 24 GHz of CPU, this task will consume approximately eight hours of processing time.

Though it is a completely automated computer process, the user has considerable responsibility for setting the conditions to be used in the computation in order to obtain good quality data. These conditions define the quality criteria that best fit the type of organism, for example, the cut-off value for a significant alignment with sequences of other organisms in the NCBI, the number of homologous sequences to be returned as a result and the file format of the output alignment. An additional parameter is required if the query sequences and the targeted search sequences (subject) are of different types (nucleotides versus amino acids). This parameter selects the most adequate codon translation table for the organism in question, so that the alignment algorithm is able to interpret the correspondence between the query sequences and the subject. The number of parameters of a sequence alignment algorithm can be quite large, and learning to use them optimally requires considerable training. Our objective here is not to explore every possible situation, but to alert users that alignment results can be improved by reading the manual of the algorithm and adjusting its parameters to the particular query organism or subject. Thus, when beginning a massive sequence alignment project involving a novel genome, with an analysis that will take hours and create high expectations, it is advisable not to rely solely on the default configuration of these alignment algorithms. It is worth taking the time to weigh and incorporate the options that will determine the success or failure of these alignments.
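As a hedged illustration of the parameters discussed above, the Python sketch below assembles a command line for the NCBI BLAST+ programs without running it. The option names (`-evalue`, `-max_target_seqs`, `-outfmt`, `-num_threads`, `-query_gencode`) are BLAST+ command-line options; the helper function, the default values and the file names are assumptions made for this example and should be adapted to the organism and the installed BLAST+ version. Translation table 11 is the NCBI genetic code for bacteria and archaea.

```python
import shlex


def build_blast_command(query_fasta, database, out_path, *, program="blastx",
                        evalue=1e-5, max_target_seqs=10, outfmt="6",
                        genetic_code=11, threads=8):
    """Return a BLAST+ command as a list of arguments (nothing is executed)."""
    cmd = [
        program,
        "-query", query_fasta,
        "-db", database,
        "-out", out_path,
        "-evalue", str(evalue),                    # cut-off for a significant alignment
        "-max_target_seqs", str(max_target_seqs),  # number of subject sequences kept per query
        "-outfmt", outfmt,                         # 6 = tabular output
        "-num_threads", str(threads),
    ]
    if program in {"blastx", "tblastx"}:
        # Nucleotide queries are translated, so the codon table must be chosen.
        cmd += ["-query_gencode", str(genetic_code)]
    return cmd


# Hypothetical file and database names.
cmd = build_blast_command("orfs.fasta", "nr", "orfs_vs_nr.tsv")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the database are installed. Keeping the parameters in one function makes it easy to record exactly which settings produced a given annotation.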

#### **3.2 Knowledge reapplication and time saving**

There has been significant growth in the number of DNA sequences available in public databases because of new genome sequencing technologies, which have made it simpler, more efficient and cheaper to obtain complete genomes (Zhao & Grant, 2010). Fully assembled and annotated genomes of various forms of bacterial life are available to facilitate the processing and inclusion of a newly assembled genome. This wide range of genomes provides the opportunity for new research into large-scale SNPs, DNA methylation and mRNA expression profiles, and resequencing data (Datta et al., 2010). It also allows comparison of annotations from different research groups working with different organisms, some of which may be homologous to a newly-sequenced genome. Just as one can take advantage of knowledge about the function of genes from different organisms, it is also advisable to use a researcher's personal knowledge of a specific organism in order to accelerate the process of automatic annotation. Given evidence of a high degree of evolutionary proximity between a newly-assembled genome and a particular homologous organism that already has a fully-assembled and annotated genome, we can choose to use only the annotation of that organism as a resource for a first automatic annotation.

The problems a researcher would normally encounter when utilizing annotations from various genomes could be resolved by comparison with the annotation of a homologous organism. This situation is common when one examines the pangenome of a species, as it is expected that most of the coding sequences of different strains of bacteria are not very different (Trost et al., 2010). In this case, it appears to be advantageous to identify a small set of target organisms (subject) in a sequence similarity search, with the objective of providing a first genome annotation (query); this may even be a set with only one organism.

#### **3.3 Error propagation: Automated versus manual annotation**

It is important to bear in mind that GenBank is not a fully curated database (Benson et al., 2008); many genomes may have been deposited only as automatic annotations. With current technology, it is not possible to dispense with manual curation of an automatic annotation, or even with experimental evidence concerning gene prediction and annotation based on sequence similarities (Poptsova & Gogarten, 2010). Although it is not normally feasible to initially include experimental verification of gene prediction, it seems reasonable to take advantage of expert human annotation of genomes to help determine the outcome of automatic annotation. Assuming one is working on the pangenome of an organism, such a measure can reduce false positives not only in sequence similarity comparisons, but also in the determination of homologous genomes based on a particular annotation. During automatic annotation, a measure that has the potential to minimize error propagation would be to allocate different weights to sequence similarity results, favoring genes from organisms for which there is evidence of expert manual curation.
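The weighting idea in the last sentence can be sketched as follows. This is only one possible interpretation, written in Python with invented data: the `Hit` record, the `manually_curated` flag and the weight values are hypothetical, and in practice the weights would have to be calibrated and the curation status obtained from the source database.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    product: str
    bitscore: float
    manually_curated: bool  # hypothetical flag: subject genome was expert-curated


def best_annotation(hits, curated_weight=1.5, automatic_weight=1.0):
    """Pick the hit with the highest weighted score, favoring curated sources."""
    if not hits:
        return None

    def weighted_score(hit):
        weight = curated_weight if hit.manually_curated else automatic_weight
        return hit.bitscore * weight

    return max(hits, key=weighted_score)


# Invented example: the curated hit wins despite a lower raw bit score.
hits = [
    Hit("genomeA_0412", "hypothetical protein", 200.0, manually_curated=False),
    Hit("genomeB_0031", "phospholipase D", 150.0, manually_curated=True),
]
chosen = best_annotation(hits)
```

With equal weights the function reduces to the usual best-hit rule, so the effect of the weighting can be checked by comparing both outputs for the same set of hits.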

#### **3.4 Tools**

The following are some tools for automatic annotation of entire genomes, with brief descriptions of their core functionality and instructions on how to use them.

#### **3.4.1 GenDB**

One of the reasons that GenDB is included among a select set of tools for automatic annotation of genomes is the fact that it was developed for the web platform (Meyer et al., 2003). Geographically dispersed research groups can benefit from web interfaces using standard tools and a centralized database. Version 2.4 of GenDB has three modules: core, web, and gui. The core module has programs written in Perl that allow creation of an annotation project, importation of data in FASTA/EMBL format, execution of the automatic annotation pipeline, display of circular genomic maps, data export and annotation project deletion. Implementation of the programs in the web module allows a team of curators to work on the web and edit diverse features of various genes. The gui module has editing features that are more sophisticated than those of the web module, allowing execution of tasks performed by the core module, but with a graphical interface. The GenDB program performs sequence alignments using the program BLAST (Altschul et al., 1997) and allows incorporation of predictions of conserved domains of protein families based on InterProScan (Hunter et al., 2009), as well as transmembrane domains based on TMHMM (Krogh et al., 2001), and indications of export to the extracellular medium through SignalP (Bendtsen et al., 2004).

#### **3.4.2 BLAST2GO (B2G)**

This tool was designed as an interface for Gene Ontology (GO); additional features have transformed it into a more comprehensive annotation platform (Aparicio et al., 2006). The program menus cover the successive annotation steps, starting with an automatic alignment of the genome sequences against the NCBI non-redundant (NR) protein database, followed by prediction of conserved domains (InterProScan), GO annotation, mapping of GO terms to Enzyme Codes (EC) and subsequent visualization of molecular interactions in a genome by means of maps in the format of the Kyoto Encyclopedia of Genes and Genomes (KEGG). Being a visually oriented tool, it has graphical tools to help analyze the vast amount of data generated in the predictions. A user of B2G does not necessarily have to perform all the steps of analysis that are offered, but in order to advance to the next phase of analysis it is imperative that the previous phase be performed beforehand. Processing of an entire genome with approximately two thousand ORFs can take several days, as the first step is always sequence alignment against the NCBI NR database. Fortunately, B2G is designed to be a modular analysis tool. If a B2G user has computational resources that are more efficient than the shared resources on the public server, the user can align the sequences on local hardware, generate an output in HTML format and then continue with the annotation steps in B2G. Should the user be dissatisfied with the speed of GO term annotation on the shared B2G server, there is a version of B2G that can be run separately on the user's own hardware. The results generated in the offline mode can be uploaded to the online tool to continue the review process using a variety of tools, including statistical comparisons between two genomes. B2G was developed with Sun Java technology, which can be run on any operating system; however, the B2G offline module is designed to run on the Linux platform.

#### **3.4.3 CpDB relational schema: a practical example**

This tutorial has approximately 100 steps, including software installation and configuration, and editing of files with Linux commands or through the interfaces of biological sequence manipulation programs. The tutorial presumes that Artemis, Java (Sun) and BLAST version 2.2.20 or earlier are installed locally. Many of the file edits are made with the "sed" program, which is included in most Linux distributions. All of the steps in this manual can be automated in order to develop an automatic annotation pipeline, allowing the *Corynebacterium pseudotuberculosis* DataBase (CpDB), a relational database schema and set of tools for bacterial genome annotation and post-genome research, to become another web-based automatic annotation environment. For now, this tutorial has an instructional character, to help make a student aware of the necessities and difficulties involved in the process of automatic annotation of genomes. In order to obtain the tutorial files, type the following command in Linux, Ubuntu 10.10 or later:

`svn checkout svn://150.164.37.20/genomes/autoannotation --username=student --password=bioinfo`

After finalizing the verification of all of the files, this tutorial continues in the document "Tutorial.pdf", which will be in the folder "autoannotation".

#### **4. Manual curation**

Genome annotation is a process that consists of adding analyses and biological interpretations to DNA sequence information. This process can be divided into three main categories (Stein, 2001): annotation of nucleotides, proteins and processes. Annotation of nucleotides can be done when there is information about the complete genome (or DNA segments) of an organism. It involves looking for the physical location (position on the chromosome) of each part of the sequence and discovering the location of the genes, RNAs, repeat elements, etc. In the annotation of proteins, which is done when there is information about the genes (obtained by genome or cDNA sequencing) of an organism, there is a search for gene function. Besides general predictions about gene and protein function, other information can be found in an annotation, such as biochemical and structural properties of a protein, prediction of operons, gene ontology, evolutionary relationships and metabolic cycles (Stothard & Wishart, 2006). Consequently, functional annotation or manual curation is a fundamental part of the process of assembling and annotating a genome, in which the curator is the person responsible for validating the elements. In manual curation, all of the predicted genes will be validated and their products named (Stein, 2001). A more detailed description of the gene or gene family product is obtained through similarity analyses using protein data banks that contain well-characterized and conserved proteins (Overbeek et al., 2005).

#### **4.1 Technical terms used in manual annotation**

In functional annotation done with Artemis, several fields should be filled out to increase knowledge about particular genome elements. It is necessary to use annotation terms, which follow an official nomenclature developed for this purpose. Some of these terms and respective examples are given below. "LOCUS\_TAG" is the term used to identify all of the genome elements, except for the feature "misc". Generally, one uses an abbreviation to identify the particular species, followed by an underscore (\_) and numbers; for example: Cp1002\_0001 (*Corynebacterium pseudotuberculosis*, strain 1002). For tRNAs, the nomenclature is the abbreviation, followed by an underscore, a "t" and numbers, with a separate count that is not included in the total CDS count; for example: Cp1002\_t001. For rRNAs, the nomenclature is the abbreviation followed by an underscore, an "r" and numbers, again with a separate count that is not included in the total CDS count; for example: Cp1002\_r001. "PROTEIN\_ID" is used to designate all of the elements of the genome, except for the feature "misc". It is a standardized form used by NCBI to identify proteins; for example: gnl|gbufpa|Cp1002\_0001. "GENE" is one of the most important fields to be filled in during manual annotation; it gives the gene symbol of the protein; for example: pld. The field "SIMILARITY" corresponds to information obtained from the best similarity search result (BLASTp). Various types of information should be entered into this field, such as the organism in which the similar protein was found, the size of the amino acid sequence analyzed, the e-value and the percentage identity between the protein under study and the protein found in the data bank; for example: similar to *Corynebacterium pseudotuberculosis* 1002, hypothetical protein Cp1002\_00047 (345 aa), e-value: 0.0, 98% ID in 344 aa. "PRODUCT" holds a description of the gene product for which similarity was found in the public domain data bank; for example: Phospholipase D.
The tag "PSEUDO" should be added whenever a protein presents one or more breaks due to the insertion of a premature stop codon. These are the well-known cases of proteins with frameshifts or probable pseudogenes. Consequently, the manual annotation window has this pattern:

/gene="dnaA"

/product="Chromosomal replication initiation protein"

/locus\_tag="Cp1002\_0001"

/protein\_id="gnl|ufmg|Cp1002\_0001"

/colour=3

/similarity="Similar to *Corynebacterium pseudotuberculosis* FRC41, Chromosomal replication initiation protein (603 aa), e value: 0.0, 98% id in 599 aa"
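The naming scheme and the qualifier pattern above can be sketched in a few lines of Python. This is only an illustration, not part of Artemis: the function names `locus_tag` and `format_qualifiers`, the digit widths (four for CDS, three for tRNA and rRNA) and the `centre` argument are assumptions inferred from the examples in the text.

```python
def locus_tag(prefix, number, kind="CDS"):
    """Build a locus tag such as Cp1002_0001, Cp1002_t001 or Cp1002_r001.

    The letter and zero-padding per feature kind are assumptions taken
    from the examples in the text, not an official specification.
    """
    layout = {"CDS": ("", 4), "tRNA": ("t", 3), "rRNA": ("r", 3)}
    if kind not in layout:
        raise ValueError(f"unsupported feature kind: {kind}")
    letter, width = layout[kind]
    return f"{prefix}_{letter}{number:0{width}d}"


def format_qualifiers(gene, product, tag, centre, similarity, colour=3):
    """Return the qualifier lines in the order used in the pattern above."""
    return [
        f'/gene="{gene}"',
        f'/product="{product}"',
        f'/locus_tag="{tag}"',
        f'/protein_id="gnl|{centre}|{tag}"',  # centre is a hypothetical submitter code
        f"/colour={colour}",
        f'/similarity="{similarity}"',
    ]


if __name__ == "__main__":
    tag = locus_tag("Cp1002", 1)
    for line in format_qualifiers(
        "dnaA",
        "Chromosomal replication initiation protein",
        tag,
        "ufmg",
        "Similar to Corynebacterium pseudotuberculosis FRC41, "
        "Chromosomal replication initiation protein (603 aa), "
        "e value: 0.0, 98% id in 599 aa",
    ):
        print(line)
```

Keeping tRNA and rRNA numbering in separate counters, as the text requires, is left to the caller: each feature kind would receive its own running `number`.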

#### **4.2 Steps for manual curation**

Manual curation is a very complex task and is subject to errors for various reasons. One of these is a lack of standardization in the interpretation of BLAST results. Another problem is propagation of errors, which involves prediction of protein function based on proteins that were also predicted but could have imprecise or even incorrect annotation (Gilks et al., 2002). For these reasons, some criteria are suggested in order to obtain reliable functional annotation. The fundamental step is mining the data obtained from BLASTp similarity searches against protein data banks. It is recommended to give greater weight to the annotation of proteins from the same species, or from species that are phylogenetically close to the organism under study, which decreases the possibility of annotation errors. Another criterion is whether there is any consensus among the first 10 hits (the same protein is identified in several of them). In this case, even if the best hit is not identified as such, it is preferable to annotate the sequence as similar to that of an organism that appears several times in the BLASTp results and is within the consensus. In cases where there is no consensus, or when the e-value of the best hit (the first BLAST result, which corresponds to the best alignment within the data bank being searched) is markedly more significant than that of the following sequences, it is preferable to transfer the annotation of the best hit (Prosdocimi, 2003); in cases of non-significant alignments, a similarity search at the nucleotide level (BLASTn) should also be run. Other criteria are also analyzed, such as the percentage identity between the sequence being analyzed and the sequence in the data bank, the score and the e-value, as well as an evaluation of the pairwise alignment. This evaluation consists of checking the texture of the alignment (the number of gaps, the size of the gaps, and the number of conservative amino acid substitutions). If doubts remain, searches of domain and protein classification data banks are also commonly used.
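The consensus rule described above can be written down as a small decision function. The sketch below is one possible reading of that rule, not a published algorithm: the function name `choose_annotation`, the requirement that the consensus product covers at least half of the top hits (`min_share`) and the e-value gap of ten orders of magnitude (`evalue_gap`) are assumptions chosen for illustration.

```python
from collections import Counter


def choose_annotation(hits, top_n=10, min_share=0.5, evalue_gap=1e10):
    """Pick a product name from BLASTp hits ordered best first.

    hits: list of (product, evalue) tuples.
    min_share and evalue_gap are illustrative thresholds, not values
    taken from the text.
    """
    top = hits[:top_n]
    if not top:
        return None
    best_product, best_evalue = top[0]
    if len(top) == 1:
        return best_product

    # Best hit stands out: its e-value is many orders of magnitude
    # smaller than that of the next hit, so its annotation is kept.
    next_evalue = top[1][1]
    if next_evalue > 0 and best_evalue * evalue_gap <= next_evalue:
        return best_product

    # Otherwise look for a consensus product among the top hits,
    # ignoring differences in letter case and surrounding spaces.
    counts = Counter(product.strip().lower() for product, _ in top)
    name, count = counts.most_common(1)[0]
    if count > 1 and count / len(top) >= min_share:
        for product, _ in top:
            if product.strip().lower() == name:
                return product

    # No consensus: fall back to the best hit.
    return best_product
```

In practice the curator would still inspect the alignment itself; the function only formalizes the order in which the criteria are applied.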

#### **4.3 Frame shifts (Pseudogenes)**

Comparisons between non-coding regions of genomes of various prokaryotic species have aided in the identification and characterization of genome segments with regulatory roles (Pareja et al., 2006), contributing to the elucidation of genetic circuits of transcriptional regulation. Among these non-coding regions are pseudogenes, DNA sequences that are highly similar to functional genes but do not express a functional protein, probably because of deleterious mutations. These degraded genes contain one or more inactivating mutations, such as a nonsense mutation that introduces a premature stop codon, resulting in an incomplete protein, or a mutation that shifts the open reading frame (Lerat & Ochman, 2005). When such a break is found in the genome, the region is checked with Artemis, and the quality of the bases in that region is also evaluated. Whenever possible, addition or removal of erroneous bases can restore the reading frame. If there is no data that justifies addition or removal of bases, the genes should be classified as pseudogenes (tag /pseudo).
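A first screen for such breaks can be automated before the region is inspected in Artemis. The following sketch flags coding sequences whose length is not a multiple of three or that contain an in-frame stop codon before the final codon. The function names are hypothetical, and the check assumes the standard stop codons TAA, TAG and TGA; it does not assess base quality, which still has to be evaluated by the curator.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds):
    """Return 0-based positions of in-frame stop codons before the last codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3   # ignore trailing partial codon
    last_codon_start = usable - 3      # the final codon may be the real stop
    return [
        i
        for i in range(0, last_codon_start, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


def looks_like_pseudogene(cds):
    """Flag a CDS that is out of frame or interrupted by a premature stop."""
    return len(cds) % 3 != 0 or bool(internal_stops(cds))
```

A sequence flagged here is only a candidate: as the text notes, the /pseudo tag is assigned after the read evidence fails to justify correcting the bases.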

#### **4.4 Tools**

#### **4.4.1 Artemis**

The program Artemis (Berriman & Rutherford, 2003), available for download at http://www.sanger.ac.uk/Software/Artemis, is a freely distributed program developed for visualization of genomes and for annotation and manual curation. Artemis allows the curator to visualize various characteristics of the genome sequences, such as: the product coded by the predicted gene; the presence of tRNAs and rRNAs; searches for protein and nucleotide similarity in biological data banks; probable domains and conserved protein families; GC / AT content and atypical codon usage; and various other functions. These data can be visualized in the six reading frames of translation (Rutherford et al., 2000). Also, the program provides a visualization of BLAST hits between two complete genome sequences, allowing rapid analysis of the degree of synteny (conservation at the level of genes), the main genomic rearrangements and the integration of new genomic islands (Field et al., 2005). Artemis is written in the Java language and is available for the following operating systems: UNIX, Macintosh and Windows. It is capable of processing data in the EMBL and GENBANK formats, or even sequences in FASTA format.
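To make the GC content plot mentioned above more concrete, the sketch below computes GC content in sliding windows along a sequence. It is not Artemis code and does not claim to reproduce its plot; the function names and the default window and step sizes are assumptions chosen for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Return (start, gc_fraction) pairs for windows along the sequence.

    A sequence shorter than the window yields a single value computed
    over the whole sequence.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]
```

Windows whose GC fraction departs strongly from the genome average are the kind of region a curator would inspect more closely, for example when looking for genomic islands.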

#### **4.5 Sequence similarity searches**

#### **4.5.1 BLAST (Basic Local Alignment Search Tool)**

BLAST (Altschul et al., 1990) is a tool that is widely used for the characterization of products coded by genes that are identified by gene prediction. It is able to identify the great majority of the alignments that meet the desired criteria, with a significant gain in performance (Gibas & Jambeck, 2001). This program is available on the site of the National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov (Stein, 2001), which is considered the central databank for genome information. As shown in Table 1, BLAST has programs for alignment of protein and nucleotide sequences, among others, according to the needs of the work that is to be undertaken:


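| Program | Query sequence | Target sequence |
|---|---|---|
| BLASTp | Protein | Protein |
| BLASTn | Nucleotide | Nucleotide |
| BLASTx | Translated nucleotide | Protein |
| TBLASTn | Protein | Translated nucleotide |
| TBLASTx | Translated nucleotide | Translated nucleotide |

Table 1. Types of BLAST – NCBI programs.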

Through this type of algorithm, we can compare any DNA or protein sequence (query) with all of the genome sequences in the public domain (subject) (Altschul et al., 1997). It is important to note that BLAST does not try to compare the full length of the molecules, but rather identifies in the data bank a sequence that is sufficiently similar to the sequence being studied.
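The choice of program in Table 1 depends only on the type of the query and of the target. A minimal lookup, with hypothetical function and argument names, could look as follows; the program names returned are the lower-case names used by the NCBI command-line tools.

```python
# (query type, target type) -> program, following Table 1.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",    # query is translated
    ("protein", "nucleotide"): "tblastn",   # target is translated
}


def pick_blast_program(query_type, target_type, translate_both=False):
    """Return the BLAST program for a query/target pair.

    translate_both selects tblastx, which translates a nucleotide query
    and a nucleotide target before comparing them.
    """
    if translate_both:
        if (query_type, target_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs nucleotide query and target")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, target_type)]
```

For the manual curation described in this chapter, predicted proteins are searched against protein data banks, so `pick_blast_program("protein", "protein")` gives `"blastp"`.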

#### **4.5.2 Interpreting blast results**

In the manual annotation of genomes, analysis of BLAST parameters, such as the raw score, the gap opening and extension penalties, the expectation value (the number of alignments expected by chance with a score equal to or better than that of the alignment being investigated) and the normalized score (bit score), is indispensable for the interpretation of the results. The smaller the value of "E", the smaller the probability that such a match was found merely by chance, which supports homology between the sequence being investigated and the sequence in the data base (Baxevanis & Ouellette, 2001). Among the sequences with identity above 50%, a general approach is to characterize the function of the known sequence and transfer this annotation to the new sequence. Though annotation transfer is a common practice, a high rate of error has been reported when this is done without due caution (Liberman, 2004). Based on this principle, we consider that for sequences with identity above 80%, a simple BLAST alignment or comparison with a protein that has been experimentally characterized can be sufficient to infer function, as long as the pair being compared has similar lengths and aligns end to end without large deletions or insertions. For pairs with identity in the range of 50-80%, the general approach for attributing function includes evaluation of databanks of homologous proteins and protein domain families.

#### **4.5.3 PFAM**

Proteins are generally composed of one or more functional regions, or domains. Different combinations of domains give rise to the large variety of proteins found in nature. Identifying the domains present in a protein can therefore provide insight into its function (Sanger Institute, 2009). For sequences with an identity of less than 70% and without end-to-end similarity, the usual approach is to evaluate the protein domains through a search of the Pfam database, which gives very extensive coverage (Mazumder & Vasudevan, 2008). The Pfam database is accessible via the Web at http://pfam.sanger.ac.uk and is available in various formats for download. This databank contains two complementary groupings: Pfam-A is composed of high-quality protein domains that have been manually verified, while Pfam-B contains data generated automatically from the ProDom databank (Finn et al., 2010). Pfam-B is generally lower in quality, though it can suggest new domains to add to the manual annotation when they are not available in Pfam-A. In Pfam, the sequences in the full alignment of a family are identified by searching a sequence database based on UniProt (UniProt, 2007) with a profile hidden Markov model (HMM) built using the HMMER software. These HMMs are statistical models that capture how strongly each alignment column is conserved and which residues are likely to occur at that position.
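The idea of per-column conservation can be illustrated with a very small sketch. It is not a profile HMM and does not reproduce what HMMER does; it only computes the Shannon entropy of each column of a toy alignment, where low entropy means a well-conserved column. The alignment and the function name `column_entropy` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    0.0 means the column is perfectly conserved; larger values mean
    more variability.
    """
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy alignment: four sequences, four columns, no gaps.
alignment = ["MKVL", "MKIL", "MRVL", "MKAL"]

# zip(*alignment) iterates over the alignment column by column.
entropies = [column_entropy(col) for col in zip(*alignment)]

for position, value in enumerate(entropies, start=1):
    print(f"column {position}: entropy = {value:.3f} bits")
```

In this toy case the first and last columns are invariant, while the third column is the most variable. A real profile HMM additionally models insertions, deletions and position-specific residue probabilities.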

#### **5. Genomics**

A genome is the complete set of DNA sequences of a living organism; it consists of coding and non–coding sequences. Genomics is a discipline of genetics that deals with genomes or DNA sequences. Simply put, genomics is the study of genomes. Computational genomics derives knowledge from genome sequences and related data, including both DNA and RNA sequences as well as experimental data. Computational biology mainly deals with whole genome analysis to understand the DNA mechanisms and molecular biology of a species. As biological datasets are extremely large, computational biology has become an important part of modern biology.

#### **5.1 Pangenomics**

The efficient, low-cost sequencing technologies that are currently available provide complete genome sequences of pathogenic, industrially useful, and other economically important organisms. Genome sequences, and the information encoded in them, can help identify pathogenicity genes and other important genes.

Complete genomic sequences of various strains of a species are important to help us understand pathogenesis mechanisms and to determine how genetic variability affects pathogenesis; it would be difficult to extract such useful information from a single genome (Lefébure & Stanhope, 2007).

A pangenome consists of a "core genome", which contains the genes or sequences present in all strains; in other words, the genes found in all the genomes of a bacterial species make up the core genome. A "dispensable genome", or "accessory genome", consists of sequences that are present in two or more strains but not in all of them, and so are not part of the core genome. "Unique genomic sequences" or "unique genes" are strain-specific genes, limited to a single strain. The pangenome is important for identification and for designing effective vaccines and drug targets (Mira et al., 2010).

Many web tools and software packages are available to manage and efficiently extract data from the genomes of various strains of the same species. These tools recognize the accession numbers allotted to complete genomes submitted to NCBI and to other databanks. EDGAR ("Efficient Database framework for comparative Genome Analyses using BLAST score Ratios"; http://edgar.cebitec.uni-bielefeld.de), an online tool developed by the Computational Genomics group of Bielefeld University, Germany, efficiently determines the core genome, along with the dispensable and unique genes, and presents them as colored graphs and tables (Blom et al., 2009).

For example, we used EDGAR to analyze the core genome, dispensable genes and unique genes of three different *Corynebacterium pseudotuberculosis* strains: *C. pseudotuberculosis* Cp-I19, *C. pseudotuberculosis* Cp1002 and *C. pseudotuberculosis* CpC231.

The core genome consists of 1,862 genes, with 48 dispensable genes between Cp-I19 and Cp1002, 52 dispensable genes between Cp-I19 and CpC231, and 103 dispensable genes between Cp1002 and CpC231. There were 208, 46 and 36 unique genes in strains Cp-I19, Cp1002 and CpC231, respectively.
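The partition into core, dispensable and unique genes can be sketched with plain set operations. The example below is only an illustration under simplifying assumptions: it assumes that genes have already been grouped into shared gene-family identifiers (EDGAR derives such groupings from BLAST score ratios, which is not reproduced here), and the strain names, gene identifiers and the function `partition_pangenome` are hypothetical.

```python
from itertools import combinations

def partition_pangenome(strains):
    """Split gene families into core, dispensable and unique sets.

    `strains` maps a strain name to the set of gene-family identifiers
    found in that strain.
    """
    gene_sets = list(strains.values())
    pangenome = set().union(*gene_sets)
    # Core genome: families present in every strain.
    core = set.intersection(*gene_sets)
    # Number of strains in which each family occurs.
    occurrences = {g: sum(g in s for s in gene_sets) for g in pangenome}
    # Unique genes: families found in exactly one strain.
    unique = {name: {g for g in genes if occurrences[g] == 1}
              for name, genes in strains.items()}
    # Dispensable genome: present in at least two strains, but not all.
    dispensable = {g for g, n in occurrences.items()
                   if 1 < n < len(gene_sets)}
    # Pairwise view: families shared by a pair of strains, outside the core.
    shared_by_pair = {(a, b): (strains[a] & strains[b]) - core
                      for a, b in combinations(strains, 2)}
    return core, dispensable, unique, shared_by_pair

# Hypothetical toy data, not the C. pseudotuberculosis results.
toy = {
    "strainA": {"g1", "g2", "g3", "g5"},
    "strainB": {"g1", "g2", "g4"},
    "strainC": {"g1", "g3", "g4", "g6"},
}
core, dispensable, unique, shared_by_pair = partition_pangenome(toy)
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))
print("unique:", {k: sorted(v) for k, v in unique.items()})
```

With three strains, as in the analysis above, every dispensable gene is shared by exactly one pair of strains, which is why the counts are reported pair by pair.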

#### **6. Genome plasticity**

The high adaptability of bacteria to a wide range of environments and hosts has long been known to be influenced by genome plasticity, a dynamic property that involves DNA gain, loss and rearrangement (Maurelli et al., 1998). Various mechanisms can drive these changes, including point mutations, gene conversions, rearrangements (inversion or translocation), deletions and DNA insertions from other organisms (plasmids, bacteriophages, transposons, insertion elements and genomic islands) (Schmidt & Hensel, 2004).

#### **6.1 Plasmids**

Plasmids contribute to genomic plasticity through their transfer capability. They are also able to mobilize co-resident plasmids and integrate into the chromosome. Plasmids may harbor antibiotic resistance genes and other genes associated with pathogenicity (Dobrindt & Hacker, 2001); e.g., *Rhodococcus equi* harbors a virulence plasmid, absent in avirulent strains, that codes for surface-associated proteins (vap genes) (Takai et al., 2000).

#### **6.2 Bacteriophages**

Bacteriophages are viruses that infect bacteria and influence genome plasticity through transduction mechanisms. Functional phages inject DNA from one bacterium into another without causing damage to the acceptor organism; the DNA can be incorporated into the acceptor genome, leading to adaptive changes. Additionally, prophages (viral DNA incorporated in the bacterial chromosome) confer protection against lytic infections and can harbor virulence genes that may be acquired by the acceptor bacterium and directly affect its pathogenicity; this has been reported for various species, including Clostridium botulinum, Streptococcus pyogenes, Staphylococcus aureus, Escherichia coli and C. diphtheriae (Brüssow et al., 2004).

#### **6.3 Genomic islands**

Genomic islands (GEIs) affect genome plasticity because of their mobility and their capability of carrying a large number of genes as a single block, including operons and groups of coding genes with related functions. These GEIs can cause dramatic changes that lead the acceptor bacterium to evolve very rapidly compared to wild-type counterparts. GEIs are characterized as large DNA regions acquired from other organisms. They vary in size (10-200 kb) and can harbor sequences derived from phages and/or plasmids, including integrase genes; GEIs are flanked by tRNA genes or direct repeats, which contribute to their characteristic instability (Hacker & Carniel, 2001). The instability of GEIs is exemplified by rapid gene acquisition and/or loss and changes in gene composition, as seen in different strains of Burkholderia pseudomallei (Tumapa et al., 2008). Additionally, GEIs can be classified into several classes according to gene content. These include Symbiotic Islands, which are involved in the association of bacteria with Leguminosae hosts (Barcellos et al., 2007); Resistance Islands, which harbor genes related to antibiotic resistance (Krizova & Nemec, 2010); Metabolic Islands, which contain genes associated with secondary metabolite biosynthesis (Tumapa et al., 2008); and Pathogenicity Islands (PAIs), which have a high concentration of virulence genes. PAIs are associated with pathogenic bacteria and have been implicated in the reemergence of various pathogens as causes of serious disease problems (Dobrindt et al., 2000). The first description of a PAI was made in 1990, in vitro (Hacker et al., 1990). The identification was based on the observation of a close relation between the deletion of hemolysin and fimbrial adhesin coding regions and non-pathogenic strains of E. coli. This was investigated by gene cloning, pulsed-field gel electrophoresis and Southern hybridization. Using these procedures, the authors showed that the hemolysin and fimbrial adhesin coding genes are located in the same chromosomal region in several wild-type strains of E. coli and that they undergo deletion events both in vivo and in vitro (Hacker et al., 1990).

#### **6.4 "Black Holes"**

Additionally, it is important to keep in mind that gene deletion is just as important as gene acquisition in some organisms. One example is the so-called "Black Holes", or deletion events of "antivirulence" genes, i.e. genes whose expression in pathogenic organisms is incompatible with virulence. The concept of evolution through deletion of "antivirulence" genes is based on the premise that genes required for the adaptation of an organism to one specific niche may inhibit adaptability in another niche, for example a potential host (Maurelli, 2007). In E. coli, loss of cadA, the lysine decarboxylase (LDC) coding gene, and of ompT, which encodes an outer membrane protein, may confer virulence (Suzuki & Sasakawa, 2001). The mechanism of action of cadaverine, produced by decarboxylation of lysine by LDC, is still unknown. However, there are two hypotheses: cadaverine inactivates the synthesized toxin, or cadaverine acts directly on the target cell to protect it. Maurelli et al. (1998) demonstrated that when rabbit mucosal cells are pre-treated with cadaverine and then washed, they are protected from enterotoxin effects. Absence of OmpT in Shigella strains and enteroinvasive E. coli strains is crucial for maintaining VirG on the cell surface, a prerequisite for motility in mammalian cells, including bacterial dispersion through epithelial cells (Suzuki & Sasakawa, 2001).

#### **6.5 Software to identify horizontal gene transfer (HGT) events**

Gene acquisition and loss through HGT influence bacterial lifestyles and their physiological versatility (Dobrindt & Hacker, 2001). The increasing number of complete genome sequences available for analysis has stimulated in silico research aimed at identifying HGT events. Horizontally acquired regions can be identified by observing G+C content and codon usage patterns, which differ among species and species groups. Sets of genes acquired by HGT events show deviations in these patterns that reflect the genomic signature of the donor genome (Langille et al., 2008). Various software tools can be used to identify HGT events based on base composition patterns (wavelet analysis of G+C content, cumulative GC profile, P-web, IVOM, IslandPath and PAI-IDA) and codon usage deviation (SIGI-HMM and PAI-IDA). However, because acquired sequences adapt over time to the codon usage of their new host (Karlin et al., 1998), tending towards a homogeneous base composition (Hershberg & Petrov, 2009), identification of mobile regions based on genomic signature is only possible for regions that have recently been acquired from phylogenetically distant organisms, i.e. those that have a discrepant genomic signature when compared to the acceptor genome.
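The base composition criterion can be sketched as a simple sliding-window scan. The example below is a minimal illustration, not the algorithm of any of the tools listed above: it flags windows whose G+C fraction deviates from the genome-wide value by more than a threshold. The function names, window size, step, threshold and the toy sequence are all assumptions made for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_windows(genome, window=1000, step=500, threshold=0.1):
    """Return the genome-wide G+C fraction and the list of windows
    (start, end, gc) whose G+C fraction differs from it by more than
    `threshold`. Coordinates are 0-based, end-exclusive."""
    genome_gc = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        end = start + window
        window_gc = gc_fraction(genome[start:end])
        if abs(window_gc - genome_gc) > threshold:
            hits.append((start, end, window_gc))
    return genome_gc, hits

# Hypothetical toy "genome": a 50% G+C background with a 100 bp
# AT-only insert standing in for a recently acquired region.
genome = "ATGC" * 250 + "ATAT" * 25 + "ATGC" * 250
genome_gc, hits = atypical_windows(genome, window=100, step=100,
                                   threshold=0.2)
print(f"genome G+C = {genome_gc:.3f}")
for start, end, gc in hits:
    print(f"atypical window {start}-{end}: G+C = {gc:.3f}")
```

In this toy case only the window covering the AT-only insert is reported. On real genomes the choice of window size and threshold strongly affects the result, and regions whose composition has already adapted to the host would not be detected by such a scan.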

Additionally, identification of HGT events may be aided by concentrating on regions that are flanked by tRNA genes, which are "hot spots" for transfer elements since they possess 3'-terminal insertion sequences that are recognized by various integrases (Hou, 1999). The integration of PAIs into these insertion sequences is responsible for their instability, since a single integrase may cause excision of the entire region. Insertion/deletion events have been demonstrated in PAIs I and II of E. coli strain 536, which are flanked by selC and leuX tRNA genes (Blum et al., 1994), and in high pathogenicity islands (HPIs) of several *Yersinia pseudotuberculosis* and *Y. pestis* strains (Lesic et al., 2004), which frequently insert into ASN3 tRNA genes.

Although efficient in the identification of HGT events, approaches based on genomic signature and flanking tRNAs are not aimed at the classification of GEIs, since they do not consider the overall gene content of the region. Additionally, horizontally acquired regions may deviate in either G+C content or codon usage alone, which is a problem if only one of these features is used to identify the event. There are, however, tools designed to identify a specific class of GEIs, pathogenicity islands, through a multi-pronged strategy that overcomes such constraints. These tools are PredictBias (Pundhir et al., 2008), IslandViewer (Waack et al., 2006) and PIPS (unpublished); they look for genomic signature deviations that are not found in closely related organisms and for genes coding for virulence factors. Although all of these programs use similar strategies and are complementary, PIPS deserves special attention since it surpasses the others in accuracy and is easy to install.

In analysis of C. diphtheriae strain NCTC 13129, PIPS outperformed the other approaches, identifying 12 out of the 13 PAIs of the reference strain, compared to 10 by IslandViewer and six by PredictBias. In the identification of PAIs of uropathogenic E. coli strain CFT073, PIPS had an overall accuracy of 93.9% (unpublished) against 89.5% for IslandViewer and 88.1% for PredictBias.

#### **7. Reverse vaccinology**

Reverse Vaccinology (RV) (Rappuoli, 2000) starts from the genomic sequence of a pathogen, which is expected to encode all the genes that may be expressed during the life cycle of the pathogen. All open reading frames (ORFs) derived from the genome sequence can be evaluated with a computer program in order to determine their aptitude as vaccine candidates. Special attention is given to exported proteins because they are essential in host-pathogen interactions. Examples of such interactions include: (i) adherence to host cells, (ii) invasion of compliant cells, (iii) damage to host tissues, (iv) resistance to environmental stress imposed by the defense machinery of the infected cell and, finally, (v) mechanisms for subversion of the host immune response (Sibbald & van Dij, 2009). The word "Reverse" in RV can be explained by the reverse genetics (RG) technique. Before the dawn of genomics, there were attempts to discover the genes responsible for each phenotype. With Crick's central dogma (DNA > RNA > Protein) the research path was reversed. Starting from the likely gene sequence, several techniques have been developed to identify changes in the phenotype of an organism that derive from sequence changes in genes. The principle of Crick's dogma is also used by RV; when a gene sequence is found, one can determine whether a probable protein encoded by this sequence can be an antigen capable of stimulating an immune response in a host organism.

Long before the creation of the term RV, a number of approaches had been considered to determine exported proteins in order to move to the next step of the production of a subunit vaccine (Diaz Romero & Outschoorn, 1994). For example, research on exported proteins was advanced as an alternative to subunit vaccines based on the polysaccharide capsule of meningococci. Vaccines produced with such antigens had a low capacity to induce a satisfactory immune response. This research effort on exported proteins includes almost two decades of work searching for a vaccine against meningococcal serogroup B, which now gives good results. This vaccine currently is the best RV alternative for the production of a subunit vaccine for Neisseria meningitidis serogroup B. Meningitis caused by serogroup B (Men B) is responsible for approximately half of the worldwide incidence of this disease (Diaz Romero & Outschoorn, 1994), and this research result for targeted vaccination is commonly used as a demonstration of the usefulness of RV, because of the excellent results.

Currently, a subunit vaccine against Men B created with antigens targeted by RV is being tested in phase-2 clinical trials (Bambini & Rappuoli, 2009). The advantages of RV continue to be attractive, enabling vaccine research for organisms whose cultivation in the laboratory is difficult or impossible. Reducing the time needed to select target proteins could allow different species or strains to be investigated at the same time when selecting vaccine candidates that can elicit adaptive immune responses. To achieve these benefits, all that is needed is a sequenced genome, a personal computer and core software that is widely available to the scientific community. These conditions demonstrate another advantage of RV: its low cost. What we call core software is a set of tools for identifying well-known motifs, such as SignalP, TMHMM, LipoP and HMMSEARCH. There is still room for innovation in the use of core software; the choice of software strategies can be directed to the identification of vaccine candidates specific to an organism, as in the case of Gram-negative (bilayer) or Gram-positive (monolayer) bacteria, or according to heuristics for the selection of vaccine candidates with specific characteristics, for example proteins located in the membrane or exported to the extracellular environment (Barinov et al., 2009).

The concept of RV was adapted to fit a new reality of widespread availability of genomic data (Rinaudo et al., 2009). Instead of researching vaccine targets for a single strain or subspecies of an organism, we can do it simultaneously in dozens of genomes, exploring potential joint antigens or those exclusive to multiple genomes (Lapierre & Gogarten, 2009). The possibility of having a large number of genomes available to implement RV leads to the

can be evaluated with a computer program in order to determine their aptitude as vaccine candidates. Special attention is given to exported proteins because they are essential in host-pathogen interactions. Examples of such interactions include: (i) adherence to host cells, (ii) invasion of compliant cells, (iii) damage to host tissues, (iv) resistance to environmental stress imposed by the defense machinery of the infected cell and, finally, (v) mechanisms for subversion of the host immune response (Sibbald & van Dij, 2009). The word "Reverse" in RV can be explained by the reverse genetics (RG) technique. Before the dawn of genomics, research started from a phenotype and attempted to discover the genes responsible for it. With Crick's central dogma (DNA > RNA > Protein), the research path was reversed: starting from the likely gene sequence, several techniques have been developed to identify changes in the phenotype of an organism that derive from sequence changes in its genes. The principle of Crick's dogma is also used by RV; when a gene sequence is found, one can determine whether the probable protein encoded by this sequence can be an antigen capable of stimulating an immune response in a host organism.

Long before the creation of the term RV, a number of approaches had been considered to identify exported proteins in order to move to the next step, the production of a subunit vaccine (Diaz Romero & Outschoorn, 1994). For example, research on exported proteins was advanced as an alternative to subunit vaccines based on the polysaccharide capsule of meningococci, because vaccines produced with such antigens had a low capacity to induce a satisfactory immune response. This research effort on exported proteins includes almost two decades of work searching for a vaccine against meningococcal serogroup B, which now gives good results. This vaccine is currently the best RV alternative for the production of a subunit vaccine for *Neisseria meningitidis* serogroup B. Meningitis caused by serogroup B (Men B) is responsible for approximately half of the worldwide incidence of this disease (Diaz Romero & Outschoorn, 1994), and, because of its excellent results, this targeted vaccination research is commonly used as a demonstration of the usefulness of RV.

Currently, a subunit vaccine against Men B created with antigens targeted by RV is being tested in phase-2 clinical trials (Bambini & Rappuoli, 2009). The advantages of RV continue to be attractive, since it enables vaccine research for organisms whose cultivation in the laboratory is difficult or impossible. Reducing the time needed to select target proteins could allow different species or strains to be investigated at the same time when selecting vaccine candidates that can elicit adaptive immune responses. All that is needed to achieve these benefits is a sequenced genome, a personal computer and core software that is widely available to the scientific community. These modest requirements demonstrate another advantage of RV, its low cost. What we call core software is a set of tools for identifying well-known motifs, such as SignalP, TMHMM, LipoP, and HMMSEARCH. There is still room for innovation in the use of core software; the choice of software strategies can be directed to the identification of vaccine candidates specific to an organism, such as Gram-negative (bilayer) or Gram-positive (monolayer) bacteria, or can follow heuristics for selecting vaccine candidates with specific characteristics, for example, proteins located in the membrane or exported to the extracellular environment (Barinov et al., 2009).
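The kind of selection heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ProteinFeatures` record, its field names and the decision rules are hypothetical, and the flags are assumed to have been parsed beforehand from the output of SignalP-, LipoP- and TMHMM-style predictors rather than obtained by calling those tools.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ProteinFeatures:
    """Hypothetical per-protein summary, assumed to be parsed from predictor output."""
    protein_id: str
    has_signal_peptide: bool     # e.g. from a SignalP-style prediction
    is_lipoprotein: bool         # e.g. from a LipoP-style prediction
    transmembrane_helices: int   # e.g. from a TMHMM-style prediction


def classify_location(p: ProteinFeatures) -> str:
    """Toy decision rules; a real pipeline would use organism-specific heuristics."""
    # Two or more helices are required here as an illustrative cut-off.
    if p.transmembrane_helices >= 2:
        return "membrane"
    if p.is_lipoprotein:
        return "lipoprotein"
    if p.has_signal_peptide:
        return "secreted"
    return "cytoplasmic"


def vaccine_candidates(proteins: Iterable[ProteinFeatures]) -> List[str]:
    """Keep the IDs of proteins predicted to be membrane-associated or exported."""
    return [p.protein_id for p in proteins if classify_location(p) != "cytoplasmic"]


if __name__ == "__main__":
    demo = [
        ProteinFeatures("prot_1", True, False, 0),
        ProteinFeatures("prot_2", False, False, 0),
        ProteinFeatures("prot_3", False, False, 3),
    ]
    print(vaccine_candidates(demo))
```

Changing the rules in `classify_location` is where an organism-specific strategy (for example, different handling of Gram-negative and Gram-positive bacteria) would be encoded.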

The concept of RV was adapted to fit a new reality of widespread availability of genomic data (Rinaudo et al., 2009). Instead of researching vaccine targets for a single strain or subspecies of an organism, we can do it simultaneously in dozens of genomes, exploring potential antigens that are shared by multiple genomes or exclusive to some of them (Lapierre & Gogarten, 2009). The possibility of having a large number of genomes available to implement RV leads to the emergence of the concept of pangenomic RV (PGRV) (Bambini & Rappuoli, 2009). PGRV can also apply the concepts of core, extended, and character genomes. The core genome in PGRV is composed of exported genes (genes that encode exported proteins) that are common to all strains and could therefore be candidates for a universal vaccine. The extended genome consists of genes that are absent in at least one of the strains of the studied species, and the character genome consists of genes that are specific to a single strain (Lapierre & Gogarten, 2009). From the standpoint of vaccines, the core and character genomes would be good candidates to develop a vaccine that is suitable for all strains, without losing sight of the particularities of specific genes in each strain.
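The three gene sets defined above can be written as simple set operations. The sketch below is a minimal illustration, assuming that the exported genes of each strain have already been clustered into gene families with shared identifiers; the function name and the input layout are hypothetical.

```python
from typing import Dict, Set, Tuple


def partition_pangenome(
    strain_genes: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Split gene families into core, extended and character (strain-specific) sets.

    strain_genes maps a strain name to the set of gene-family IDs found in it.
    """
    if not strain_genes:
        raise ValueError("at least one strain is required")
    all_sets = list(strain_genes.values())
    pan = set().union(*all_sets)            # every family seen in any strain
    core = set.intersection(*all_sets)      # present in all strains
    extended = pan - core                   # absent from at least one strain
    character = {}
    for strain, genes in strain_genes.items():
        others = set().union(
            *(g for s, g in strain_genes.items() if s != strain)
        )
        character[strain] = genes - others  # found in this strain only
    return core, extended, character


if __name__ == "__main__":
    demo = {
        "strain_A": {"g1", "g2", "g3", "g6"},
        "strain_B": {"g1", "g2", "g4", "g6"},
        "strain_C": {"g1", "g2", "g5"},
    }
    core, extended, character = partition_pangenome(demo)
    print(sorted(core), sorted(extended), character)
```

In this formulation the character genes of each strain are a subset of the extended genome, which follows directly from the definitions given in the text.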

#### **8. Immunoinformatics**

The immune system has considerable diversity in its components, such as immunoglobulin receptors of lymphocytes and cytokines, with the principal cell types being B and T cells, which have important roles in inflammation, infection and protection (Evans, 2008). Immunology is highly complex and can be characterized as a combinatorial science, since it involves intricate regulatory cycles and network-type interactions; this makes it possible to use computational models to address problems whose solutions can be translated into biologically significant responses (Brusic & Petrovsky, 2003). This leads us to immunoinformatics, which is the application of informatics techniques to immune system molecules, with the main objective of helping develop vaccines through the prediction of immunogenic epitopes (Flower & Doytchinova, 2002).

#### **8.1 Immunological databases**

Immunological databases are a source of data used to explore, refine and develop new tools and algorithms (Salimi et al., 2010). There is a large variety of databases that group information relevant to the immune system. The Nucleic Acids Research Molecular Biology Database Collection (http://www3.oup.co.uk/nar/database/c/) included 29 immunological databases in March 2011. The International ImMunoGeneTics information system (IMGT), the world reference databank for immunogenetics and immunoinformatics, was created by Marie-Paule Lefranc in 1989 (Lefranc et al., 2009). This databank specializes in immunoglobulins or antibodies, T-cell receptors (TCR), MHCs, and others. IMGT comprises a variety of databanks, including structure, monoclonal antibody, sequence and genome databanks. All of these databanks are curated manually and daily by a full-time team, which helps maintain high-quality annotation and standardization of the information. Other databases that house information related to epitopes, such as AntiJen (Toseland et al., 2005) and FIMM (Schonbach et al., 2000), have not been maintained and their data have migrated to other websites. Among these, the most promising epitope database seems to be the Immune Epitope Database (IEDB) (Peters et al., 2005), a curated database whose information is based on experimental data associated with the target epitope; consequently, it is hoped that all of the information in the various existing databanks will also migrate to IEDB within the next few years.

#### **8.2 Epitope prediction**

The principal goal of immunoinformatics is the development of algorithms that can both help develop vaccines and analyze the gene products of pathogens, such as viruses and bacteria. This is why it is very important to understand antigen-antibody interactions. Macallum et al. (1996) made a detailed analysis of 26 antigen-antibody complexes; they found that binding between molecules is very complex, and that there are different antibody-antigen classes for different types of molecules. A later study of 59 antigen-antibody interactions (Almagro, 2004) found results similar to those of Macallum et al. These studies show that tools that can identify molecules and predict their interactions with other molecules need to be very accurate and sensitive.

#### **8.2.1 B cell epitope prediction**

B-cell epitopes are antigenic regions that are recognized by antibodies of the immune system, specifically those that interact with B-cell receptors. These epitopes can be continuous or discontinuous (Kumagai & Tsumoto, 2001). B-cell epitopes can be used to design vaccines and new diagnostic tests (Larsen et al., 2006). As with T cells, there are numerous methodologies to model and predict B-cell epitopes. The classic system to predict B-cell epitopes (Hopp & Woods, 1981) uses propensity scale methods (Parker et al., 1986; Levitt, 1978). This method assigns a propensity value to each amino acid, based on studies of its physicochemical properties. A combination of various scales can improve the prediction results (Pellequer et al., 1991); that work used hydrophilicity scales (Parker et al., 1986), as well as secondary structure (Levitt, 1978; Chou & Fasman, 1978) and accessibility (Emini et al., 1985). The Immune Epitope Database and Analysis Resource, IEDB (Peters et al., 2005), uses parameters such as hydrophilicity, flexibility, accessibility, turns, exposed surface, polarity and antigenic propensity of polypeptide chains, which have been correlated with the location of continuous epitopes. All of these prediction calculations are based on propensity scales. Another methodology that can be used to predict continuous B-cell epitopes combines a hidden Markov model (HMM) with propensity scale methods (Parker et al., 1986; Levitt, 1978); it is called BepiPred (http://www.cbs.dtu.dk/services/BepiPred/) (Larsen et al., 2006). This methodology has given increased prediction accuracy. Prediction of discontinuous B-cell epitopes has also improved, due to an increase in the number of three-dimensional (3D) structures of antibody-antigen complexes available in PDB, in IMGT/3Dstructure-DB (Kaas et al., 2004) and in the Epitome database (Schlessinger et al., 2006).
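A propensity scale method can be illustrated with a short sliding-window calculation. The sketch below is not the implementation used by IEDB or BepiPred; it assumes the hydrophilicity values commonly tabulated for the Hopp & Woods scale and a window of six residues, both of which should be checked against the original publication before any real use. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hydrophilicity values as commonly tabulated for the Hopp & Woods scale
# (assumption: verify against the original paper before relying on them).
HOPP_WOODS: Dict[str, float] = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def window_scores(
    sequence: str, scale: Dict[str, float] = HOPP_WOODS, window: int = 6
) -> List[float]:
    """Average the per-residue propensity over every window of the sequence."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[aa] for aa in seq]  # KeyError on non-standard residues
    return [
        sum(values[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def top_windows(
    sequence: str, n: int = 3, window: int = 6
) -> List[Tuple[int, str, float]]:
    """Return the n highest-scoring windows as (0-based start, peptide, score)."""
    scores = window_scores(sequence, window=window)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, sequence[i:i + window], round(scores[i], 2)) for i in ranked[:n]]


if __name__ == "__main__":
    for start, peptide, score in top_windows("MLLLLLLKDEKRDAAVW", n=2):
        print(start, peptide, score)
```

Combining several scales, as described above, would amount to computing one such profile per scale and merging them, for example by a weighted average; the weights would have to be chosen and validated separately.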

#### **8.2.2 T cell epitope prediction**

There are two classes of T cells: (1) CD8+ cytotoxic T (Tc) cells, which produce cytotoxins responsible for cell lysis and recognize peptides presented by MHC class I, and (2) CD4+ helper T (Th) cells, which recognize peptides associated with MHC class II. Interferon γ (IFN-γ) and tumor necrosis factor β (TNF-β) are produced by Th1 cells. Th2 cells produce interleukin 4 (IL-4), IL-5, IL-10 and IL-13. Epitopes that bind to MHC class I generally are 8–10 amino acids long, with a mean of nine amino acids (Reche et al., 2002), while epitopes that bind to MHC class II are 13–17 amino acids long (Sercarz & Maverakis, 2003; Chicz et al., 1992). There are various online tools for predicting T-cell epitopes on the basis of MHC class I and class II binding. Prediction of MHC binding is based on motifs associated with epitopes or binders for specific alleles. SYFPEITHI is a tool that is widely used for prediction of T-cell epitopes and MHC binding; however, motif-based predictions of this kind have been found to be of low quality (Ruppert et al., 1993). More sophisticated tools that use quantitative matrices, artificial neural networks, decision trees, hidden Markov models (HMM), support vector machines (SVM), homology modeling, protein threading and docking techniques have been developed.

The NetMHC 3.2 server (http://www.cbs.dtu.dk/services/NetMHC/) predicts binding of peptides to a series of different HLA alleles using artificial neural networks (ANNs) and weight matrices. All of the previous versions are available online, for comparison and reference. ANNs were trained with 57 different human MHC (HLA) alleles, representing all 12 HLA-A and HLA-B supertypes (Lund et al., 2004). Predictions are also available for 22 animal alleles (monkey and mouse). ANN prediction values are given as IC50 values in nM. Weight matrix predictions use a score, with a high score indicating strong binding. Predictions can be made for peptide lengths from 8 to 11 for all of the alleles using an algorithm based on ANNs trained with 9-mer peptides. Probably because of the limited quantity of 10-mer data available, this approach has better predictive value than ANNs trained with 10-mer data. However, one should be careful with 8-mer predictions, since some alleles do not bind 8-mers to a significant degree. Binding peptides are indicated in the output as strongly binding (SB) or weakly binding (WB). The allele for each HLA supertype is indicated in the selection window for HLA alleles (Lundegaard et al., 2008).
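Before any binding predictor is applied, a protein has to be cut into the overlapping peptides that will be scored. The sketch below shows only that preparation step for the peptide lengths mentioned above (8 to 11 residues); it does not call NetMHC, and the function name and the 1-based position convention are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def enumerate_peptides(
    protein: str, lengths: Iterable[int] = range(8, 12)
) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start position, peptide) for every overlapping window.

    Lengths longer than the protein simply produce no peptides.
    """
    for k in lengths:
        for i in range(len(protein) - k + 1):
            yield i + 1, protein[i:i + k]


if __name__ == "__main__":
    # A short made-up sequence, used only to show the shape of the output.
    for start, peptide in enumerate_peptides("ACDEFGHIKLMN"):
        print(start, len(peptide), peptide)
```

A protein of length L yields L - k + 1 peptides for each length k, so the number of candidates to score grows roughly linearly with protein length for a fixed set of lengths.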

The NetMHCII 2.2 server (http://www.cbs.dtu.dk/services/NetMHCII/) predicts peptides that bind to the MHC class II alleles HLA-DR, HLA-DQ and HLA-DP, and to mouse alleles, using ANNs. Predictions can be obtained for 14 HLA-DR alleles covering the nine HLA-DR supertypes, for six HLA-DQ and six HLA-DP alleles, and for two H2 class II alleles in mice. The prediction values are given as IC50 values in nM and as a %-Rank relative to a set of 1,000,000 random natural peptides. Strongly and weakly binding peptides are indicated in the output file (Nielsen et al., 2007).
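How an IC50 value in nM translates into a strong or weak binder label can be shown with a small helper. The cut-offs used here (below 50 nM for strong and below 500 nM for weak binders) and the logarithmic transformation 1 - log(IC50)/log(50000) are widely used conventions that are assumed for this sketch; they are not stated in the text above, and the function names are hypothetical.

```python
import math

# Assumed conventional cut-offs, in nM; adjust to the predictor actually used.
STRONG_BINDER_NM = 50.0
WEAK_BINDER_NM = 500.0


def binding_level(ic50_nm: float) -> str:
    """Label a predicted affinity as 'SB', 'WB' or '' (no binder label)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    if ic50_nm < STRONG_BINDER_NM:
        return "SB"
    if ic50_nm < WEAK_BINDER_NM:
        return "WB"
    return ""


def ic50_to_score(ic50_nm: float, max_nm: float = 50000.0) -> float:
    """Map an IC50 to a score where 1 nM gives 1.0 and max_nm gives 0.0."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 1.0 - math.log(ic50_nm) / math.log(max_nm)


if __name__ == "__main__":
    for value in (12.0, 230.0, 4800.0):
        print(value, binding_level(value) or "-", round(ic50_to_score(value), 3))
```

Lower IC50 values mean tighter binding, so the transformed score increases as the IC50 decreases; this is why a threshold on the score and a threshold on the IC50 point in opposite directions.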

Without a doubt, there is a great variety of predictors, which, when combined, can be quite precise in the prediction of T-cell epitopes; however, this is only possible when well-characterized alleles are available, which is true for some MHC class I alleles but much less so for MHC class II alleles. This is even more of a problem in the prediction of B-cell epitopes, for which it is often necessary to have prior knowledge of the structure and sequence of the protein. Nevertheless, it is known that no method can go further than the data used to train it, and only through extensive compilation and by obtaining high-quality data will it be possible to create excellent models that can be generally applied (Flower & Doytchinova, 2002).

#### **9. References**


Allen, J. E., Pertea, M. & Salzberg, S. L. 2004. Computational gene prediction using multiple sources of evidence, *Genome Res* 14(1): 142–8.

Almagro, J. C. 2004. Identification of differences in the specificity-determining residues of antibodies that recognize antigens of different size: implications for the rational design of antibody repertoires, *J Mol Recognit* 17(2): 132–43.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. 1990. Basic local alignment search tool, *J Mol Biol* 215(3): 403–10.

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, *Nucleic Acids Res* 25(17): 3389–402.

Aparicio, G., Götz, S., Conesa, A., Segrelles, D., Blanquer, I., García, J. M., Hernandez, V., Robles, M. & Talon, M. 2006. Blast2GO goes grid: developing a grid-enabled prototype for functional genomics analysis, *Stud Health Technol Inform* 120: 194–204.

Bambini, S. & Rappuoli, R. 2009. The use of genomics in microbial vaccine development, *Drug Discov Today* 14(5-6): 252–60.

Barcellos, F. G., Menna, P., da Silva Batista, J. S. & Hungria, M. 2007. Evidence of horizontal transfer of symbiotic genes from a Bradyrhizobium japonicum inoculant strain to indigenous diazotrophs Sinorhizobium (Ensifer) fredii and Bradyrhizobium elkanii in a Brazilian savannah soil, *Appl Environ Microbiol* 73(8): 2635–43.

Barinov, A., Loux, V., Hammani, A., Nicolas, P., Langella, P., Ehrlich, D., Maguin, E. & van de Guchte, M. 2009. Prediction of surface exposed proteins in Streptococcus pyogenes, with a potential application to other Gram-positive bacteria, *Proteomics* 9(1): 61–73.

Baxevanis, A. D. & Ouellette, F. F. 2001. A practical guide to the analysis of genes and proteins, *Wiley* (2): 260–2.

Bendtsen, J. D., Nielsen, H., von Heijne, G. & Brunak, S. 2004. Improved prediction of signal peptides: SignalP 3.0, *J Mol Biol* 340(4): 783–95.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. 2008. GenBank, *Nucleic Acids Res* 36(Database issue): D25–30.

Berriman, M. & Rutherford, K. 2003. Viewing and annotating sequence data with Artemis, *Brief Bioinform* 4(2): 124–32.

Blom, J., Albaum, S. P., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M. & Goesmann, A. 2009. EDGAR: a software framework for the comparative analysis of prokaryotic genomes, *BMC Bioinformatics* 10: 154.

Blum, G., Ott, M., Lischewski, A., Ritter, A., Imrich, H., Tschäpe, H. & Hacker, J. 1994. Excision of large DNA regions termed pathogenicity islands from tRNA-specific loci in the chromosome of an Escherichia coli wild-type pathogen, *Infect Immun* 62(2): 606–14.

Brown, T. A. 1999. Genes e expressão gênica, *Genética – um enfoque molecular* 1(2): 124–132.

Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S. & Kahn, D. 2005. The ProDom database of protein domain families: more emphasis on 3D, *Nucleic Acids Res* 33(Database issue): D212–5.

Brusic, V. & Petrovsky, N. 2003. Immunoinformatics–the new kid in town, *Novartis Found Symp* 254: 3–13; discussion 13–22, 98–101, 250–2.

Brüssow, H., Canchaya, C. & Hardt, W.-D. 2004. Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion, *Microbiol Mol Biol Rev* 68(3): 560–602.

Chicz, R. M., Urban, R. G., Lane, W. S., Gorga, J. C., Stern, L. J., Vignali, D. A. & Strominger, J. L. 1992. Predominant naturally processed peptides bound to HLA-DR1 are derived from MHC-related molecules and are heterogeneous in size, *Nature* 358(6389): 764–8.

Choi, G.-E., Eom, S.-H., Jung, K.-H., Son, J.-W., Shin, A.-R., Shin, S.-J., Kim, K.-H., Chang, C. L. & Kim, H.-J. 2010. CysA2: A candidate serodiagnostic marker for Mycobacterium tuberculosis infection, *Respirology* 15(4): 636–42.

Chou, P. Y. & Fasman, G. D. 1978. Prediction of the secondary structure of proteins from their amino acid sequence, *Adv Enzymol Relat Areas Mol Biol* 47: 45–148.

Cole, S. T., Eiglmeier, K., Parkhill, J., James, K. D., Thomson, N. R., Barrell, B. G. 2001. Massive gene decay in the leprosy bacillus, *Nature* 409(6823): 1007–11.

Datta, S., Datta, S., Kim, S., Chakraborty, S. & Gill, R. S. 2010. Statistical analyses of next generation sequence data: A partial overview, *J Proteomics Bioinform* 3(6): 183–190.

Diaz Romero, J. & Outschoorn, I. M. 1994. Current status of meningococcal group B vaccine candidates: capsular or noncapsular?, *Clin Microbiol Rev* 7(4): 559–75.

Dobrindt, U. & Hacker, J. 2001. Whole genome plasticity in pathogenic bacteria, *Curr Opin Microbiol* 4(5): 550–7.

Dobrindt, U., Janke, B., Piechaczek, K., Nagy, G., Ziebuhr, W., Fischer, G., Schierhorn, A., Hecker, M., Blum-Oehler, G. & Hacker, J. 2000. Toxin genes on pathogenicity islands: impact for microbial evolution, *Int J Med Microbiol* 290(4-5): 307–11.

Emini, E. A., Hughes, J. V., Perlow, D. S. & Boger, J. 1985. Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide, *J Virol* 55(3): 836–9.

Evans, M. C. 2008. Recent advances in immunoinformatics: application of in silico tools to drug development, *Curr Opin Drug Discov Devel* 11(2): 233–41.

Field, D., Feil, E. J. & Wilson, G. A. 2005. Databases and software for the comparison of prokaryotic genomes, *Microbiology* 151(Pt 7): 2125–32.

Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L. L. & Bateman, A. 2006. Pfam: clans, web tools and services, *Nucleic Acids Res* 34(Database issue): D247–51.

Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L. L., Eddy, S. R. & Bateman, A. 2010. The Pfam protein families database, *Nucleic Acids Res* 38(Database issue): D211–22.

Flower, D. R. & Doytchinova, I. A. 2002. Immunoinformatics and the prediction of immunogenicity, *Appl Bioinformatics* 1(4): 167–76.

Gibas, C. & Jambeck, P. 2001. Developing bioinformatics computer skills, *O'Reilly* 1(1): 21–22.

Gilks, W. R., Audit, B., De Angelis, D., Tsoka, S. & Ouzounis, C. A. 2002. Modeling the percolation of annotation errors in a database of protein sequences, *Bioinformatics* 18(12): 1641–9.

Hacker, J., Bender, L., Ott, M., Wingender, J., Lund, B., Marre, R. & Goebel, W. 1990. Deletions of chromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo in various extraintestinal Escherichia coli isolates, *Microb Pathog* 8(3): 213–25.

Hacker, J. & Carniel, E. 2001. Ecological fitness, genomic islands and bacterial pathogenicity. A Darwinian view of the evolution of microbes, *EMBO Rep* 2(5): 376–81.

Hershberg, R. & Petrov, D. A. 2009. General rules for optimal codon choice, *PLoS Genet* 5(7): e1000556.

Hopp, T. P. & Woods, K. R. 1981. Prediction of protein antigenic determinants from amino acid sequences, *Proc Natl Acad Sci U S A* 78(6): 3824–8.

Hou, Y. M. 1999. Transfer RNAs and pathogenicity islands, *Trends Biochem Sci* 24(8): 295–8.

Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Yeats, C. 2009. InterPro: the integrative protein signature database, *Nucleic Acids Res* 37(Database issue): D211–5.

Hyatt, D., Chen, G.-L., Locascio, P. F., Land, M. L., Larimer, F. W. & Hauser, L. J. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification, *BMC Bioinformatics* 11: 119.

Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S. & Madden, T. L. 2008. NCBI BLAST: a better web interface, *Nucleic Acids Res* 36(Web Server issue): W5–9.

Kaas, Q., Ruiz, M. & Lefranc, M. P. 2004. IMGT/3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data, *Nucleic Acids Res* 32(Database issue): D208–10.

Karlin, S., Mrázek, J. & Campbell, A. M. 1998. Codon usages in different gene classes of the Escherichia coli genome, *Mol Microbiol* 29(6): 1341–55.

Kendrew, J. 1999. In: The encyclopedia of molecular biology, *in* B. Science (ed.), *Gene*, Porto Alegre, pp. 343–401.

Kislyuk, A. O., Katz, L. S., Agrawal, S., Hagen, M. S., Conley, A. B., Jayaraman, P., Nelakuditi, V., Humphrey, J. C., Sammons, S. A., Govil, D., Mair, R. D., Tatti, K. M., Tondella, M. L., Harcourt, B. H., Mayer, L. W. & Jordan, I. K. 2010. A computational genomics pipeline for prokaryotic sequencing projects, *Bioinformatics* 26(15): 1819–26.

Krizova, L. & Nemec, A. 2010. A 63 kb genomic resistance island found in a multidrug-resistant Acinetobacter baumannii isolate of European clone I from 1977, *J Antimicrob Chemother* 65(9): 1915–8.

Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, *J Mol Biol* 305(3): 567–80.

Kumagai, I. & Tsumoto, K. 2001. Antigen-antibody binding, *Encyclopedia of Life Sciences*, Nature Publishing Group, pp. 1–7.

Langille, M. G. I. & Brinkman, F. S. L. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands, *Bioinformatics* 25(5): 664–5.

Langille, M. G. I., Hsiao, W. W. L. & Brinkman, F. S. L. 2008. Evaluation of genomic island predictors using a comparative genomics approach, *BMC Bioinformatics* 9: 329.

Lapierre, P. & Gogarten, J. P. 2009. Estimating the size of the bacterial pan-genome, *Trends Genet* 25(3): 107–10.

Larsen, J. E., Lund, O. & Nielsen, M. 2006. Improved method for predicting linear B-cell epitopes, *Immunome Res* 2: 2.

Lefranc, M. P., Giudicelli, V., Ginestoux, C., Jabado-Michaloud, J., Folch, G., Bellahcene, F., Wu, Y., Gemrot, E., Brochet, X., Lane, J., Regnier, L., Ehrenmann, F., Lefranc, G. & Duroux, P. 2009. IMGT, the international ImMunoGeneTics information system, *Nucleic Acids Res* 37(Database issue): D1006–12.

Lefébure, T. & Stanhope, M. J. 2007. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition, *Genome Biol* 8(5): R71.

Lerat, E. & Ochman, H. 2005. Recognizing the pseudogenes in bacterial genomes, *Nucleic Acids Res* 33(10): 3125–32.

Lesic, B., Bach, S., Ghigo, J.-M., Dobrindt, U., Hacker, J. & Carniel, E. 2004. Excision of the high-pathogenicity island of Yersinia pseudotuberculosis requires the combined actions of its cognate integrase and Hef, a new recombination directionality factor, *Mol Microbiol* 52(5): 1337–48.

Levitt, M. 1978. Conformational preferences of amino acids in globular proteins, *Biochemistry* 17(20): 4277–85.

Li, L., Shiga, M., Ching, W.-K. & Mamitsuka, H. 2010. Annotating gene functions with integrative spectral clustering on microarray expressions and sequences, *Genome Inform* 22: 95–120.

Liberman, F. 2004. *Análise dos fatores determinantes para a qualidade da anotação genômica automática*, Master's thesis, Universidade Católica de Brasília.

Lorenzi, H. A., Puiu, D., Miller, J. R., Brinkac, L. M., Amedeo, P., Hall, N. & Caler, E. V. 2010. New assembly, reannotation and analysis of the Entamoeba histolytica genome reveal new genomic features and protein content information, *PLoS Negl Trop Dis* 4(6): e716.

Lukashin, A. V. & Borodovsky, M. 1998. GeneMark.hmm: new solutions for gene finding, *Nucleic Acids Res* 26(4): 1107–15.

Lund, O., Nielsen, M., Kesmir, C., Petersen, A. G., Lundegaard, C., Worning, P., Sylvester-Hvid, C., Lamberth, K., Roder, G., Justesen, S., Buus, S. & Brunak, S. 2004. Definition of supertypes for HLA molecules using clustering of specificity matrices, *Immunogenetics* 55(12): 797–810.

Lundegaard, C., Lamberth, K., Harndahl, M., Buus, S., Lund, O. & Nielsen, M. 2008. NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11, *Nucleic Acids Res* 36(Web Server issue): W509–12.

Macallum, R. M., Martin, A. C. R. & Thornton, J. M. 1996. Antibody-antigen interactions: Contact analysis and binding site topography, *Journal of Molecular Biology* 262: 732–45.

Mathé, C., Sagot, M.-F., Schiex, T. & Rouzé, P. 2002. Current methods of gene prediction, their strengths and weaknesses, *Nucleic Acids Res* 30(19): 4103–17.

Maurelli, A. T. 2007. Black holes, antivirulence genes, and gene inactivation in the evolution of bacterial pathogens, *FEMS Microbiol Lett* 267(1): 1–8.

Maurelli, A. T., Fernández, R. E., Bloch, C. A., Rode, C. K. & Fasano, A. 1998. "Black holes" and bacterial pathogenicity: a large genomic deletion that enhances the virulence of Shigella spp. and enteroinvasive Escherichia coli, *Proc Natl Acad Sci U S A* 95(7): 3943–8.

Mazumder, R. & Vasudevan, S. 2008. Structure-guided comparative analysis of proteins: principles, tools, and applications for predicting function, *PLoS Comput Biol* 4(9): e1000151.

Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., Clausen, J., Kalinowski, J., Linke, B., Rupp, O., Giegerich, R. & Pühler, A. 2003. GenDB–an open source genome annotation system for prokaryote genomes, *Nucleic Acids Res* 31(8): 2187–95.

Mira, A., Martín-Cuadrado, A. B., D'Auria, G. & Rodríguez-Valera, F. 2010. The bacterial pan-genome: a new paradigm in microbiology, *Int Microbiol* 13(2): 45–57.


Whole Genome Annotation: In Silico Analysis 703

Schellenberg, M. J., Ritchie, D. B. MacMillan, A. M. 2008. Pre-mrna splicing: a complex

Schlessinger, A., Ofran, Y., Yachdav, G. Rost, B. 2006. Epitome: database of structureinferred antigenic epitopes, *Nucleic Acids Res* 34(Database issue): D777–80. Schmidt, H. Hensel, M. 2004. Pathogenicity islands in bacterial pathogenesis, *Clin Microbiol* 

Schonbach, C., Koh, J. L., Sheng, X., Wong, L. Brusic, V. 2000. Fimm, a database of functional

Sercarz, E. E. Maverakis, E. 2003. Mhc-guided processing: binding of large antigen

Servant, F., Bru, C., Carrère, S., Courcelle, E., Gouzy, J., Peyruc, D. Kahn, D. 2002. Prodom: automated clustering of homologous domains, *Brief Bioinform* 3(3): 246–51. Setúbal, J. Meidanis, J. 1997. *Introduction to Computational Molecular Biology*, Pacific Grove. Sibbald, M. J. J. B. van Dij, J. M. l. 2009. Secretome mapping in gram-positive pathogens. in

Sleator, R. D. 2010. An overview of the current status of eukaryote gene prediction

Smith, T. F. Waterman, M. S. 1981. Identification of common molecular subsequences, *J Mol* 

Suzuki, T. Sasakawa, C. 2001. Molecular basis of the intracellular spreading of shigella, *Infect* 

Takai, S., Hines, S. A., Sekizaki, T., Nicholson, V. M., Alperin, D. A., Osaki, M., Takamatsu,

Trost, B., Haakensen, M., Pittet, V., Ziola, B. Kusalik, A. 2010. Analysis and comparison of

Tumapa, S., Holden, M. T. G., Vesaratchavest, M., Wuthiekanun, V., Limmathurotsakul, D.,

Tyson, G. W., Chapman, J., Hugenholtz, P., Allen, E. E., Ram, R. J., Richardson, P. M.,

UniProt 2007. The universal protein resource (uniprot), *Nucleic Acids Res* 35(Database issue):

D., Nakamura, M., Suzuki, K., Ogino, N., Kakuda, T., Dan, H. Prescott, J. F. 2000. Dna sequence and comparison of virulence plasmids from rhodococcus equi atcc

the pan-genomic properties of sixteen well-characterized bacterial genera, *BMC* 

Chierakul, W., Feil, E. J., Currie, B. J., Day, N. P. J., Nierman, W. C. Peacock, S. J. 2008. Burkholderia pseudomallei genome plasticity associated with genomic island

Solovyev, V. V., Rubin, E. M., Rokhsar, D. S. Banfield, J. F. 2004. Community structure and metabolism through reconstruction of microbial genomes from the

Stein, L. 2001. Genome annotation: from sequence to biology, *Nat Rev Genet* 2(7): 493–503. Stothard, P. Wishart, D. S. 2006. Automated bacterial genome analysis and annotation, *Curr* 

karl wooldridge (ed.), bacterial secreted protein: Secretory mechanisms and role in

picture in higher definition, *Trends Biochem Sci* 33(6): 243–6.

molecular immunology, *Nucleic Acids Res* 28(1): 222–4.

pathogenesis, *Caister Academic Press* pp. 193–225.

33701 and 103, *Infect Immun* 68(12): 6840–7.

fragments, *Nat Rev Immunol* 3(8): 621–9.

strategies, *Gene* 461(1-2): 1–4.

*Opin Microbiol* 9(5): 505–10.

*Immun* 69(10): 5959–66.

*Microbiol* 10: 258.

D193–7.

variation, *BMC Genomics* 9: 190.

environment, *Nature* 428(6978): 37–43.

*Biol* 147(1): 195–7.

*Rev* 17(1): 14–56.


Nielsen, M., Lundegaard, C. Lund, O. 2007. Prediction of mhc class ii binding affinity using

Overbeek, R., Begley, T., Butler, R. M., Choudhuri, J. V., Chuang, H.-Y., Cohoon, M., de

Pareja, E., Pareja-Tobes, P., Manrique, M., Pareja-Tobes, E., Bonal, J. Tobes, R. 2006.

Parker, J. M., Guo, D. Hodges, R. S. 1986. New hydrophilicity scale derived from high-

Pearson, W. R. Lipman, D. J. 1988. Improved tools for biological sequence comparison, *Proc* 

Pellequer, J. L., Westhof, E. Van Regenmortel, M. H. 1991. Predicting location of continuous epitopes in proteins from their primary structures, *Methods Enzymol* 203: 176–201. Peters, B., Sidney, J., Bourne, P., Bui, H. H., Buus, S., Doh, G., Fleri, W., Kronenberg, M.,

Poptsova, M. S. Gogarten, J. P. 2010. Using comparative genome analysis to identify problems in annotated microbial genomes, *Microbiology* 156(Pt 7): 1909–17. Prosdocimi, F. 2003. Bioinformática: manual do usuário., *Biotecnologia Ciência &* 

Pundhir, S., Vijayvargiya, H. Kumar, A. 2008. Predictbias: a server for the identification of genomic and pathogenicity islands in prokaryotes, *In Silico Biol* 8(3-4): 223–34.

Retter, I., Althaus, H. H., Munch, R. Muller, W. 2005. Vbase2, an integrative v gene database,

Rinaudo, C. D., Telford, J. L., Rappuoli, R. Seib, K. L. 2009. Vaccinology in the genome era, *J* 

Ruppert, J., Sidney, J., Celis, E., Kubo, R. T., Grey, H. M. Sette, A. 1993. Prominent role of

Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A. Barrell, B. 2000. Artemis: sequence visualization and annotation, *Bioinformatics* 16(10): 944–5. Salimi, N., Fleri, W., Peters, B. Sette, A. 2010. Design and utilization of epitope-based

Salzberg, S. L., Delcher, A. L., Kasif, S. White, O. 1998. Microbial gene identification using

databases and predictive tools, *Immunogenetics* 62(4): 185–96.

interpolated markov models, *Nucleic Acids Res* 26(2): 544–8.

secondary anchor residues in peptide binding to hla-a2.1 molecules, *Cell* 74(5): 929–

Rappuoli, R. 2000. Reverse vaccinology, *Curr Opin Microbiol* 3(5): 445–50.

*Nucleic Acids Res* 33(Database issue): D671–4.

238.

3(3): e91.

37.

*Acids Res* 33(17): 5691–702.

*Biochemistry* 25(19): 5425–32.

*Desenvolvimento* 2(29): 2.

*Clin Invest* 119(9): 2515–25.

*Natl Acad Sci U S A* 85(8): 2444–8.

prokaryotic organisms, *BMC Microbiol* 6: 29.

smm-align, a novel stabilization matrix alignment method, *BMC Bioinformatics* 8:

Crécy-Lagard, V., Diaz, N., Disz, T. Edwards, e. a. 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes, *Nucleic* 

Extratrain: a database of extragenic regions and transcriptional information in

performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and x-ray-derived accessible sites,

Kubo, R., Lund, O., Nemazee, D., Ponomarenko, J. V., Sathiamurthy, M., Schoenberger, S., Stewart, S., Surko, P., Way, S., Wilson, S. Sette, A. 2005. The immune epitope database and analysis resource: from vision to blueprint, *PLoS Biol*


**Part 10** 

**Drug Design** 



**31** 


### **Designing of Anti-Cancer Drug Targeted to Bcl-2 Associated Athanogene (BAG1) Protein**

Amit Kumar, Kriti Verma and Amita Sinha  *Department of Bioinformatics and Molecular Biology,* 

*Bioaxis DNA Research Centre, Hyderabad, India* 

#### **1. Introduction**

Cancer is a disease of uncontrolled cell growth in tissues. This growth may lead to metastasis, which is the invasion of adjacent tissue and infiltration beyond the site of initiation. Cancer is initiated by activation of oncogenes or inactivation of tumor suppressor genes. Nearly 10-30% of all adenocarcinomas are due to mutations in the *K-ras* proto-oncogene. [1] The function and regulation of Bcl-2 proteins depend upon their interaction with other non-family member proteins, including NIP1, NIP2, NIP3, p53 BP2, Raf-1, CED-4, calcineurin, R-Ras and Bag-1, to form homo- and heterodimers.[21] Bag1 belongs to the Bcl-2 associated athanogene (BAG) family of multifunctional proteins. This widely expressed protein interacts with a number of signalling molecules (including Bcl2, HGF receptor and Raf1) as it regulates pathways involving cell survival, growth and differentiation. [13] The Bcl2 associated athanogene (BAG1) protein is involved in regulation of the Ras/Raf signal transduction pathway. Of particular relevance to tumour cells, BAG-1 interacts with the anti-apoptotic BCL-2 protein; various nuclear hormone receptors; the 70 kDa heat shock proteins, Hsc70 and Hsp70; and the serine/threonine kinase Raf-1, which plays an important role in the MAPK pathway.[2][3][4] Recent studies have shown that BAG-1 expression is frequently altered in malignant cells, and BAG-1 expression may have clinical value as a prognostic or predictive marker for various cancer types including breast cancer, prostate cancer and lung cancer (Fig 1).[6][7][8] Interaction with chaperones may account for many of the pleiotropic effects associated with BAG-1 overexpression. The finding that BAG-1 can independently associate with Raf-1 or Bcl-2 provides at least two mechanisms by which BAG-1 promotes cell survival. [20]

Bcl2-associated athanogene (BAG) family proteins participate in a wide variety of cellular processes to regulate growth control pathways, including cell survival (stress response), proliferation, migration, signalling and apoptosis (Fig 2).[2][5][18] This family of co-chaperones functionally regulates the signal transduction proteins Raf/MEK/ERK and transcription factors important for cell stress responses, apoptosis, proliferation, cell migration and hormone action. In response to stress, they bind to the heat shock proteins HSP70/HSC70, coordinating cell growth signals by down-regulating the activity of the serine/threonine kinase Raf-1, which plays an important role in the MAPK pathway. [5][9] The proteins show anti-apoptotic activity and increase the anti-cell death function of BCL-2 induced by various stimuli. Overexpression of BAG-1 suppresses activation of caspases and apoptosis induced by a very broad range of agents in different cell types, for example chemotherapeutic agents, radiation and growth factor withdrawal. Therefore, in addition to contributing to reduced cell death in cancer development, BAG-1 may also contribute to resistance to important therapeutic modalities.

Fig. 1. BAG1 expression in normal and diseased human tissues.

BAG-1 proteins are expressed as multiple isoforms generated by alternate translation initiation from a single mRNA. Translation of the major human BAG-1 isoform, BAG-1S, initiates at an internal AUG codon, whereas translation of the larger BAG-1L (p50) and BAG-1M proteins begins upstream at CUG and AUG codons, respectively.[6][9] Hence, the proteins share a common C-terminus, while the larger isoforms have additional N-terminal sequences. Various domains have been identified within BAG-1 proteins. [14] A potential nuclear localisation signal (NLS) has been identified within the unique N-terminal domain of BAG-1L; BAG-1S and BAG-1M lack this sequence. BAG-1S is largely located in the cytoplasm, in contrast to BAG-1M, which partitions between the nucleus and cytoplasm. [8]

At the carboxy terminus of all BAG-1 isoforms there is a conserved region of about 110 amino acids, named the '**BAG domain**', which binds and regulates Hsp70/Hsc70 molecular chaperones. [8][23] BAG domains are present in Bcl-2-associated athanogene 1 and silencer of death domains.

The crystal structure of the BAG domain revealed that it consists of three anti-parallel α-helices. In the BAG domain, the first and second α-helices interact with the serine/threonine kinase Raf-1, and the second and third α-helices interact with the ATP-binding pocket of Hsc70/Hsp70. Therefore, Raf-1 and Hsp70/Hsc70 have partially overlapping binding sites and their binding is competitive. [2] [8]

BAG-1 promotes cell growth by binding to and stimulating Raf-1 activity. The binding of Hsp70 to BAG-1 diminishes Raf-1 signalling, inhibits subsequent events such as DNA synthesis, and arrests the cell cycle. When cellular levels of Hsp70 are elevated during stress, or in cells conditionally overexpressing Hsp70, the Bag1-Raf-1 complex is displaced by Bag1-Hsp70, and DNA synthesis is arrested.[5][10] Thus, BAG-1 has been suggested to function as a molecular switch that allows cells to proliferate under normal conditions but become quiescent under a stressful environment.[16]
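The competition between Raf-1 and Hsp70 for BAG-1 can be illustrated with a simple equilibrium model. The sketch below is not taken from this study. It assumes a single, mutually exclusive binding site (a simplification of the partially overlapping sites), free ligand concentrations in excess of BAG-1, and hypothetical dissociation constants in arbitrary units.

```python
def bag1_occupancy(raf1, hsp70, kd_raf1, kd_hsp70):
    """Fractions of BAG-1 that are free, Raf-1-bound and Hsp70-bound.

    Standard single-site competitive binding at equilibrium:
    each ligand contributes concentration / Kd to the partition sum.
    """
    r = raf1 / kd_raf1
    h = hsp70 / kd_hsp70
    total = 1.0 + r + h
    return {"free": 1.0 / total, "raf1": r / total, "hsp70": h / total}


# Hypothetical concentrations and Kd values (arbitrary units), chosen only
# to contrast a low-Hsp70 state with a stress-like, high-Hsp70 state.
normal = bag1_occupancy(raf1=1.0, hsp70=0.1, kd_raf1=1.0, kd_hsp70=1.0)
stress = bag1_occupancy(raf1=1.0, hsp70=10.0, kd_raf1=1.0, kd_hsp70=1.0)

for label, state in (("normal", normal), ("stress", stress)):
    print(label, {name: round(value, 3) for name, value in state.items()})
```

With these illustrative numbers, raising Hsp70 lowers the Raf-1-bound fraction of BAG-1, which is the qualitative behaviour of the proposed molecular switch.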

The C-terminus of the BAG domain is also a site of interaction with Bcl-2 which provides a supra-additive anti-apoptotic effect. The BAG-1 protein shares no significant homology with Bcl-2 or other Bcl-2 family proteins, which can form homo- and heterodimers. [11]

All BAG-1 isoforms also contain a ubiquitin-like domain (ULD), similar to ubiquitin and ubiquitin-like proteins, that appears to be essential for at least some biological effects.[8][7][15] Although the precise function of the ULD in BAG-1 is unknown, BAG-1 isoforms are very stable proteins, suggesting that they are not generally targets for degradation by the ubiquitin/proteasome system and are not covalently attached to other proteins.

#### **2. Role of BAG1 and Bcl2 in apoptosis**


Bcl-2 is an anti-apoptotic protein located mainly on the outer membrane of mitochondria. It has been found that overexpression of Bcl-2 inhibits cells from undergoing apoptosis in response to various stimuli. [12] The members of the Bcl-2 family share one or more of the four characteristic domains of homology, termed the Bcl-2 homology (BH) domains (named BH1, BH2, BH3 and BH4).[11][13] The BH domains are known to be crucial for the function of these proteins, as deletion of these domains via molecular cloning affects survival/apoptosis rates. The anti-apoptotic Bcl-2 proteins, such as Bcl-2 and Bcl-xL, conserve all four BH domains. Bcl-2 interacts with the pro-apoptotic proteins BAX and BAK. The hydrophobic unit of Bcl-2 forms a heterodimer with the amphipathic unit of BAX and BAK. This heterodimer formation inhibits release of cytochrome c from the mitochondria and prevents activation of caspases. The protein encoded by the BAG1 gene binds to BCL2 and is referred to as BCL2-associated athanogene. It enhances the anti-apoptotic effects of BCL2. [12][13]

Fig. 2. Interaction of BAG1 protein with other proteins and cellular components. [8]


#### **3. Role of Raf-1/ MAPK pathway in cancer**

The pathways regulated by BAG-1 play key roles in the development and progression of cancer and in determining response to therapy. Among the MAPK pathways, the extracellular signal-regulated kinase (ERK) pathway plays a key role in promoting cellular proliferation, survival and metastasis, and directly affects the initiation and progression of human tumors. This pathway has been found to be activated in numerous cancer types without obvious genetic mutations. Cell lines derived from various organs such as pancreas, colon, lung, ovary and kidney have been reported to show a high degree of MAP kinase activation, as observed in 50 tumor cell lines. However, it has been found that the constitutive activation of MAP kinases in tumor cells is not due to a disorder of the MAP kinases themselves, but to a disorder of Raf-1, Ras, or some other signaling molecule upstream of Ras.[3]

The Ras/Raf/MEK/ERK pathway also interacts with the p53 pathway, thereby regulating the activity and subcellular localization of Bcl-2 family proteins (Bim, Bak, Bax, Puma and Noxa). Thus the Raf/MEK/ERK pathway has different effects on growth, prevention of apoptosis, cell cycle arrest and induction of drug resistance in cells of various lineages.[3][4]

#### **4. Methodology used in present study**

The strategy used in this project is to target the first alpha helix of the BAG domain. Binding of a ligand to the first alpha helix has two simultaneous effects. First, it blocks the site for Raf1 binding and thus blocks the MAPK pathway. Second, it leaves the second and third alpha helices available for Hsp70 binding. Binding of Hsp70 to the BAG1 protein renders the heat shock protein inactive, as BAG1 has been found to have an inhibitory effect on Hsp proteins. This should produce pseudo-stress conditions and attenuate DNA synthesis and cellular proliferation.

Thus, the aim of the present study is to design a drug targeted to the first alpha helix of the 'BAG domain' of the BAG1 protein. The drug should bind at the competitive binding site of Raf1 and Hsp70, thereby blocking Raf1 binding, leaving the domain available for Hsp70 binding and hence suppressing the anti-apoptotic activity of BAG1. The therapeutic goal is to arrest further tumor progression and trigger tumor-selective cell death by disrupting the balance between pro-apoptotic proteins and anti-apoptotic proteins (Fig 3).

#### **5. Important tools and databases**

NCBI is a primary database used mainly for sequence retrieval and similarity-based searches. We used NCBI to retrieve the query protein sequence. BLAST is the most widely used sequence similarity search programme. It finds regions of local similarity between sequences. In this study protein BLAST was used extensively. The PubMed database was used primarily for literature searches, including journals, abstracts, full-text articles and other sources related to the research. PDB is a repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids, obtained by X-ray crystallography and NMR spectroscopy. PDB was used to retrieve the 3D structure of the protein. Biology Workbench is a web-based tool that integrates access to a wide variety of analysis and modeling tools. This tool was used for phylogenetic analysis of the Bag1 protein sequences.
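To make the idea of "local similarity" concrete, the following minimal sketch scores the best local alignment between two sequences with the Smith-Waterman dynamic programming recurrence. This is an illustration only: BLAST itself uses a faster seed-and-extend heuristic and substitution matrices rather than the full recurrence, and the match, mismatch and gap values and the toy sequences below are assumptions, not settings from this study.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Toy sequences (hypothetical): a shared core "ACGT" flanked by unrelated ends.
print(smith_waterman_score("XXACGTYY", "ZZACGTWW"))
```

Because scores are floored at zero, the unrelated flanking residues do not penalize the shared core, which is what distinguishes a local from a global alignment.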


ClustalW is a multiple sequence alignment program that calculates the best match for the selected DNA or protein sequences and lines them up so that identities, similarities and differences can be seen. Boxshade displays the global alignment of all sequences, with conserved and similar residues emphasized by various degrees of shading. The KEGG database was used for pathway analysis of the Bag1 protein. GeneCards was used in this work to obtain expression and sequence related information pertaining to the Bag1 protein. In this work all the above programs have been used to obtain the peptide recognition pattern for the interpretation of results.
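The shading of conserved and similar residues can be sketched as a per-column classification of an existing alignment. The example below is a simplified illustration, not Boxshade's actual algorithm: the similarity groups and the toy aligned sequences are assumptions chosen for demonstration.

```python
# Illustrative similarity groups (an assumption; real tools use configurable
# groups or substitution-matrix thresholds).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def shade_columns(aligned):
    """Return one mark per column: '*' identical, ':' similar, ' ' otherwise."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:          # a gap means the column is not conserved
            marks.append(" ")
        elif len(residues) == 1:     # every sequence has the same residue
            marks.append("*")
        elif any(residues <= group for group in SIMILAR_GROUPS):
            marks.append(":")        # all residues fall in one similarity group
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical pre-aligned fragments ('-' marks a gap).
alignment = ["MKLV-A", "MRIVGA", "MKLVGA"]
for seq in alignment:
    print(seq)
print(shade_columns(alignment))
```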

Fig. 3. Flowchart depicting the drug targeted strategy.

with the maximum score i.e. 1HX1 was chosen and viewed in Rasmol (Fig. 7). The PDB ID 1HX1 shows the 3D structure of two molecules, Hsp70 and the Bag domain, as Chain A (400 aa) and Chain B (114 aa) respectively. Chain B, i.e. the Bag domain (receptor), was isolated and its energy was optimized to 3734.78 au using ArgusLab. The molecule converged at 298.92 kcal/mol.

CASTP was used to find the active pocket in the receptor protein. The amino acid lysine at position 172 in Chain B (receptor molecule) was selected as the target residue for ArgusLab docking as it is the most hydrophilic residue in the active pocket.

A number of small molecules that bind to the target were searched for by screening libraries of potential drug compounds. The toxic effects and pharmacodynamics of the compounds were tested by ADME/Tox. The compound Carmustine was chosen from the drug library as it followed Lipinski's rule of five and had the best combination of required properties for a potential drug candidate (Table 2).

The molecule Carmustine was designed in ArgusLab and geometry optimization of the drug converged at an energy of 22.2 kcal/mol. Total energy of the compound converged at

The drug was docked to the target residue using ArgusLab and Hex5.1. In Hex, the drug docked to the receptor with an Emax value (energy) of -94.68 kcal/mol and an Emin value of -166.49 kcal/mol (Fig. 9). In ArgusLab, the drug docked to the target receptor with an energy of -5.51 kcal/mol.

Fig. 4. Dendrogram depicting the phylogenetic relationship of Bag1 (Query) protein in Homo sapiens with Bag1 protein of other model organisms (humans).

#### **6. Structure analysis tools**

ProtParam is a tool which allows the computation of various physical and chemical parameters for a protein sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
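As an illustration of one of these parameters, GRAVY is the mean Kyte-Doolittle hydropathy value over all residues of a sequence. The following is a minimal sketch of that calculation, not the ProtParam implementation; the function name `gravy` and the example sequences in the usage note are hypothetical and are not the BAG1-L sequence.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathicity of a protein sequence.

    Hypothetical helper: sums the Kyte-Doolittle value of every residue
    and divides by the sequence length. Non-standard residue codes raise
    a KeyError rather than being silently skipped.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)
```

A negative result, such as the one obtained for a lysine-rich toy sequence like `gravy("KKEA")`, indicates a sequence that is hydrophilic on average, while a positive result indicates a hydrophobic one.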

The HNN method was used for secondary structure prediction of the Bag1 protein. CPHmodels is a web server, built on a collection of methods and databases, that predicts protein 3D structure by single-template homology modeling; it predicts the structure from the amino acid sequence with respect to distance constraints. CPHmodels was used for tertiary structure prediction of the Bag1 protein.

The 3D structure, obtained in PDB file format, was viewed using RasMol.

ProDom and PROSITE are comprehensive databases of protein domains and families. PROSITE offers tools for protein sequence analysis and motif detection.

STRING is a database of known and predicted protein interactions.

CASTP (Computed Atlas of Surface Topography of Proteins) is a server that provides identification and measurements of surface accessible pockets as well as interior inaccessible cavities for proteins and other molecules.

#### **7. Drug discovery tools**

DrugBank, PubChem, the Therapeutic Target Database (TTD) and Tocris were the databases searched for potential drug candidates.

ArgusLab was used as a molecular modelling program to optimize the target receptor protein and design a drug targeted to the BAG1 protein. The quantum mechanical calculations were performed using the Argus compute server.

HEX5.1 is an interactive protein docking and molecular superposition program. HEX5.1 was used for docking of the selected drug candidate with the BAG1 target protein.

#### **8. Results and discussion**

The BAG1 isoform 1-L sequence was retrieved from NCBI. The BAG1-L protein is 345 amino acids in length. Sequence analysis by BLASTp shows that the Bag1-L protein has the highest identity with *Bos taurus* (83%), followed by *Mus musculus* (80%). BLAST results for the BAG1 protein compared with other model organisms are shown in Table 1.
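The identity percentages quoted above can be recomputed from the match counts in the BLAST report. The sketch below is an illustration only, assuming the plain-text `Identities = matches/length` line format shown in Table 1; the function name `identity_percent` is hypothetical. Integer division is used because it reproduces the reported 83% and 80% for the *Bos taurus* and *Mus musculus* hits.

```python
import re

# Matches the "Identities = 198/236" part of a BLAST alignment summary line.
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def identity_percent(summary_line: str) -> tuple[int, int, int]:
    """Return (matches, alignment_length, percent) from a BLAST summary line.

    Hypothetical helper: the percentage is rounded down, which agrees with
    the 83% (198/236) and 80% (172/214) values quoted in the text.
    """
    found = IDENTITIES.search(summary_line)
    if found is None:
        raise ValueError("no 'Identities = x/y' field found")
    matches, length = int(found.group(1)), int(found.group(2))
    if length == 0:
        raise ValueError("alignment length must be positive")
    return matches, length, (100 * matches) // length
```

For example, passing the *Bos taurus* line `"Identities = 198/236 (83%), Positives = 214/236 (90%), Gaps = 1/236"` yields 198 matches over an alignment length of 236.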

The evolutionary relationship of the BAG1 protein among various species was obtained in the form of a dendrogram (Fig. 4). The query protein was seen to be most closely related to *Mus musculus* and showed a distinct evolutionary relationship with *Suberites domuncula*. The highly conserved regions in the amino acid sequence of the BAG1 protein among various model organisms were analysed using Boxshade (Fig. 5). The amino acids glycine and glutamine were found to be the most conserved residues among all the species analysed.

The structural analysis of the BAG1-L protein was carried out using ProtParam, HNN and CPHmodels. Since the GRAVY value is negative (-0.905), it can be inferred that the protein is a hydrophilic molecule present at the cell surface. It is an unstable protein. It consists of 40.00% alpha helices, and no beta bridges or turns are present (Fig. 6). Using CPHmodels, the 3D structure of the protein was retrieved in the form of a PDB ID. The PDB ID with the maximum score, i.e. 1HX1, was chosen and viewed in RasMol (Fig. 7). PDB ID 1HX1 shows the 3D structure of two molecules, Hsp70 and the Bag domain, as Chain A (400 aa) and Chain B (114 aa) respectively. Chain B, i.e. the Bag domain (receptor), was isolated and its energy was optimized to 3734.78 au using ArgusLab. The molecule converged at 298.92 kcal/mol.
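Isolating a single chain from a PDB entry can be done with a few lines of code, because the PDB format stores the chain identifier in column 22 of every coordinate record. The following is a minimal sketch of that step, not the procedure used in ArgusLab; the function name `extract_chain` is hypothetical, and the PDB text is assumed to have been downloaded beforehand.

```python
def extract_chain(pdb_text: str, chain_id: str) -> str:
    """Return PDB-format text containing only the records of one chain.

    Hypothetical helper: keeps ATOM, HETATM and TER records whose chain
    identifier (column 22, i.e. string index 21) equals chain_id, then
    appends a closing END record.
    """
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM", "TER") and len(line) > 21:
            if line[21] == chain_id:
                kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

Calling `extract_chain(text, "B")` on the contents of a two-chain file would keep only the chain B coordinates, which could then be saved as the receptor file for optimization and docking.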

CASTp was used to find the active pocket in the receptor protein. The amino acid lysine at position 172 in Chain B (the receptor molecule) was selected as the target residue for ArgusLab docking, as it is the most hydrophilic residue in the active pocket.

A number of small molecules that bind to the target were identified by screening libraries of potential drug compounds. The toxic effects and pharmacodynamics of the compounds were tested using ADME/Tox. The compound Carmustine was chosen from the drug library as it followed Lipinski's rule of five and had the best combination of required properties for a potential drug candidate (Table 2).
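Lipinski's rule of five flags compounds with a molecular weight above 500 Da, a logP above 5, more than 5 hydrogen-bond donors or more than 10 hydrogen-bond acceptors. The sketch below shows one way such a filter could be applied to a compound library; it is an illustration under stated assumptions, not the screening procedure used in this study. The class `Descriptors`, the function `lipinski_violations` and any descriptor values passed to them are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed molecular descriptors for one compound (hypothetical type)."""
    mol_weight: float        # molecular weight in Da
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int       # number of hydrogen-bond donors
    h_bond_acceptors: int    # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> list[str]:
    """Return the names of the rule-of-five criteria that the compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations
```

A compound with an empty violation list satisfies all four criteria. How many violations to tolerate is a design choice; a common convention accepts compounds with at most one.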

The molecule Carmustine was designed in ArgusLab, and geometry optimization of the drug converged at an energy of 22.2 kcal/mol. The total energy of the compound converged at -101.413 au.

The drug was docked to the target residue using ArgusLab and Hex5.1. In Hex, the drug docked to the receptor with an Emax value (energy) of -94.68 kcal/mol and an Emin value of -166.49 kcal/mol (Fig. 9). In ArgusLab, the drug docked to the target receptor with an energy of -5.51 kcal/mol.

Fig. 4. Dendrogram depicting the phylogenetic relationship of the Bag1 (query) protein in *Homo sapiens* (humans) with the Bag1 proteins of other model organisms.


Fig. 5. Boxshade showing some of the highly conserved (green) and similar (cyan) pattern for Bag1 protein in different model organisms.

Fig. 6. The secondary structure analysis result of Bag1 by HNN tool. The alpha helices are shaded blue, beta sheets are shaded red.


Fig. 7. Visualization of the Bag1 protein using RasMol, showing the 3D structure details.

Fig. 8. STRING analysis of the BAG1 protein showing its interactions with other proteins in humans. The number of lines represents the strength of interaction.

| Model organism | Gene ID | Description | Score, bits (raw) | Expect | Identities | Positives | Gaps |
|---|---|---|---|---|---|---|---|
| *Homo sapiens* | 573 BAG1 | BCL2-associated athanogene | 724 (1671) | 0.0 | 345/345 (100%) | 345/345 (100%) | 0/345 (0%) |
| *Bos taurus* | 613855 BAG1 | BCL2-associated athanogene | 413 (950) | 1e-115 | 198/236 (83%) | 214/236 (90%) | 1/236 (0%) |
| *Mus musculus* | 12017 Bag1 | BCL2-associated athanogene 1 | 362 (831) | 1e-99 | 172/214 (80%) | 188/214 (87%) | 0/214 (0%) |
| *Gallus gallus* | 420967 BAG1 | BCL2-associated athanogene | 310 (712) | 8e-85 | 144/209 (68%) | 179/209 (85%) | 2/209 (0%) |
| *Rattus norvegicus* | 297994 Bag1 | BCL2-associated athanogene | 497 (1145) | 9e-141 | 250/358 (69%) | 286/358 (79%) | 13/358 (3%) |
| *Dictyostelium discoideum* | 8616246 sonA | UAS domain-containing protein [Dictyostelium discoideum AX4] | 34.5 (102) | 0.053 | 17/49 (34%) | 29/49 (59%) | 1/49 (2%) |
| *Caenorhabditis elegans* | 172373 bag-1 | BAG1 (human) homolog | 51.5 (160) | 8e-07 | 44/145 (30%) | 81/145 (55%) | 8/145 (5%) |

Method for all alignments: compositional matrix adjust.

Fig. 9. Docking analysis result of Bag1 protein with the drug Carmustine in Hex5.1.


| Model organism | Gene ID / accession | Description | Score, bits (raw) | Expect | Identities | Positives | Gaps |
|---|---|---|---|---|---|---|---|
| *Suberites domuncula* | emb\|CAJ65915.1\| | BAG family molecular chaperone regulator 1 (Length=258) | 99.5 (324) | 4e-19 | 63/186 (33%) | 110/186 (59%) | 4/186 (2%) |
| *Brugia malayi* | 6105907 Bm1\_55120 | BAG domain containing protein | 47.1 (145) | 5e-06 | 57/199 (30%) | 92/199 (46%) | 8/199 (4%) |

Table 1. Results of the query sequence BLAST against different model organisms.

Table 2. List of various potential drug candidates for binding with the Bag domain of the Bag1 protein.

BAG proteins, which have anti-apoptotic activity, promote cell growth by binding to and stimulating Raf-1 activity. BAG-1 binds to the serine/threonine kinase Raf-1 or to Hsc70/Hsp70 in a mutually exclusive interaction. The binding of Hsp70 to BAG-1 diminishes Raf-1 signalling and inhibits subsequent events, such as DNA synthesis, and also arrests the cell cycle. Hence Bag1 plays an important role in the progression of cancers when overexpressed. The 345 amino acid long protein sequence of BAG1-L was obtained from NCBI, and a BLASTp search was performed to analyze its evolutionary relationships with its counterparts in various model organisms. This was confirmed by the phylogenetic analysis done using the SDSC Biology Workbench. The dendrogram shows that Bag1 has a close evolutionary relationship with *Mus musculus.*

From the primary structure analysis it was concluded that the BAG1 protein is a surface protein which is hydrophilic in nature. Its secondary structure analysis confirmed that it contains more alpha helices and no beta sheets. Its 3D structure was obtained in the form of a PDB ID and viewed in RasMol. Chain B of the PDB structure represents the BAG domain. Various confirmatory tools were used for validation of the results. The geometry and energy of the BAG domain were optimized in ArgusLab. Using CASTp, the active pocket in the BAG domain was identified, and the most hydrophilic residue in the first alpha helix of the BAG domain, i.e. LYS at position 242 of the BAG1-L protein sequence obtained from NCBI (or position 172 in Chain B of PDB ID 1HX1), was selected as the target residue.

A drug library was maintained of possible lead compounds that follow the Lipinski rule of five, and their toxicity and disposition were checked using ADME/Tox. These candidate drugs were docked to the target receptor. The drug Carmustine showed the best docking result, with a docking energy of -5.51 kcal/mol. As the docking (Fig. 9) was successful in both HEX5.1 and ArgusLab, it can be concluded that Carmustine can be a potential drug for BAG1 binding and arresting tumor progression. Further analysis must be performed on this drug for use in the treatment of cancer.

#### **9. Acknowledgements**

This work was made possible by the efforts of the staff and the research associates of the Department of Molecular Biology, Bioaxis DNA Research Centre, Hyderabad, India. In addition, the authors would like to thank all the technical staff of the instrumental section for developing and maintaining the various databases and tools mentioned in this article.

#### **10. References**

[1] Downward J. Targeting RAS signalling pathways in cancer therapy. *Nat Rev Cancer.* (2003) 3(1):11-22.

[2] Klára Briknarová, Shinichi Takayama, Lars Brive, Marnie L. Havert, Deborah A. Knee, Jesus Velasco, Sachiko Homma, Edelmira Cabezas, Joan Stuart, David W. Hoyt, Arnold C. Satterthwait, Miguel Llinás, John C. Reed & Kathryn R. Ely. Structural analysis of BAG1 cochaperone and its interactions with Hsc70 heat shock protein. *Nature Structural Biology.* (2001) 8(4):349-52.

[3] Sebolt-Leopold JS. Oncovera Therapeutics, Ann Arbor. Advances in the development of cancer therapeutics directed against the RAS-mitogen-activated protein kinase pathway. *Clin Cancer Res.* (2008) 14:3651-3656.


[4] Hoshino R, Chatani Y, Yamori T, Tsuruo T, Oka H, Yoshida O, Shimada Y, Ari-i S, Wada H, Fujimoto J, Kohno M. Constitutive activation of the 41-/43-kDa mitogen-activated protein kinase signaling pathway in human tumors. *Oncogene.* (1999) 18(3):813-22.

[5] Sharp A, Crabb SJ, Townsend PA, Cutress RI, Brimmell M, Wang XH, Packham G. BAG-1 in carcinogenesis. *Expert Rev Mol Med.* (2004) 6(7):1-15.

[6] Rudolf Götz, Boris W Kramer, Guadalupe Camarero and Ulf R Rapp. BAG-1 haploinsufficiency impairs lung tumorigenesis. *BMC Cancer.* (2004) 24; 4:85.

[7] HE Maki, OR Saramäki, L Shatkina, PM Martikainen, TLJ Tammela, WM van Weerden, RL Vessella, ACB Cato and T Visakorpi. Overexpression and gene amplification of BAG-1L in hormone-refractory prostate cancer. *J Pathol.* (2007) 212:395-401.

[8] R I Cutress, P A Townsend, M Brimmell, A C Bateman, A Hague and G Packham. BAG-1 expression and function in human cancer. *British Journal of Cancer.* (2002) 87:834-839.

[9] Jaewhan Song, Masahiro Takeda & Richard I. Morimoto. Bag1-Hsp70 mediates a physiological stress signalling pathway that regulates Raf-1/ERK and cell growth. *Nature Cell Biology.* (2001) 3:276-282.

[10] Ellen A. A. Nollen, Jeanette F. Brunsting, Jaewhan Song, Harm H. Kampinga and Richard I. Morimoto. Bag1 Functions In Vivo as a Negative Regulator of Hsp70 Chaperone Activity. *Mol Cell Biol.* (2000) 20(3):1083-8.

[11] Ruth M. Kluck, Ella Bossy-Wetzel, Douglas R. Green, Donald D. Newmeyer. The Release of Cytochrome c from Mitochondria: A Primary Site for Bcl-2 Regulation of Apoptosis. *Science.* (1997) 275(5303):1132-6.

[12] Yang J, Liu X, Bhalla K, Kim CN, Ibrado AM, Cai J, Peng TI, Jones DP, Wang X. Prevention of apoptosis by Bcl-2: release of cytochrome c from mitochondria blocked. *Science.* (1997) 275(5303):1129-32.

[13] Wang HG, Takayama S, Rapp UR, Reed JC. Bcl-2 interacting protein, BAG-1, binds to and activates the kinase Raf-1. *Proc. Natl. Acad. Sci. U.S.A.* (1996) 93(14):7063-8.

[14] Takayama S, Xie Z, Reed JC. An evolutionarily conserved family of Hsp70/Hsc70 molecular chaperone regulators. *The Journal of Biological Chem.* (1999) 274(2):781-786.

[15] Takayama S, Krajewski S, Krajewska M, Kitada S, Zapata JM, Kochel K, Knee D, Scudiero D, Tudor G, Miller GJ, Miyashita T, Yamada M, Reed JC. Expression and location of Hsp70/Hsc-binding anti-apoptotic protein BAG-1 and its variants in normal tissues and tumor cell lines. *Cancer Res.* (1998) 15; 58(14):3116-31.

[16] Sondermann H, Scheufler C, Schneider C, Hohfeld J, Hartl FU, Moarefi I. Structure of a Bag/Hsc70 complex: convergent functional evolution of Hsp70 nucleotide exchange factors. *Science.* (2001) 291(5508):1553-7.

[17] Sharp A, Crabb SJ, Johnson PW, Hague A, Cutress R, Townsend PA, Ganesan A, Packham G. Thioflavin S (NSC71948) interferes with Bcl-2-associated athanogene (BAG-1)-mediated protein-protein interactions. *J Pharmacol Exp Ther.* (2009) 331(2):680-689.

[18] Takahashi N, Yanagihara M, Ogawa Y, Yamanoha B, Andoh T. Down-regulation of Bcl-2-interacting protein BAG-1 confers resistance to anti-cancer drugs. *Biochem Biophys Res Commun.* (2003) 14; 301(3):798-803.

[19] Rorke S, Murphy S, Khalifa M, Chernenko G, Tang SC. Prognostic significance of BAG-1 expression in nonsmall cell lung cancer. *Int J Cancer.* (2001) 20; 95(5):317-22.

[20] Shinichi Takayama, David N. Bimston, Shu-ichi Matsuzawa, Brian C. Freeman, Christine Aime-Sempe, Zhihua Xie, Richard I. Morimoto and John C. Reed. BAG-1 modulates the chaperone activity of Hsp70/Hsc70. *The EMBO Journal.* (1997) 16:4887-4896.

[21] Graham P, Matthew B and John LC. Mammalian cells express two differently localized Bag-1 isoforms generated by alternative translation initiation. *Biochem. J.* (1997) 328:807-813.

[22] Yang X, Pater A, Tang SC. Cloning and characterization of the human BAG-1 gene promoter: upregulation by tumor-derived p53 mutants. *Oncogene.* (1999) 12; 18(32):4546-53.

[23] Zeiner M, Gehring U. A protein that interacts with members of the nuclear hormone receptor family: identification and cDNA cloning. *Proc Natl Acad Sci U S A.* (1995) 5; 92(25):11465-9.

[24] Moriyama T, Littell RD, Debernardo R, Oliva E, Lynch MP, Rueda BR, Duska LR. BAG-1 expression in normal and neoplastic endometrium. *Gynecol Oncol.* (2004) 94(2):289-95.

[25] Clemo NK, Collard TJ, Southern SL, Edwards KD, Moorghen M, Packham G, Hague A, Paraskeva C, Williams AC. BAG-1 is up-regulated in colorectal tumour progression and promotes colorectal tumour cell survival through increased NF-kappaB activity. *Carcinogenesis.* (2008) 29(4):849-57.

[26] Clemo NK, Arhel NJ, Barnes JD, Baker J, Moorghen M, Packham GK, Paraskeva C, Williams AC. The role of the retinoblastoma protein (Rb) in the nuclear localization of BAG-1: implications for colorectal tumour cell survival. *Biochem Soc Trans.* (2005) 33(Pt 4):676-8.

[27] Townsend PA, Cutress RI, Sharp A, Brimmell M, Packham G. BAG-1 prevents stress-induced long-term growth inhibition in breast cancer cells via a chaperone-dependent pathway. *Cancer Res.* (2003) 15; 63(14):4150-7.

[28] Kikuchi R, Noguchi T, Takeno S, Funada Y, Moriyama H, Uchida Y. Nuclear BAG-1 expression reflects malignant potential in colorectal carcinomas. *Br J Cancer.* (2002) 4; 87(10):1136-9.

[29] Liu HY, Wang ZM, Bai Y, Wang M, Li Y, Wei S, Zhou QH, Chen J. Different BAG-1 isoforms have distinct functions in modulating chemotherapeutic-induced apoptosis in breast cancer cells. *Acta Pharmacol Sin.* (2009) 30(2):235-41.

[30] Vora HH, Mehta SV, Shah KN, Brahmbhatt BV, Desai NS, Shukla SN, Shah PM. Cytoplasmic localization of BAG-1 in leukoplakia and carcinoma of the tongue: correlation with p53 and c-erbB2 in carcinoma. *Int J Biol Markers.* (2007) 22(2):100-7.

[31] Pusztai L, Krishnamurti S, Perez Cardona J, Sneige N, Esteva FJ, Volchenok M, Breitenfelder P, Kau SW, Takayama S, Krajewski S, Reed JC, Bast RC Jr, Hortobagyi GN. Expression of BAG-1 and BcL-2 proteins before and after neoadjuvant chemotherapy of locally advanced breast cancer. *Cancer Invest.* (2004) 22(2):248-56.

[32] Tang SC. BAG-1, an anti-apoptotic tumour marker. *IUBMB Life.* (2002) Feb; 53(2):99-105.

[33] Brive L, Takayama S, Briknarová K, Homma S, Ishida SK, Reed JC, Ely KR. The carboxyl-terminal lobe of Hsc70 ATPase domain is sufficient for binding to BAG1. *Biochem Biophys Res Commun.* (2001) 21; 289(5):1099-105.

[34] Nadler Y, Camp RL, Giltnane JM, Moeder C, Rimm DL, Kluger HM, Kluger Y. Expression patterns and prognostic value of Bag-1 and Bcl-2 in breast cancer. *Breast Cancer Res.* (2008) 10(2):R35.

[35] Lüders J, Demand J, Höhfeld J. The ubiquitin-related BAG-1 provides a link between the molecular chaperones Hsc70/Hsp70 and the proteasome. *J Biol Chem.* (2000) 275(7):4613-7.

[36] McCubrey JA, Steelman LS, Chappell WH, Abrams SL, Wong EW, Chang F, Lehmann B, Terrian DM, Milella M, Tafuri A, Stivala F, Libra M, Basecke J, Evangelisti C, Martelli AM, Franklin RA. Roles of the Raf/MEK/ERK pathway in cell growth, malignant transformation and drug resistance. *Biochim Biophys Acta.* (2007) 1773(8):1263-84.

### *Edited by Mahmood A. Mahdavi*

Bioinformatics - Trends and Methodologies is a collection of different views on the most recent topics and basic concepts in bioinformatics. This book suits young researchers who seek the fundamentals of bioinformatic skills such as data mining, data integration, sequence analysis and gene expression analysis, as well as scientists who are interested in current research in computational biology and bioinformatics, including next generation sequencing, transcriptional analysis and drug design. Because of the rapid development of new technologies in molecular biology, new bioinformatic techniques emerge accordingly to keep pace with the in silico development of life science. This book focuses partly on such new techniques and their applications in biomedical science. These techniques may be useful in the identification of some diseases and cellular disorders and may narrow down the number of experiments required for medical diagnostics.
