**Microarray Analysis in Drug Discovery and Biomarker Identification**

Yushi Liu1 and Joseph S. Verducci2 *1Lovelace Respiratory Research Institute 2The Ohio State University USA* 

#### **1. Introduction**

202 Medicinal Chemistry and Drug Design

Puri, A., R. P. Saxena, et al. (1992). "Immunostimulant activity of Picroliv, the iridoid

Re, F. and J. L. Strominger (2002). "Monomeric recombinant MD-2 binds toll-like receptor 4

Rezaei, N. (2006). "Therapeutic targeting of pattern-recognition receptors." *Int.* 

Rock, F. L., G. Hardiman, et al. (1998). "A family of human receptors structurally related to

Romagne, F. (2007). "Current and future drugs targeting one class of innate immunity

Sansonetti, P. J. (2006). "The innate signaling of dangers and the dangers of innate signaling."

Schwartz, R. S. (2000). "Advances in Immunology-A New Series of Review Articles." *New* 

Seifert, R., G. Schultz, et al. (1990). "Activation of superoxide formation and lysozyme

Singh, M. and D. O'Hagan (1999). "Advances in vaccine adjuvants." *Nature Biotechnology* 17:

Stahl, P. D. and R. A. Ezekowitz (1998). "The mannose receptors is a pattern recognition receptor involved in host defence." *Current Opinion in Immunology* 10: 50-5. Stanley, M. A. (2002). "Imiquimod and the imidazoquinolones: mechanism of action and therapeutic potential." *Clinical and Experimental Dermatology* 27: 571-577. Takeuchi, O. and S. Akira (2010). "Pattern recognition receptors and inflammation." *Cell* 140:

Taylor, M. E. and K. Drickamer (1993). "Structural requirements for high affinity binding of

Thatte, U. M. and S. A. Dahanukar (1997). "The 'Rasayana' Concept: Clues from

Wack, A. and R. Rappuoli (2005). "Vaccinology at the beginning of the 21st century." *Current* 

Werts, C., S. E. Girardin, et al. (2006). "TIR, CARD and PYRIN: three domains for an

Wiesmüller, K.-H., W. G. Bessler, et al. (1992). "Solid phase peptide synthesis of lipopeptide

Wiesmüller, K.-H., G. Jung, et al. (1989). "Novel low-molecular-weight synthetic vaccine

complex ligands by the macrophages mannose receptors." *Journal of Biological* 

Immunodulatory Therapy. In: upadhyay S. (ed); Narosa Publishing House, New

vaccines eliciting epitope-specific B-, T-helper and T-killer cell response."

against foot-and-mouth disease containing a potent B-cell and macrophage

release in human neutrophils by the synthetic lipopeptide Pam3Cys-Ser-(Lys)4. Involvement of guanine-nucleotide-binding proteins and synergism with

receptors: The toll-like receptors." *Drug Discov. Today* 12: 80–87.

donovani infection in Hamsters." *Planta Medica* 58: 528-532.

Drosophila Toll." *Proc Natl Acad Sci USA* 95: 588-93.

chemotactic peptides." *Biochemical Journal* 267: 795-802.

*Chemistry* 277: 23427-23432.

*Immunopharmacol.* 6: 863–869.

*Nat Immunology* 7: 1237-42.

1075-1081.

805–820.

*Chemistry* 268: 399-404.

Delhi." *Immunomodulation*: 141-148.

*in Opinion Immunology* 17: 411-418.

activator." *Vaccine* 7: 29-33.

antimicrobial triad." *Cell Death Differ.* 13: 798–815.

*International Journal of Peptide Protein Research* 40: 255-60.

*England Journal of Medicine* 343: 61.

glycoside fraction of Picrorhiza kurroa, and its protective action against Leishmania

tightly and confers lipopolysaccharide responsiveness." *Journal of Biological* 

#### **1.1 Motivation of microarray development**

Microarrays play important roles in Medicinal Chemistry and Drug Discovery. In the Premicroarray era, scientists used to study one gene at a time. This approach is costly and time consuming. Quite often, many genes that interact with each other would be ignored. Therefore, the discovery of candidate drug targets is challenging, requiring the rapid development of techniques to identify the difference genomic profiling in disease and normal conditions, which will facilitate the understanding of the disease mechanism and the development of potential drugs for disease treatment.

#### **1.2 Examples of microarray in biomarker identification**

Microarray has been successfully applied to the comparison of genomic profiling for human tissues. One advantages of microarray is that it can find some potential drug targets which have been ignored previously. The example for this is the study by Heller *et al* in rheumatoid tissues. (Heller et al, 1997) They found around 100 genes known to be involved in inflammation. (Heller et al, 1997) However, additional genes such as interleukin-6 and matrix metallo-elastase are also found to be overexpressed remarkably, which is not anticipated *a priori*, since matrix metallo-elastase is thought to be distributed only within alveolar macrophages and placental cells.(Debouck and Goodfellow, 1999). Beside human, microarray has also been successfully applied to model organisms such as mouse. (Debouck and Goodfellow, 1999) Animal models play important roles in discovering therapeutic targets and potential drug development. Although the genome for the animals does not agree completely with the human genome, they are more easily to be manipulated. By careful design of the experiments, the treatment effect can be seen more clearly with less noisy background. Moreover, genes can be either knocked down or overexpressed to study the influence on phenotype. People used to use techniques such as differential display PCR to discover genes that are differentially expressed in the animal models and achieved some success. (Wang et al, 1995) However, this technique is much slower compared with microarray.

Quite often, drugs can bind to specific targets within cells and potentially influence different pathways.(Windle & Guiseppi-Elie, 2003) The genes that are differentially expressed between drug-treated and untreated conditions are typically known as biomarkers. Such biomarkers not only help to identify patients at risk, but they also may lead to breakthroughs in understanding the mechanism for different diseases.(Ko et al., 2005)An example is the review by Chen that summarized the recent microarray research in biomarker identification in atherosclerosis and in-stent stenosis.(Chen et al., 2011)

#### **1.3 Diagnosis using microarray**

Microarray has greatly advanced the biology field and the biomarkers identified can be further used to form classifiers for prediction in clinical studies. For example, Gulob *et al* used the gene expression pattern from microarrays to classify acute myeloid leukaemia (AML) and acute lymphoblastic leukaemia (ALL) without other information. (Gulob et al, 1999) Therefore, this field draws the attention not only from biologists but also statisticians and bioinformaticians. Through their collaborative efforts, there are many successful instances. We will introduce some of them in section two later.

The rapid development of this technique also resulted in several FDA approved test. AmpliChip CYP450 test is a clinical test to find specific genetic variation of two cytochrome P450 genes CYP2D6 and CYP2C19 genes including deletion and duplications. (de Leon, 2006) These two enzymes account for the variability of drug metabolism for each patient and offers enriched information for the doctors during prescription of psychiatric drugs. (de Leon, 2006) CYP2D6 can be divided into four categories: Poor Metabolizer, Intermediate Metabolizer, Extensive Metabolizer, and Ultrarapid Metabolizer. (de Leon, 2006) Similarly, for CYP2C19, only two categories are found: Poor Metabolizer and Extensive Metabolizer. (de Leon, 2006) The assay works as follows: First, the gene is amplified by PCR and then the amplified product will be fragmented and labelled. Subsequently, these fragments will be hybridized to the microarray chip and the chip is scanned for further analysis. (de Leon, 2006) For further information of this FDA approved test, please see the website at http://molecular .roche.com/assays/Pages/AmpliChipCYP450Test.aspx.

Another FDA approved diagnosis test is MammaPrint to assess the risk of breast tumor and this will help to decide the effectiveness of chemotherapy on the patients. (van't Veer et al, 2002) The assay uses the fresh tissue to study the Amsterdam 70-gene breast cancer gene signature by microarray analysis. (van't Veer et al, 2002) Readers interested in this test can also obtain more information about the MINDACT trial (Microarray In Node negative and 1-3 positive lymph node Disease may Avoid Chemo Therapy) in the paper by Cardoso et al.(Cardoso et al, 2008)

In general, identification of biomarkers by microarray greatly speeds the progress of research by enabling the simultaneous monitoring of the expression of thousands of genes. However, there are many potential pitfalls in analyzing the output from these arrays. (Verducci, et al., 2006) Due to importance of proper analysis, we will give a brief introduction to the statistical methodology underlying proper analysis.

#### **2. Mechanisms and processing of microarrays**

Medicinal chemistry has increasingly employed microarrays to identify both key target genes and gene networks that can regulate the effectiveness of drugs. The basic scheme is

between drug-treated and untreated conditions are typically known as biomarkers. Such biomarkers not only help to identify patients at risk, but they also may lead to breakthroughs in understanding the mechanism for different diseases.(Ko et al., 2005)An example is the review by Chen that summarized the recent microarray research in

Microarray has greatly advanced the biology field and the biomarkers identified can be further used to form classifiers for prediction in clinical studies. For example, Gulob *et al* used the gene expression pattern from microarrays to classify acute myeloid leukaemia (AML) and acute lymphoblastic leukaemia (ALL) without other information. (Gulob et al, 1999) Therefore, this field draws the attention not only from biologists but also statisticians and bioinformaticians. Through their collaborative efforts, there are many successful

The rapid development of this technique also resulted in several FDA approved test. AmpliChip CYP450 test is a clinical test to find specific genetic variation of two cytochrome P450 genes CYP2D6 and CYP2C19 genes including deletion and duplications. (de Leon, 2006) These two enzymes account for the variability of drug metabolism for each patient and offers enriched information for the doctors during prescription of psychiatric drugs. (de Leon, 2006) CYP2D6 can be divided into four categories: Poor Metabolizer, Intermediate Metabolizer, Extensive Metabolizer, and Ultrarapid Metabolizer. (de Leon, 2006) Similarly, for CYP2C19, only two categories are found: Poor Metabolizer and Extensive Metabolizer. (de Leon, 2006) The assay works as follows: First, the gene is amplified by PCR and then the amplified product will be fragmented and labelled. Subsequently, these fragments will be hybridized to the microarray chip and the chip is scanned for further analysis. (de Leon, 2006) For further information of this FDA approved test, please see the website at

Another FDA approved diagnosis test is MammaPrint to assess the risk of breast tumor and this will help to decide the effectiveness of chemotherapy on the patients. (van't Veer et al, 2002) The assay uses the fresh tissue to study the Amsterdam 70-gene breast cancer gene signature by microarray analysis. (van't Veer et al, 2002) Readers interested in this test can also obtain more information about the MINDACT trial (Microarray In Node negative and 1-3 positive lymph node Disease may Avoid Chemo Therapy) in the paper by Cardoso et

In general, identification of biomarkers by microarray greatly speeds the progress of research by enabling the simultaneous monitoring of the expression of thousands of genes. However, there are many potential pitfalls in analyzing the output from these arrays. (Verducci, et al., 2006) Due to importance of proper analysis, we will give a brief

Medicinal chemistry has increasingly employed microarrays to identify both key target genes and gene networks that can regulate the effectiveness of drugs. The basic scheme is

biomarker identification in atherosclerosis and in-stent stenosis.(Chen et al., 2011)

instances. We will introduce some of them in section two later.

http://molecular .roche.com/assays/Pages/AmpliChipCYP450Test.aspx.

introduction to the statistical methodology underlying proper analysis.

**2. Mechanisms and processing of microarrays** 

**1.3 Diagnosis using microarray** 

al.(Cardoso et al, 2008)

illustrated in Figure 1.1. Two cell strains (one is drug treated and one is non-treated) are harvested and the whole RNA from each strain is then extracted. This is followed by reverse transcription of RNA and the resulted cDNA is labelled with either of the two fluorescence dyes (Cys-3 or Cys-5). Then the mixed cDNA from both samples is hybridized to the probesets on the microarray chip. The probesets are the small oligonucleotides that have the complementary sequence of the cDNA attached to the array at each spot. After intensive washing, the intensity from the fluorescence of the dye labelled on cDNA at each spot is measured and recorded. These data will be used for further analysis.

Fig. 1. Illustration of the microarray process. RNA is extracted from treated (treatment) and untreated (control) cell lines, followed by reverse transcription. The reversely transcribed cDNA is then labelled with the fluorescence dye and hybridized to the probes containing complementary fragments. After unbounded cDNA is washed away, the binding at each probe is then quantified based on the fluorescence intensity of the bounded cDNA.

The above method is referred to as a two channel array because a mixture of cDNA from two treatments is measured directly. In contrast, a one channel array will only have cDNA from one sample to be hybridized to the probesets. In this case the fluorescence intensity from each sample is measured separately. For either type of array, processing and analyzing the data present both statistical and biological challenges. Fortunately, many such approaches have been integrated in the freely distrubuted statistical software R (http://www.r-project.org/) and the software Bioconductor (http://www. bioconductor. org/). Typically the data processing step includes four steps: image analysis, quality assessment, pre-processing and statistical inference.(Tibshirani et al., 2005)

#### **2.1 Image analysis and quality assessment**

In the image analysis step, each spot is quantified and then converted to intensity afterwards. The method of quantification depends on the brand of the arrays. Quality assessment is usually performed at two levels: array level and probe level. On the array level, fingerprint smudges or washed out corners, are generally recognized. Other problems such as defects of the array, errors in RNA extraction also belong in this category. One common criterion is that if the percentage of spots without any signal is higher than 30%, the expression array will fail in the quality control step.(Tibshirani et al., 2005) Poor quality at probe level will include errors like faulty printing, uneven distribution, contamination, or poor level of signal to noise ratio. In addition, several parameters can be used to determine the quality of the array: uniformity, which is minimal variation in pixel intensity within a spot; and the brightness, which is the foreground to background ratio.(Tibshirani et al., 2005) Normally, researchers will ignore the spots of poor quality in subsequent analysis.

#### **2.2 Pre-processing**

Two types of errors usually happen at this stage: (1) systematic error, which influences all measurements within one microarray chip with similar effect -- this error may be corrected by estimation; and (2) random error that cannot be explained or corrected, which is typically known as noise. Such errors are totally stochastic and have different influence on different probes.(Tibshirani et al., 2005) Typically, the pre-processing stage contains three steps: background correction, normalization and summarization. For the widely used Affymetrix chips, many Bioconductor routines are available in R for pre-processing. These require creation of an *AffyBatch* object based on raw Affymetrix data (in a .cel file). The first step is the background adjustment. In this step, one tends to subtract the control intensity from the treatment, to 'denoise' the intensity. However, direct subtraction of uncertain quantities can increase the level of noise and possibly result in negative intensity values for certain spots. Various methods to circumvent these problems are available as *method* parameters in the *bg.correct* function in *R*:


$$\mathbf{W}\_{\mathbf{k}(\mathbf{x},\mathbf{y})} = \mathbf{1} / \left( \mathbf{d}^2 \mathbf{z}\_{\mathbf{k}(\mathbf{x},\mathbf{y})} + \mathbf{S}\_0 \right) \tag{1}$$

In the above formula, the weight is determined by the Euclidean distance from (x,y) to the centroid of the space k and the smoothing coefficient represented by d2 k and S0, respectively.(Tibshirani et al, 2005) Irizarry *et al*. (2003) compare *RMA* and *MAS 5.0* in detail.

*c. Ideal mismatch*: Neither of the above methods uses mismatch information. Although direct subtraction of the mismatch intensity from the perfect match intensity creates the

the expression array will fail in the quality control step.(Tibshirani et al., 2005) Poor quality at probe level will include errors like faulty printing, uneven distribution, contamination, or poor level of signal to noise ratio. In addition, several parameters can be used to determine the quality of the array: uniformity, which is minimal variation in pixel intensity within a spot; and the brightness, which is the foreground to background ratio.(Tibshirani et al., 2005) Normally, researchers will ignore the spots of poor quality in subsequent analysis.

Two types of errors usually happen at this stage: (1) systematic error, which influences all measurements within one microarray chip with similar effect -- this error may be corrected by estimation; and (2) random error that cannot be explained or corrected, which is typically known as noise. Such errors are totally stochastic and have different influence on different probes.(Tibshirani et al., 2005) Typically, the pre-processing stage contains three steps: background correction, normalization and summarization. For the widely used Affymetrix chips, many Bioconductor routines are available in R for pre-processing. These require creation of an *AffyBatch* object based on raw Affymetrix data (in a .cel file). The first step is the background adjustment. In this step, one tends to subtract the control intensity from the treatment, to 'denoise' the intensity. However, direct subtraction of uncertain quantities can increase the level of noise and possibly result in negative intensity values for certain spots. Various methods to circumvent these problems are available as *method* parameters in the

*a. RMA method*: which is based on the assumption that the observed signal is a mixture of

*b. MAS 5.0 method*: due to the above disadvantages, *RMA* may not produce optimal result. Therefore, *MAS 5.0* is sometimes used instead. Here, the whole array can be partitioned as k rectangular grids.(Tibshirani et al, 2005) The probeset, with lowest intensity for the grid, is used as the noise value to calculate the background corrected intensity within a particular rectangle. The intensities of these probes are further adjusted according to the weighted average of the background intensity of all grids

In the above formula, the weight is determined by the Euclidean distance from (x,y) to the centroid of the space k and the smoothing coefficient represented by d2k and S0, respectively.(Tibshirani et al, 2005) Irizarry *et al*. (2003) compare *RMA* and *MAS 5.0* in detail. *c. Ideal mismatch*: Neither of the above methods uses mismatch information. Although direct subtraction of the mismatch intensity from the perfect match intensity creates the

mean *α.* Thus the fluorescence intensity *O* we observe is the addition of the signal and noise. Assuming the above, *E*(S|O*)*, which is the conditional expectation of the signal based on the observed intensity will be used as the background corrected values. However, the disadvantages for this are: only the PM(perfect match) values are used and MM(mismatch) values, which contains useful information for background noise are discarded.(Tibshirani et al., 2005) and the results may not be robust if there are gross deviations from the model assumptions. These assumptions may be checked visually

*<sup>2</sup>* and exponential signal (S) with

k(x,y)+S0) (1)

Gaussian background noise (N) with meanμvariance

**2.2 Pre-processing** 

*bg.correct* function in *R*:

via different plotting methods.

according to the following formula:

Wk(x,y)=1/(d2

problems of added noise and negative intensity, *ideal mismatch* adjusts the observed mismatch so that it will never be higher than the PM intensities. The detailed formula of this is available in (Tibshirani et al., 2005) for further reading.

The next step is normalization of scores across different microarrays so that they can be compared fairly with each other. A variety of methods available in the *normalize* function of Bioconductor will be introduced*:*


The final step of preprocessing step is the summation, which is trying to integrate intensity values from multiple probes of a particular gene and obtain its expression value. The R routines *expresso* and *threestep* offer great flexibility in deciding how much to weight each probe. (Tibshirani et al., 2005) Summation completes the pre-processing step, and we are now ready to begin proper analysis.

#### **2.3 Statistical methods for biomarker identification in microarray analysis**

#### **2.3.1 Introduction to basic microarray analysis**

Since microarray can be viewed as high-dimensional dataset with fewer replicates. Traditional variable selection procedures like stepwise selection cannot identify the biomarkers effectively; modifications or new procedures are developed to accommodate this.

*a. Shrinkage Methods*: One particular drawback from stepwise selection of genes that distinguish treatment from control is its poor performance when the variables (gene expression levels) are highly correlated. However, this is exactly what happens on for microarray data since many genes on the array typically are involved in the same pathway. This inspired the development of the shrinkage methods, which can be viewed as constrained optimization. One advantage of shrinkage methods is they are more continuous than the subset selection and do not exhibit high variance. (Hastie et al., 2001) Theoretically, shrinkage methods do not minimize the residual sum of squares; instead, they impose a penalty on the residual sum of squares. Nowadays, different forms of penalty are proposed and some of the most commonly used ones are introduced here.

*Ridge Regression*: Ridge regression introduces a penalty on the size of the coefficients, thus leading to the shrinkage of the regression coefficients.(Hastie et al., 2001) Mathematically, *ridge* solves the following:

$$\tilde{\boldsymbol{\beta}}^{\text{ridge}} = \underset{\boldsymbol{\beta}}{\text{arg min}} \{ \sum\_{i=1}^{N} (\mathbf{y}\_i - \boldsymbol{\beta}\_0 - \sum\_{j=1}^{p} \mathbf{x}\_{ij} \boldsymbol{\beta}\_j)^2 + \lambda \sum\_{j=1}^{p} \boldsymbol{\beta}\_j^2 \} \tag{2}$$

Here, λ (the regularization parameter) is greater than or equal to zero and controls the amount of shrinkage towards zero. When the regularization parameter is zero, this approach is converted back to ordinary least square (OLS) estimation. The penalized formulation (2) has an equivalent formulation in terms of constrained optimization, which may be achieved using convex programming methods:

$$\begin{aligned} \bar{\beta}^{ridge} &= \underset{\beta}{\arg\min} \sum\_{i=1}^{N} \left( y\_i - \beta\_0 - \sum\_{j=1}^{p} x\_{ij}\beta\_j \right)^2 \\ \sum\_{j=1}^{p} \beta\_j^2 &\le s \end{aligned} \tag{3}$$

Ridge estimation is suitable for situations when many correlated variables are present in the model.(Hastie et al., 2001) In these cases, the least squares estimator may be poorly determined, since the large positive coefficients may cancel out the negative coefficients on the correlated variables.(Hastie et al., 2001) Ridge regression can effectively prevent this from happening. As the unique solution to (2), the ridge estimator has explicit form:

$$
\tilde{\boldsymbol{\beta}}^{\text{ridge}} = \left(\mathbf{X}^{\dagger}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{\dagger}\mathbf{Y} \tag{4}
$$

Since microarray can be viewed as high-dimensional dataset with fewer replicates. Traditional variable selection procedures like stepwise selection cannot identify the biomarkers effectively; modifications or new procedures are developed to accommodate

*a. Shrinkage Methods*: One particular drawback from stepwise selection of genes that distinguish treatment from control is its poor performance when the variables (gene expression levels) are highly correlated. However, this is exactly what happens on for microarray data since many genes on the array typically are involved in the same pathway. This inspired the development of the shrinkage methods, which can be viewed as constrained optimization. One advantage of shrinkage methods is they are more continuous than the subset selection and do not exhibit high variance. (Hastie et al., 2001) Theoretically, shrinkage methods do not minimize the residual sum of squares; instead, they impose a penalty on the residual sum of squares. Nowadays, different forms of penalty are proposed and some of the most commonly used ones are

*Ridge Regression*: Ridge regression introduces a penalty on the size of the coefficients, thus leading to the shrinkage of the regression coefficients.(Hastie et al., 2001) Mathematically,

0

Here, λ (the regularization parameter) is greater than or equal to zero and controls the amount of shrinkage towards zero. When the regularization parameter is zero, this approach is converted back to ordinary least square (OLS) estimation. The penalized formulation (2) has an equivalent formulation in terms of constrained optimization, which

*<sup>N</sup> p p ridge*

2 1

*s*

*p j j*

may be achieved using convex programming methods:

11 1 arg min{ ( ) }

0 1 1

*y x*

*i i j j*

argmin ( )

*i j*

*<sup>N</sup> <sup>p</sup> ridge*

Ridge estimation is suitable for situations when many correlated variables are present in the model.(Hastie et al., 2001) In these cases, the least squares estimator may be poorly determined, since the large positive coefficients may cancel out the negative coefficients on the correlated variables.(Hastie et al., 2001) Ridge regression can effectively prevent this

<sup>1</sup> ( ) *ridge t t*

 

from happening. As the unique solution to (2), the ridge estimator has explicit form:

*i i j j j ij j y x*

  2

(2)

 

2

*XX I XY* (4)

(3)

  2

**2.3 Statistical methods for biomarker identification in microarray analysis** 

**2.3.1 Introduction to basic microarray analysis** 

this.

*ridge* 

introduced here.

solves the following:

where I is the identity matrix. Hence, adding a positive constant to the diagonal of Xt X allows a singular matrix to be inverted, effectively reducing the dependencies among the estimated coefficients. This was the original motivation. (Hoerl & Kennard, 1970)

*Lasso Regression:* Lasso regression is similar to ridge regression, simply replacing the L2 norm ridge penalty in (2) by an L1 norm penalty. The lasso form of (3) becomes

$$\begin{aligned} \tilde{\boldsymbol{\beta}}^{\text{lasso}} &= \underset{\boldsymbol{\beta}}{\arg\min} \sum\_{i=1}^{N} (y\_i - \boldsymbol{\beta}\_0 - \sum\_{j=1}^{p} \boldsymbol{x}\_{ij} \boldsymbol{\beta}\_j) \\ \sum\_{j=1}^{p} \left| \boldsymbol{\beta}\_j \right| &\le \mathbf{t} \end{aligned} \tag{5}$$

Changing from an L2 to an L1 penalty results in a estimator that is nonlinear in y.(Hastie et al., 2001) In this case, nonlinear programming is needed to get the lasso solution iteratively.(Hastie et al., 2001) The algorithms of the least angle regression (LARS) can be simply modified to implement the lasso.(Bradley et al., 2004) A special feature of the lasso is the "sparseness" of the solution: some of the coefficients become exactly zero the constraint *t* becomes sufficiently small. If t is large enough, then no shrinkage is performed. For the case of orthonormal columns of X, the lasso has a simple form in terms of the OLS coefficients and the penalty :

$$\left| \text{sign}(\hat{\boldsymbol{\beta}}\_{j}) (\middle| \hat{\boldsymbol{\beta}}\_{j} \middle| - \boldsymbol{\gamma})\_{+} \right. \tag{6}$$

*Bridge Regression*: Both ridge and lasso regressions are very popular. They can be generalized to bridge regression to achieve the some of the benefits of both.(Ildiko & Friedman, 1993) In bridge regression, people try to find β that satisfies the following, where *1 ≤ q ≤ 2*:

$$\begin{aligned} \tilde{\boldsymbol{\beta}} &= \operatorname\*{arg\,min}\_{\boldsymbol{\beta}} \sum\_{i=1}^{N} \left( y\_i - \beta\_0 - \sum\_{j=1}^{p} x\_{ij} \beta\_j \right)^2 \\ &\sum\_{j=1}^{p} \left| \beta\_j \right|^q \le s \end{aligned} \tag{7}$$

*SCAD Regression:* Despite the good theoretical properties lasso/ridge regression, these penalties do not achieve the following desired properties simultaneously: unbiasedness (the estimator is close to the true parameter when the true parameter is large), sparseness (irrelevant predictors are automatically removed), and continuity (estimator is continuous, preventing the instability of hard thresh holding estimators).(Fan & Li, 2001) The new penalty is defined for a parameter as

$$p^{\prime} \_{\lambda} (\theta) = \lambda \left\{ I \left( \theta \le \lambda \right) + \frac{\left( a\lambda - \theta \right)\_{+}}{\left( a - 1 \right) \lambda} I \left( \theta > \lambda \right) \right\} \tag{8}$$

This penalty, known as the Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan & Li, 2001) helps to improve the properties of L1 and hard thresh holding penalties such

 <sup>2</sup> <sup>2</sup> *p I* ( ) . The SCAD penalty can also be viewed as a quadratic spline function, with knots at λ and aλ.(Fan & Li, 2001) Also, it does not extremely penalize large values of θ and the solution is continuous.(Fan & Li, 2001) Fan gave the solution to SCAD in the context of wavelets.(Fan, 1997) The SCAD estimator has the following form:

$$\begin{aligned} \hat{\theta} &= \operatorname{sign}(z) (|z| - \lambda)\_+ \quad \text{when} \quad |z| \le 2\lambda\\ \hat{\theta} &= \{(a - 1)z - \operatorname{sign}(z)a\lambda\} / \{a - 2\} \text{ when } 2\lambda < |z| \le a\lambda\\ \hat{\theta} &= z \text{ when } |z| > a\lambda \end{aligned} \tag{9}$$

One feature of this penalty is the oracle property: when some variables are not present in the true model, these corresponding coefficients tend to be estimated as zero when the sample size gets large.(Fan & Li, 2001) Asymptotically, it provides as good estimation of the coefficients as if the underlying model were known beforehand.(Fan & Li, 2001) Selection and estimation of the variable coefficients is automatic and simultaneous.(Fan & Li, 2001) Among the many instances where SCAD has been applied to microarrays for biomarker selection, the study by Wang *et al*. (2007) successfully discovered 71 potential transcriptional factors (TF) in the cell cycle of yeast. These included 19 out of 21 known and experimentally verified TFs related to the cell cycle. Additional TFs showed periodic transcriptional effects and thus were biologically important and worth further study.(Wang et al., 2007)

*b. Methods Involving Derived Inputs: Principal Components Regression:* When a large number of correlated inputs (e.g. potential biomarkers) are available, instead of keeping all these inputs in a regression model, it may be beneficial to consider just a few linear combinations of them. A logically justifiable choice for the coefficients used in the linear combination is the normalized vector *a* that gives the largest sample variance of all possible normalized linear combinations of the input variables. (Hastie et al., 2001) This is called the first principal component. The first *p* principal components are found sequentially, with each successive component maximizing input variation subject to being orthogonal to previous ones. When all the principal components are used, the method becomes the usual least square estimation. However, when fewer principal components are used, this method is similar to ridge regression.(Hastie et al., 2001) As an example, Tan *et al*. (2005) used total principal component regression to classify the tumors into different categories.

*Partial Least Squares:* In contrast to principal component regression, which uses linear combination only of the input variables, partial least square(PLS) also allows for some information from the response variable y in the linear combinations.(Hastie et al., 2001) The algorithm is as follows: Assume y is centered and each xj is properly standardized. PLS first regresses y on each *xj* to obtain the corresponding coefficient 1 ˆ *<sup>j</sup>* . Subsequently, we can define the first partial least square direction as 1 1 *j j z x* . (Hastie et al., 2001) Then *Y* is regressed on *z1* to obtain the corresponding coefficients which is followed by orthogonalizing *x1…xp* in reference to *z1*.(Hastie et al., 2001) This process is repeated until the desired number of directions is reached. As with principal component regression, using all *p* directions results in the usual least square estimation.(Hastie et al., 2001) Differently from principal component regression, PLS seeks input that is in the direction of high variation and high correlation with *y*.(Hastie et al., 2001) In addition, when the inputs are

function, with knots at λ and aλ.(Fan & Li, 2001) Also, it does not extremely penalize large values of θ and the solution is continuous.(Fan & Li, 2001) Fan gave the solution to SCAD in

*a z sign z a a when z a*

the context of wavelets.(Fan, 1997) The SCAD estimator has the following form:

ˆ ( 1) ( ) /( 2) 2

and thus were biologically important and worth further study.(Wang et al., 2007)

One feature of this penalty is the oracle property: when some variables are not present in the true model, these corresponding coefficients tend to be estimated as zero when the sample size gets large.(Fan & Li, 2001) Asymptotically, it provides as good estimation of the coefficients as if the underlying model were known beforehand.(Fan & Li, 2001) Selection and estimation of the variable coefficients is automatic and simultaneous.(Fan & Li, 2001) Among the many instances where SCAD has been applied to microarrays for biomarker selection, the study by Wang *et al*. (2007) successfully discovered 71 potential transcriptional factors (TF) in the cell cycle of yeast. These included 19 out of 21 known and experimentally verified TFs related to the cell cycle. Additional TFs showed periodic transcriptional effects

*b. Methods Involving Derived Inputs: Principal Components Regression:* When a large number of correlated inputs (e.g. potential biomarkers) are available, instead of keeping all these inputs in a regression model, it may be beneficial to consider just a few linear combinations of them. A logically justifiable choice for the coefficients used in the linear combination is the normalized vector *a* that gives the largest sample variance of all possible normalized linear combinations of the input variables. (Hastie et al., 2001) This is called the first principal component. The first *p* principal components are found sequentially, with each successive component maximizing input variation subject to being orthogonal to previous ones. When all the principal components are used, the method becomes the usual least square estimation. However, when fewer principal components are used, this method is similar to ridge regression.(Hastie et al., 2001) As an example, Tan *et al*. (2005) used total principal component regression to classify the

*Partial Least Squares:* In contrast to principal component regression, which uses linear combination only of the input variables, partial least square(PLS) also allows for some information from the response variable y in the linear combinations.(Hastie et al., 2001) The algorithm is as follows: Assume y is centered and each xj is properly standardized. PLS first

regressed on *z1* to obtain the corresponding coefficients which is followed by orthogonalizing *x1…xp* in reference to *z1*.(Hastie et al., 2001) This process is repeated until the desired number of directions is reached. As with principal component regression, using all *p* directions results in the usual least square estimation.(Hastie et al., 2001) Differently from principal component regression, PLS seeks input that is in the direction of high variation and high correlation with *y*.(Hastie et al., 2001) In addition, when the inputs are

regresses y on each *xj* to obtain the corresponding coefficient 1

define the first partial least square direction as 1 1 *j j z x*

 

*z when z a*

ˆ 2

 

*sign z z when z*

. The SCAD penalty can also be viewed as a quadratic spline

ˆ

*<sup>j</sup>* . Subsequently, we can

. (Hastie et al., 2001) Then *Y* is

(9)

<sup>2</sup> <sup>2</sup> *p I* ( )

 

 

ˆ

tumors into different categories.

  orthogonal to each other, the PLS will coincide with the least square estimates after the 1st step, setting the coefficients to be zero for the subsequent steps.

Because of its good properties with small sample sizes and many predictors, PLS has been applied to high-dimensional genomic data.(Boulesteix & Strimmer, 2006; Aaroe, *et al.*, 2010) Moreover, PLS regression can also be used to impute missing data. For example, Bras and Menezes (2006) impute missing values using PLS regression with all the genes as predictors. Huang *et al.* (2004) use a penalized version of PLS, which removes genes with poor power of prediction, in order to predict LVAD (left mechanical ventricular assist device) support time. In this case, the shrinkage parameter and the number of latent components are obtained using cross-validation. After proper shrinkage, some genes have coefficient zero, thus removing them from the model.(Huang et al., 2004) This reduces the complexity of the model, and serves as an example for combining both shrinkage and PLS.

*c. Bayes Variable Selection Methods:* Bayesian approaches use knowledge across genes for further inference.(Nott et al., 2007) For example, George and Foster (2000) adopted a binomial prior for the number of the differentially expressed genes (i.e. biomarkers) and a normal prior for their corresponding coefficients, assuming known and constant variance parameter. With informed choice of the hyperparameters, the authors ranked the genes according to the posterior probability that the gene belonged to the differentially expressed gene set or not. Interestingly, the gene ranking agreed with the ranking obtained by other criteria such as AIC (Akaike, 1973) or BIC (Schwartz, 1979; Nott *et al*., 2007) When the variance parameters were unknown, different priors were assumed for effects of the genes and variance of the genes. Lonnstedt and Speed used a normal prior for effects of the differentially expressed genes and an inverted gamma prior for the corresponding variance.(Lonnstedt & Speed., 2002) Thus, after choosing the hyperparameters properly, people derived an explicit expression for the log odds of differentially expressed genes, which was known as B-statistic. (Lonnstedt & Speed, 2002) In contrast, Nott *et al* considered a double tailed exponential prior for effects of the differentially expressed genes and an inverted gamma for the corresponding variance.(Nott et al., 2007) The motivation was that double tailed exponential was heavier on the tails and was related to lasso.(Nott et al., 2007) The proposed linear model was as follows:

$$M\_{\mathfrak{sl}} = \mu\_{\mathfrak{s}} + \varepsilon\_{\mathfrak{sl}^\vee} \tag{10}$$

*Mgj* is the expression value of gene *g* for array *j*. *μ<sup>g</sup>* is the gene specific mean expression value. *εgj* is *N(0, σ<sup>g</sup> 2)* where *σ<sup>g</sup> <sup>2</sup>* is the gene specific variance. All the *σ<sup>g</sup> <sup>2</sup>* are independent. Except the situation where we have infinite sample size, we can only conclude gene *g* is differentially expressed when |*μg*|>*k* and gene g is not differentially expressed when |*μg*|≤ *k*.(Nott et al., 2007) As a predefined cutoff value, *k* depends on the purpose of the experiment and other conditions. Correspondingly, *B*-statistic is then defined as follows to explore whether gene g belongs to the set of differentially expressed gene or not:

$$B(k) = \log \frac{\Pr\left( \left| \mu\_{\mathcal{S}} \right| > k \left| M \right)}{\Pr\left( \left| \mu\_{\mathcal{S}} \right| \le k \left| M \right)}\right) \tag{11}$$

The above represents the log odds ratio of the posterior probability given the data *M*. To implement this hierarchical Bayes procedure, calculation of the posterior probability is necessary. A modified version of MCMC (Markov Chain Monte Carlo) algorithm was proposed, and details of the computation of the above *B*-statistic when the prior distributions are given is discussed by Nott et al. (2007). When *k* is chosen to be 0 and a proper prior allows μg to be exactly 0, an explicit form of B(0) can be developed accordingly.(Lonnstedt & Speed., 2002) People usually use this statistic to rank genes instead of making inferences.(Nott et al., 2007; Lonnstedt & Speed., 2002)

*d. Hypotheses Testing as a Variable Selection Methods:* The t-statistic has already been widely used for hypothesis testing for a long time. Therefore, people try to apply this to microarray study to select appropriately biomarkers on microarray. However, the traditional t-statistic will not work in this situation due to the following reasons: First, hundreds or thousands hypotheses are being tested simultaneously; therefore, the multiple comparison issue exists. However, Bonferroni correction is too conservative and sometimes no gene can pass this vigorous criterion. Thus, suitable adjustment methods need to be developed to further control the overall error. Second, during the microarray analysis, large outliers are frequently observed, and they tend to drive the tstatistic to be large. Similarly, due to large number of tested genes and small number of replicates, the estimated variance for each gene is usually small, which tends to drive the t-statistic to be large.

To meet the first challenge, a new method known as false discovery rates (FDRs) has been proposed. (Benjamini & Hochberg, 1995; Tusher et al., 2001) False discovery rate is defined as the expected proportion of type I error using the available decision rule.(Benjamini & Hochberg, 1995) This method, readily available in *R*, is especially useful for microarray study, since it is easy to compute and not as overly conservative as is Bonferroni adjustment.

The second challenge requires a robust modification to the current version of t-statistic. One of them is known as *ad hoc* modification, which defines the modification by the data. Efron's 90% rule is in this category.(Efron et al., 2000) The modification is to add a constant term to the denominator to prevent the variance in the t-statistic from being too small.(Efron et al., 2001) The constant a0 is defined as the 90th percentile of all the standard errors of the genes.(Efron et al., 2000) Thus the ordinary t-statistic has the following format:

$$S\_{\mathcal{g}} = M\_{\mathcal{g}.} / \left(\mathbf{s}\_{\mathcal{g}} + a\_0\right) \tag{12}$$

*Mg.* denotes the average expression value for gene *g,* and *sg* denotes the corresponding standard deviation. Another example belonging to *ad hoc* modification is the SAM (Significance Analysis of Microarrays).(Tusher et al., 2001) For each gene, the SAM method assigns a score relative based on the changes in expression relative to the standard deviation for the repeated experiments. For genes with scores higher than certain thresholds, a permutation distribution is used to estimate the FDR.(Tusher et al., 2001) This method may be viewed as an empirical Bayes procedure, simply adding a constant each set of genes levels when estimating individual variances. This avoids difficulties when variances are computed from a small number of observations for each gene.(Tusher et al., 2001) This method showed great improvement in gene identification both FDR-wise and fold-wise in terms of the human cell response to the ionizing radiation.(Tusher et al., 2001) Despite of its robustness to individual outliers, the use of this *ad hoc* modification is still limited, and it is challenging to derive and study its theoretical properties.

necessary. A modified version of MCMC (Markov Chain Monte Carlo) algorithm was proposed, and details of the computation of the above *B*-statistic when the prior distributions are given is discussed by Nott et al. (2007). When *k* is chosen to be 0 and a proper prior allows μg to be exactly 0, an explicit form of B(0) can be developed accordingly.(Lonnstedt & Speed., 2002) People usually use this statistic to rank genes

*d. Hypotheses Testing as a Variable Selection Methods:* The t-statistic has already been widely used for hypothesis testing for a long time. Therefore, people try to apply this to microarray study to select appropriately biomarkers on microarray. However, the traditional t-statistic will not work in this situation due to the following reasons: First, hundreds or thousands hypotheses are being tested simultaneously; therefore, the multiple comparison issue exists. However, Bonferroni correction is too conservative and sometimes no gene can pass this vigorous criterion. Thus, suitable adjustment methods need to be developed to further control the overall error. Second, during the microarray analysis, large outliers are frequently observed, and they tend to drive the tstatistic to be large. Similarly, due to large number of tested genes and small number of replicates, the estimated variance for each gene is usually small, which tends to drive

To meet the first challenge, a new method known as false discovery rates (FDRs) has been proposed. (Benjamini & Hochberg, 1995; Tusher et al., 2001) False discovery rate is defined as the expected proportion of type I error using the available decision rule.(Benjamini & Hochberg, 1995) This method, readily available in *R*, is especially useful for microarray study, since it is easy to compute and not as overly conservative as is Bonferroni adjustment. The second challenge requires a robust modification to the current version of t-statistic. One of them is known as *ad hoc* modification, which defines the modification by the data. Efron's 90% rule is in this category.(Efron et al., 2000) The modification is to add a constant term to the denominator to prevent the variance in the t-statistic from being too small.(Efron et al., 2001) The constant a0 is defined as the 90th percentile of all the standard errors of the

*Mg.* denotes the average expression value for gene *g,* and *sg* denotes the corresponding standard deviation. Another example belonging to *ad hoc* modification is the SAM (Significance Analysis of Microarrays).(Tusher et al., 2001) For each gene, the SAM method assigns a score relative based on the changes in expression relative to the standard deviation for the repeated experiments. For genes with scores higher than certain thresholds, a permutation distribution is used to estimate the FDR.(Tusher et al., 2001) This method may be viewed as an empirical Bayes procedure, simply adding a constant each set of genes levels when estimating individual variances. This avoids difficulties when variances are computed from a small number of observations for each gene.(Tusher et al., 2001) This method showed great improvement in gene identification both FDR-wise and fold-wise in terms of the human cell response to the ionizing radiation.(Tusher et al., 2001) Despite of its robustness to individual outliers, the use of this *ad hoc* modification is still limited, and it is

. 0 /( ) *SM sa g gg* (12)

genes.(Efron et al., 2000) Thus the ordinary t-statistic has the following format:

challenging to derive and study its theoretical properties.

instead of making inferences.(Nott et al., 2007; Lonnstedt & Speed., 2002)

the t-statistic to be large.

Accordingly, a penalized likelihood version of t-statistic has been proposed and implemented. For example, Wu (2005) proposed another modified t-statistics, taking advantage of both SAM and lasso methods. The method works as follows: assume the linear regression situation,

$$\alpha\_{ij} = \beta\_0 + y\_j + \varepsilon\_j \tag{13}$$

where *xij* represents the expression of gene *i* on array *j; yj* is the indicator, whether the *jth* sample belongs to the control or treatment group. A *t*-statistic or *F*-statistic can be developed.(Wu, 2005) The test statistic involves ordinary between/within group sum of squares, both of which can be penalized like in lasso regression.(Wu, 2005) The test statistic in this scenario can be derived as:

$$t\_i^\* = \text{sign}(\overline{\mathbf{x}}\_{i1} - \overline{\mathbf{x}}\_{i2}) \frac{\left( \left| \overline{\mathbf{x}}\_{i1} - \overline{\mathbf{x}}\_{i2} \right| - \lambda \right)\_+}{\sqrt{n / \left( n\_1 n\_2 \right) s\_i^2 + \lambda^2 \left/ \left( n - 2 \right)}}\tag{14}$$

$$F\_i^\* = \frac{\left(\left|\overline{\mathbf{x}}\_{i1} - \overline{\mathbf{x}}\_{i2}\right|^2 - \lambda^2\right)\_+}{n / \left(n\_1 n\_2\right) s\_i^2 + \lambda^2 \ / \left(n - 2\right)}\tag{15}$$

Tusher's SAM statistic is as follows:

$$d\_i = \frac{\overline{\boldsymbol{\varpi}}\_{i1} - \overline{\boldsymbol{\varpi}}\_{i2}}{\mathbf{s}\_0 + \mathbf{s}\_1 \sqrt{n/(n\_1 n\_2)}} \tag{16}$$

Comparing the formulas (14—16), we can see that the penalized *F* or *t* statistic can be viewed as a special version of SAM, since the term λ2/ (n-2) coincides with the constant s0 in SAM statistic, which helps to stabilize the variance.(Wu, 2005) Furthermore, Wu showed that FDR can be calculated by permutation and then a cutoff can be put on the test statistic.(Wu, 2005) What makes the penalized SAM statistic superior to the ordinary SAM statistic is that penalized SAM statistic is derived rigorously from the situation of linear model and thus easier to develop its theoretical properties.(Wu, 2005) Through applications, this statistic also shown good performance.(Wu, 2005)

Another modified *t*-statistic is refined for a statistical model assuming both multiplicative and additive errors.(Ideker et al., 2000) The parameters within the model are subsequently estimated using maximum likelihood method with all the observations.(Ideker et al., 2000) Subsequently, a traditional maximum likelihood ratio test for each individual gene is carried out to identify the significance of the intensities.(Ideker et al., 2000) In some examples, this method can be shown superior to the simple fold approach.(Ideker et al., 2000) However, this method is naïve and has potential limitation as follows: first, the author does not use any multiple comparison adjustment techniques when performing multiple tests on thousands of genes simultaneously; this may be corrected by introducing the traditional FDR. Second, the author used chi-square as the distribution of -2\*ln(likelihood ratio), which may not hold for small sample size.(Ideker et al., 2000) A more suitable distribution needs to be derived accordingly.

One more example of a "modified" t-statistic is derived from a Bayesian approach, which has become popular in statistics. For example, the B-statistic, the log odd ratio of the posterior probability can be viewed as a Bayesian version of t-statistic.(Lonnstedt & Speed, 2002) Another example of "moderated" *t*-statistic is proposed by Smyth.(Smyh, 2004) It assumes a scale inverse chi square prior for the variances of the genes.(Smyh, 2004) Additionally, the parameters can be estimated using Bayesian method and the 'moderated' t statistic is obtained by substituting the corresponding variance with their estimate.(Smyh, 2004) Cui *et al* propose another modified t-statistic using similar approach.(Cui et al., 2005) First, they performed a simulation from a chi-square distribution, whose degrees of freedom depends on the sample size to estimate the variance. Then they derive a bias-corrected Stein estimator on the log scale.(Cui et al., 2005) Thus, this estimator is more robust since the shrinkage in the variance makes the estimator of variance more robust.

As we can see, the main drawback of "moderated t" is that it depends on a particular type of distribution. When the distribution assumption is not satisfied, these estimators will be inefficient and often lead to false inferences. This inspires the birth of the distribution free 'shrinkage t' statistic.(Opgen-Rhein & Strimer, 2007) The main idea behind this is shrinking the empirical variance of each gene towards the common median of all the variance.(Opgen-Rhein & Strimer, 2007) For each group, the ordinary variance is replaced by the corresponding shrinkage variance in the test statistic:

$$t\_k^\* = \frac{\overline{\chi}\_{k1} - \overline{\chi}\_{k2}}{\sqrt{\frac{\upsilon\_{k1}^\*}{n\_1} + \frac{\upsilon\_{k2}^\*}{n\_2}}}\tag{17}$$

For the above formula, n1 and n2 are sample size for group 1 and 2, respectively. *v*\* stands for the corresponding shrinkage variance estimator. Here, each empirical variance is shrinking towards the median, which is shown to be more efficient or robust than shrinking towards the mean or zero.(Opgen-Rhein & Strimer, 2007) We can also view the "shrinkage t" as a combination of a standard t statistic and the fold change statistic.(Opgen-Rhein & Strimer, 2007) Another feature is that the "shrinkage t" belongs to the James-Stein estimator, not relying on any explicit prior distribution assumption and its theoretical property will be easily derived. Furthermore, this method is computationally efficient and the corresponding gene ranking is consistent with other tests.(Opgen-Rhein & Strimer, 2007)

*Shrunken centroid method and SCOOP:* From a different point of view, Tibshirani developed the shrunken centroid method for biomarker identification.(Davies & Bromage, 2002; Tibshirani et al., 2005) For each gene within each group (i.e. treatment group or control group), the overall mean and the group means are calculated. The group means are shrunk toward the overall mean iteratively for each gene.(Davies & Bromage, 2002; Tibshirani et al., 2005) The shrunken values are used to rank the genes and the cutoff is chosen by crossvalidation. (Davies & Bromage, 2002; Tibshirani et al., 2005) The shrunken values can be also used to form a classifier and the authors used this method to classify the cancer conditions. (Davies & Bromage, 2002; Tibshirani et al., 2005) Despite the successes this method has achieved, it has one potential drawback: information about correlation among genes is distorted or lost during successive shrinkage, and, therefore, the identified genes may appear falsely to be independent of each other. Based on this method, Liu *et al.* (2009) developed an improved version of shrinkage centroid method: SCOOP (**S**hrunken **C**entroid **O**rthogonal **O**rdering **P**rojection) to extend to the cases with correlation variables. Instead of

posterior probability can be viewed as a Bayesian version of t-statistic.(Lonnstedt & Speed, 2002) Another example of "moderated" *t*-statistic is proposed by Smyth.(Smyh, 2004) It assumes a scale inverse chi square prior for the variances of the genes.(Smyh, 2004) Additionally, the parameters can be estimated using Bayesian method and the 'moderated' t statistic is obtained by substituting the corresponding variance with their estimate.(Smyh, 2004) Cui *et al* propose another modified t-statistic using similar approach.(Cui et al., 2005) First, they performed a simulation from a chi-square distribution, whose degrees of freedom depends on the sample size to estimate the variance. Then they derive a bias-corrected Stein estimator on the log scale.(Cui et al., 2005) Thus, this estimator is more robust since the

As we can see, the main drawback of "moderated t" is that it depends on a particular type of distribution. When the distribution assumption is not satisfied, these estimators will be inefficient and often lead to false inferences. This inspires the birth of the distribution free 'shrinkage t' statistic.(Opgen-Rhein & Strimer, 2007) The main idea behind this is shrinking the empirical variance of each gene towards the common median of all the variance.(Opgen-Rhein & Strimer, 2007) For each group, the ordinary variance is replaced by the

> \* 1 2 \* \* 1 2 1 2

For the above formula, n1 and n2 are sample size for group 1 and 2, respectively. *v*\* stands for the corresponding shrinkage variance estimator. Here, each empirical variance is shrinking towards the median, which is shown to be more efficient or robust than shrinking towards the mean or zero.(Opgen-Rhein & Strimer, 2007) We can also view the "shrinkage t" as a combination of a standard t statistic and the fold change statistic.(Opgen-Rhein & Strimer, 2007) Another feature is that the "shrinkage t" belongs to the James-Stein estimator, not relying on any explicit prior distribution assumption and its theoretical property will be easily derived. Furthermore, this method is computationally efficient and the corresponding

*Shrunken centroid method and SCOOP:* From a different point of view, Tibshirani developed the shrunken centroid method for biomarker identification.(Davies & Bromage, 2002; Tibshirani et al., 2005) For each gene within each group (i.e. treatment group or control group), the overall mean and the group means are calculated. The group means are shrunk toward the overall mean iteratively for each gene.(Davies & Bromage, 2002; Tibshirani et al., 2005) The shrunken values are used to rank the genes and the cutoff is chosen by crossvalidation. (Davies & Bromage, 2002; Tibshirani et al., 2005) The shrunken values can be also used to form a classifier and the authors used this method to classify the cancer conditions. (Davies & Bromage, 2002; Tibshirani et al., 2005) Despite the successes this method has achieved, it has one potential drawback: information about correlation among genes is distorted or lost during successive shrinkage, and, therefore, the identified genes may appear falsely to be independent of each other. Based on this method, Liu *et al.* (2009) developed an improved version of shrinkage centroid method: SCOOP (**S**hrunken **C**entroid **O**rthogonal **O**rdering **P**rojection) to extend to the cases with correlation variables. Instead of

*x x <sup>t</sup> v v n n*

*k*

gene ranking is consistent with other tests.(Opgen-Rhein & Strimer, 2007)

*k k*

*k k*

(17)

shrinkage in the variance makes the estimator of variance more robust.

corresponding shrinkage variance in the test statistic:

shrinking along the natural axes (Tibshirani et al., 2005), which ignores the potential linkage between variables, SCOOP rotates the axis and shrinks the group means in the direction preserving the least correlation information of variables. The algorithm of SCOOP is as follows: With the input of group information and the gene expression values from microarrays, two matrices are further identified: Between Epoch Covariance Matrix, containing all the variation between different groups, and Within Epoch Covariance Matrix, containing the variation originating from replication. Then, the eigenvalues and eigenvectors for both Between Epoch Covariance Matrix and Within Epoch Covariance Matrix are calculated by spectral decomposition. Since we have small number of samples and large number of variables (i.e. the genes) for microarray studies, both Between Epoch Covariance Matrix and Within Epoch Covariance Matrix are going to be highly singular. The union of the eigenvectors of the Between Epoch Covariance Matrix and Within Epoch Covariance Matrix with nonzero eigenvalues will form the basis functions of the new space (known as the Augmented Discriminant Space). For each gene, the group mean expression is shrunk towards the overall mean along the direction orthogonal to the Augmented Discriminant Space until the group means coincide, at which point that gene is eliminated from consideration. The amount of shrinkage needed for each gene is considered as its measure of importance. The above algorithm is carried out individually for each gene, producing a ranking of genes according to the importance measure. SCOOP has been successfully applied to identify biomarkers responsible for female rainbow trout reproductive cycle.(Liu, 2009, 2011)

#### **2.3.2 Introduction to basic microarray time course analysis**

Due to the decreasing cost of microarrays, their use in time course analysis has become ever more popular. The corresponding analysis is more challenging statistically than the two sample microarray situation. The time course may be longitudinal (where the mRNA samples for different time points are taken from the same individual), or cross-sectional (where the mRNA samples are extracted from different individual).(Tai & Speed, 2005) As a result, gene expression tends to be correlated for the longitudinal study or a design used for the cross-sectional study using a common reference. In addition, usually only 5-10 time points are available. Therefore, the traditional time series model cannot deal with such small series. This will require the development of new methods for analysis.

Typically, researchers are interested in identifying the genes whose expressions change over time. In the one-sample problem, some genes' patterns vary according to a common pattern. In the two-sample problem, we need to identify genes whose temporal changes differ under two or more biological conditions.(Tai & Speed, 2006)

One popular method typically used is a regression model. As an example, maSigPro belongs to this category and is available in *R*. (www.bioconductor.org) To find significantly different genes for two or more biological conditions, maSigPro first builds a global regression model with different experiment conditions acting as dummy variables.(Conesa et al., 2006) Then the significance of the estimated parameters in the model was tested to assess the significant differences between gene time course profiles.(Conesa et al., 2006)

Another method for microarray time course analysis is via ANOVA and the F-statistic. The classical ANOVA and mix-effect ANOVA models are used for cross-sectional and longitudinal study, respectively.(Diggle et al., 2002; Neter et al., 1996; Tai & Speed, 2005) For the one-sample problem, time is treated as one factor. Thereafter, the corresponding F-statistic is calculated.(Tai & Speed, 2005) Moreover, this method can be extended to the situation with multiple experiment conditions. Time, experiment condition and potentially their interaction are included for this model. An example of the classical ANOVA in time course study is available in (Wang & Kim, 2003). Also the multiple comparison adjustment for testing error is discussed for this method.(Ge & Speed, 2003)

Later, a robust version of ANOVA approach was proposed by Park *et al*, since it does not require the normality assumption.(Park et al., 2003) Similar to a two-way ANOVA model which includes time, biological conditions and their interaction as factors, genes that are concluded insignificant in this model will be reanalyzed in the same ANOVA model without the interaction term. (Park et al., 2003) Genes that are concluded significant in both models are chosen.(Park et al., 2003) Another modified ANOVA method is the ANOVA-SCA (analysis of variance-simultaneous component analysis), which takes into consideration about the correlation structure of the measured variables.(Nueda et al., 2007) Basically, principle component analysis is used to the estimated parameters of each source of variation in the ANOVA model.(Nueda et al, 2007) One advantage of this method is that it utilizes information from the experiment design and takes into consideration about correlation among the each source of variability associated with experimental factors. To identify the differentially expressed genes, the authors proposed another criterion for ANOVA-SCA: the mixture of leverage and SPE (square prediction error).(Nueda et al., 2007) Leverage quantifies how much a particular gene contributes to the multivariate ANOVA-SCA model, while SPE evaluates the goodness of fit of the model to a particular gene.(Nueda et al., 2007) The potential test statistic is and its p-value are obtained with reference to a weighted χ2 distribution.(Box, 1954) Nonetheless, the drawback of this method is that it does not use the actual time scale and direct smoothing cannot be applied. Besides, this method cannot be used when the time course points are irregular.

In summary, the ANOVA and the corresponding modified versions offer substantial advantages: they can separate variation due to each different factor, therefore, removing the non-random effects and reducing the potential noise within the data.(Box, 1954) However, there are two innate limitations: First, it assumes independent among different time points ignoring the potential correlation; second, the small number of replicates leads to unstable estimation of gene-specific variance, leading to big value of within time F-statistics even for genes with just small amount of changes. This leads to high false positive rates.(Tai & Speed, 2005) In addition, some differentially expressed genes may have outliers which tend to cause low F-statistic, resulting in false negative rates.(Tai & Speed, 2005) Thus, the idea of moderation is introduced.

To reduce the false positive rate or false negative rate, the gene-specific variance is shrinking towards a common value estimated from the whole gene set, known as moderation.(Tai & Speed, 2005) One example about the application of moderation to microarray time course is performed by Tai and Speed.(Tai & Speed, 2006) They derived the *MB*- and <sup>2</sup> *T* -statistic for one-sample or two-sample problem in the scenario of longitudinal microarray time course study, taking into consideration about the correlation across times. In detail, *MB*-statistic is the log 10 of the posterior odds whether the null or alternative hypothesis is true. When the number of replicates is equal for all genes, the *MB*-statistic under the null hypothesis is

longitudinal study, respectively.(Diggle et al., 2002; Neter et al., 1996; Tai & Speed, 2005) For the one-sample problem, time is treated as one factor. Thereafter, the corresponding F-statistic is calculated.(Tai & Speed, 2005) Moreover, this method can be extended to the situation with multiple experiment conditions. Time, experiment condition and potentially their interaction are included for this model. An example of the classical ANOVA in time course study is available in (Wang & Kim, 2003). Also the multiple comparison adjustment

Later, a robust version of ANOVA approach was proposed by Park *et al*, since it does not require the normality assumption.(Park et al., 2003) Similar to a two-way ANOVA model which includes time, biological conditions and their interaction as factors, genes that are concluded insignificant in this model will be reanalyzed in the same ANOVA model without the interaction term. (Park et al., 2003) Genes that are concluded significant in both models are chosen.(Park et al., 2003) Another modified ANOVA method is the ANOVA-SCA (analysis of variance-simultaneous component analysis), which takes into consideration about the correlation structure of the measured variables.(Nueda et al., 2007) Basically, principle component analysis is used to the estimated parameters of each source of variation in the ANOVA model.(Nueda et al, 2007) One advantage of this method is that it utilizes information from the experiment design and takes into consideration about correlation among the each source of variability associated with experimental factors. To identify the differentially expressed genes, the authors proposed another criterion for ANOVA-SCA: the mixture of leverage and SPE (square prediction error).(Nueda et al., 2007) Leverage quantifies how much a particular gene contributes to the multivariate ANOVA-SCA model, while SPE evaluates the goodness of fit of the model to a particular gene.(Nueda et al., 2007) The potential test statistic is and its p-value are obtained with reference to a weighted χ2 distribution.(Box, 1954) Nonetheless, the drawback of this method is that it does not use the actual time scale and direct smoothing cannot be applied. Besides,

In summary, the ANOVA and the corresponding modified versions offer substantial advantages: they can separate variation due to each different factor, therefore, removing the non-random effects and reducing the potential noise within the data.(Box, 1954) However, there are two innate limitations: First, it assumes independent among different time points ignoring the potential correlation; second, the small number of replicates leads to unstable estimation of gene-specific variance, leading to big value of within time F-statistics even for genes with just small amount of changes. This leads to high false positive rates.(Tai & Speed, 2005) In addition, some differentially expressed genes may have outliers which tend to cause low F-statistic, resulting in false negative rates.(Tai & Speed, 2005) Thus, the idea of

To reduce the false positive rate or false negative rate, the gene-specific variance is shrinking towards a common value estimated from the whole gene set, known as moderation.(Tai & Speed, 2005) One example about the application of moderation to microarray time course is performed by Tai and Speed.(Tai & Speed, 2006) They derived the *MB*- and <sup>2</sup> *T* -statistic for one-sample or two-sample problem in the scenario of longitudinal microarray time course study, taking into consideration about the correlation across times. In detail, *MB*-statistic is the log 10 of the posterior odds whether the null or alternative hypothesis is true. When the number of replicates is equal for all genes, the *MB*-statistic under the null hypothesis is

for testing error is discussed for this method.(Ge & Speed, 2003)

this method cannot be used when the time course points are irregular.

moderation is introduced.

supposed to have the expected profile equal to 0 in one-sample case or equal expected profiles in two-sample scenario.(Tai & Speed, 2006) Then the form of *MB*-statistic becomes a monotonic increasing function in <sup>2</sup> *T* . <sup>2</sup> *T* -statistic is *t t*' where *t* is the moderated multivariate t-statistic in the form of 1/2 1/2 *tnS X* .(Tai& Speed, 2006). ( 1) 1 *n S <sup>S</sup> n* 

where S represents the gene-specific variance-covariance matrices, *X* is the gene-specific average time course vector. n represents the number of replicates. The other two parameters ν and Λ can be estimated from all the genes. Both of the two statistic are derived when we assume independent and identical inverse Wishart priors to the gene-specific covariance matrices.(Tai & Speed, 2006) The advantage of this method comes from the incorporation of the information about the correlation structure, moderation and replication.(Tai & Speed, 2006) In addition, this statistic outperforms the ordinary F-statistic, due to moderation in empirical Bayes framework.(Tai & Speed, 2006) This procedure is shown to be very effective in false positive or negative rate reduction.(Tai & Speed, 2005) Thus, this procedure is incorporated in the Bioconductor "timecourse" package in *R*. (http:/ /www. bioconductor.org/packages/2.3/bioc/vignettes/timecourse/inst/doc/timecourse.pdf). The drawback of this method is modeling each gene independently, ignoring the latent genes pathway network and making no use of the actual time scale.

Another method that is used similar idea to estimate the unstable variance robustly and incorporate correlation in the study is based on the likelihood-based approach. Guo *et al*. develop a test based on the Wald statistic for one-sample longitudinal data.(Guo et al., 2003) This method adds a positive number to each diagonal element in the denominator matrix to incorporate the idea of moderation and stabilize the estimation of the variance.

$$w(i) = \left[L\hat{\beta}(i)\right]^T \left[L\hat{V}\_s(i)L^T + \mathcal{X}\_{o}I\_{r \times r}\right]^{-1} \left[L\hat{\beta}(i)\right] \tag{18}$$

In the above formula, L represents a matrix with dimension r X p, ˆ represents the p X 1 regression parameters estimation and *Vs*ˆ is the corresponding estimated variancecovariance matrix. is an estimated positive scalar to prevent inverting a highly singular matrix.(Guo et al., 2003) However, the limitation of this method is that it is only suitable for one-sample problem and using the asymptotic theory will not be suitable for small number of replicates.

Despite of the popularity of the above method, they all ignore one important fact in time course study: They do not make use of the time points dynamically. This is the reason to introduce B-splines or wavelets to model the gene temporal expression profiles. Natural Bsplines are piecewise cubic polynomials, which are smoothly connected at knots. It can describe the complicated gene expression patterns over time, since the linear combination of a series of basis functions can mimic any temporal profiles for genes. Each basis function can be thought as the potential expression pattern locally (i.e. the basis function will be zero outside certain time range). Comparing with methods that do not utilize time scale directly, B-splines have many advantages: reduce the noise, assuming only smoothing changes occur with time; use the actual time taken for the samples, easy to adapt for schedules with irregular time points; As an example, Bar-Joseph *et al.* present an algorithm to characterize the expression pattern of each gene by a continuous curve fitted by B-splines.(Bar-Joseph et al., 2003) They constrain the spline coefficients of genes within the same class so that their expression profiles can vary similarly. Thus, the gene expression pattern can be viewed dynamically. Comparing with previous methods, the reconstruction of the gene timecourse has 10-15% less error for those points that are not observed.(Bar-Joseph et al., 2003) Another approach proposed by Hong and Li is to solve the two-sample problem with B-splines adaption.(Hong & Li, 2006) In details, to identify biomarkers whose expression profiles are different under multiple biological conditions, linear combinations of basis functions are used to create smooth gene expression timecourse. The Markov chain Monte Carlo EM algorithm (MCEM) can be used to estimate the gene-specific parameters and hyperparameters from the hierarchical model. The genes are chosen using the empirical Bayes log posterior odds and the posterior probability based FDR.(Hong & Li, 2006) As a result, this method outperforms the traditional ANOVA model and is suitable for long time course data. Another example developed by Storey *et al.* and denoted as EDGE (**E**xtraction of **D**ifferential **G**ene **E**xpression) is also widely used for microarray timecourse study.(Storey et al., 2005) It estimates the coefficients of a B-spline function to fit the timecourse for each gene, and test whether all the coefficients are zero or not by an Fstatistic. If all the coefficients are zero, the genes are not differentially expressed. Q-value based on false discovery rate(FDR) is calculated for each individual gene to offer a suitable cutoff value.(Storey et al., 2005) This method is an example to combine B-splines with the hypothesis testing, using FDR to control the error rate. Therefore, this method is superior to other methods. However, this method does not use the correlation information between variables (i.e. genes) and needs improvement. Thus SCOOP in combination with B-spline offers an alternative for biomarker identification for microarray timecourse study. (Liu, 2009, 2011)

When the situation of multiple biological condition in microarray timecourse study is encountered, Yuan *et al.* develop a hidden Markov model approach.(Yuan & Kendziorski, 2006) For this method, the authors consider all possibilities of equality and inequality for all the means among the different biological conditions and the expression pattern process is modelled as a Markov chain.(Yuan & Kendziorski, 2006) These biological conditions are referred as states. Thus, the observations are conditionally independent given the state of the chain. In summary, this method can monitor the expression pattern for each gene and the observations at different time points may be dependent on each other. The differentially expressed genes are then selected based on the posterior probabilities of states of interest.(Yuan & Kendziorski, 2006) Moreover, it is suggested that the associated posterior probability is useful to cluster genes.(Yuan & Kendziorski, 2006)

#### **2.3.3 What is next?**

Although the microarray technology has lead to big breakthroughs in biology, there is one innate drawback in this technique: since all the sequence information about genes incorporated into the probes needs to be known *a priori*, the microarray can only obtain fixed and partially information about gene variants within the cell. This limitation requires the development of new techniques to gain the information for all the gene alleles simultaneously. Therefore, the next generation sequencing technique gains popularity and may be consequently lead to more informative microarrays. The first generation sequencing

al., 2003) They constrain the spline coefficients of genes within the same class so that their expression profiles can vary similarly. Thus, the gene expression pattern can be viewed dynamically. Comparing with previous methods, the reconstruction of the gene timecourse has 10-15% less error for those points that are not observed.(Bar-Joseph et al., 2003) Another approach proposed by Hong and Li is to solve the two-sample problem with B-splines adaption.(Hong & Li, 2006) In details, to identify biomarkers whose expression profiles are different under multiple biological conditions, linear combinations of basis functions are used to create smooth gene expression timecourse. The Markov chain Monte Carlo EM algorithm (MCEM) can be used to estimate the gene-specific parameters and hyperparameters from the hierarchical model. The genes are chosen using the empirical Bayes log posterior odds and the posterior probability based FDR.(Hong & Li, 2006) As a result, this method outperforms the traditional ANOVA model and is suitable for long time course data. Another example developed by Storey *et al.* and denoted as EDGE (**E**xtraction of **D**ifferential **G**ene **E**xpression) is also widely used for microarray timecourse study.(Storey et al., 2005) It estimates the coefficients of a B-spline function to fit the timecourse for each gene, and test whether all the coefficients are zero or not by an Fstatistic. If all the coefficients are zero, the genes are not differentially expressed. Q-value based on false discovery rate(FDR) is calculated for each individual gene to offer a suitable cutoff value.(Storey et al., 2005) This method is an example to combine B-splines with the hypothesis testing, using FDR to control the error rate. Therefore, this method is superior to other methods. However, this method does not use the correlation information between variables (i.e. genes) and needs improvement. Thus SCOOP in combination with B-spline offers an alternative for biomarker identification for microarray timecourse study. (Liu,

When the situation of multiple biological condition in microarray timecourse study is encountered, Yuan *et al.* develop a hidden Markov model approach.(Yuan & Kendziorski, 2006) For this method, the authors consider all possibilities of equality and inequality for all the means among the different biological conditions and the expression pattern process is modelled as a Markov chain.(Yuan & Kendziorski, 2006) These biological conditions are referred as states. Thus, the observations are conditionally independent given the state of the chain. In summary, this method can monitor the expression pattern for each gene and the observations at different time points may be dependent on each other. The differentially expressed genes are then selected based on the posterior probabilities of states of interest.(Yuan & Kendziorski, 2006) Moreover, it is suggested that the associated posterior

Although the microarray technology has lead to big breakthroughs in biology, there is one innate drawback in this technique: since all the sequence information about genes incorporated into the probes needs to be known *a priori*, the microarray can only obtain fixed and partially information about gene variants within the cell. This limitation requires the development of new techniques to gain the information for all the gene alleles simultaneously. Therefore, the next generation sequencing technique gains popularity and may be consequently lead to more informative microarrays. The first generation sequencing

probability is useful to cluster genes.(Yuan & Kendziorski, 2006)

2009, 2011)

**2.3.3 What is next?** 

is accredited to Frederick Sanger in 70s. (Sanger et al, 1977) Before the development of Sanger's chain-terminator method, Maxam and Gilbert used toxic chemicals to modify the bases, inferring the sequence of DNA fragment.(Maxam & Glibert, 1977) Sanger's chainterminator method gained popularity since the method involved less toxicity. The key of the chain-terminator method is the dideoxynucleotide triphosphates(ddNTPs) to terminate the DNA chain elongation. To sequence a particular DNA fragement, the DNA template, primer, DNA polymerase and deoxynucleotidephosphates(dNTPs) is split into four separate reactions with the addition of only one of the four radioactively or fluorescently labelled dideoxynucleotides (ddATP, ddGTP, ddCTP or ddTTP) in the four reactions.(Sanger et al., 1977) Therefore, during the elongation, ddNTPs are incorporated into some of the strands, leading to DNA fragments that have varying length. These fragments can be separated using gel electrophoresis and the relative position of the band on the gel be used to determine the base identity.(Sanger et al., 1977)

Fig. 2. The Sanger's chain-terminator method. For a fragment of DNA, the sample is split into four reactions containing dNTP, polymerase. Each reaction is supplemented with one type of ddNTP, serving as the chain terminator. In the above figure, we show only one reaction: the dNTP is depicted as tubes and ddCTP is depicted as triangles. The ddCTP terminates the reaction upon the addition of the ddCTP. The other three reactions form similar ladders and the sequences can be detected based on their relative position on the gel after gel electrophoresis.

In recent years, instead of using one fluorescence dye and four reactions, four different fluorescence dyes with unique emission wavelength will be used in a single reaction. Then the dye reader can automatically read the base identity after capillary electrophoresis. Readers interested in this technique can read the user's manual for ABI PRISM® 373 DNA Sequencer manual available at http:/ /www3.appliedbiosystems.com /cms/groups/ mcb\_ support/documents/generaldocuments/cms\_041831.pdf.

Therefore, the previous sequencing technique is laborious and time consuming. The current biological studies require more efficient ways to sequence. This is the motivation for next generation sequencing development.

The first step for the high-throughput sequencing is to prepare a template. In this step, genomic DNA is randomly split into small pieces to construct fragment template.(Metzker, 2010) When the genomic DNA is first circularized by ligation and then split into small fragments, this is known as mate-pair template, which has advantages over fragment template in alignment.(Metzker, 2010) However, due to the reason that single fluorescence event is hard to detect, the templates need to be amplified. Emulsion PCR by Roche and bridge PCR by illumina are introduced here. The sheared genomic DNA will be ligated with adaptors containing the same fragment.(Metzker, 2010) This allows the amplication of the DNA fragment using common PCR primer. For emulsion PCR, the water droplet containing the bead-DNA complex, primer, polymerase, and dNTP will be used to perform PCR amplification. Numerous droplets are created by emulsifying the oil-aqueous mixture, allowing all the genomic DNA fragments to be amplified simultaneously.(Metzker, 2010) Another popular amplification method is bridge PCR. Bridge PCR has two steps: initial priming and extension of the template. The genomic DNA fragments with adaptors at both ends will be immobilized and bent over to form a bridge. Subsequently, the DNA molecules will be amplified to form clusters.(Metzker, 2010) Despite the great success, the amplification procedure is time consuming and complicated. Moreover, AT or GC-rich sequences may be biased during the amplification. (Metzker, 2010) Therefore, singlemolecule templates technique which involves the immobilization of primer, template, or polymerase has become popular.(Metzker, 2010) The readers interested in this can obtain more details about this technique in Metzker 2010.

Fig. 3. The illustration of the emulsion PCR and bridge PCR. For emulsion PCR, the aqueous droplets can be created by emulsion in the oil water mixture. Then the template can be amplified with the primers within the bead. In the end, thousands of DNA fragments containing identical sequences to the template will be available within one bead for each aqueous droplet. For bridge PCR, the template is immobilized and bridge amplified to form a cluster.

Sequencing and imaging step follows the above amplification step. The four colour reversible termination method by illumina is introduced here. Right after the template clusters are obtained, the four nucleotides labelled with distinct fluorescence dye will be incorporated according to the template sequence and the elongation step halts upon the addition of fluorescence labelled nucleotide. Upon total internal reflection fluorescence imaging, TCEP (tris(2-carboxyethyl)phosphine) will be used to cleave the fluorescence dye and 3'-inhibitor to allow the next cycle of elongation. This process is iterated until the identities of all the nucleotides are known.(Metzker, 2010) The sequencing process for Roche/454 is called pyrosequencing, which uses a different mechanism for sequencing: following the emulsified PCR, the DNA-amplified beads are loaded into PTP (PicoTiterPlates) wells. Subsequently, this method allows the polymerase to add only one particular type nucleotide with the release of pyrophosphate. The pyrophosphate will then be converted with the emission of light by a

adaptors containing the same fragment.(Metzker, 2010) This allows the amplication of the DNA fragment using common PCR primer. For emulsion PCR, the water droplet containing the bead-DNA complex, primer, polymerase, and dNTP will be used to perform PCR amplification. Numerous droplets are created by emulsifying the oil-aqueous mixture, allowing all the genomic DNA fragments to be amplified simultaneously.(Metzker, 2010) Another popular amplification method is bridge PCR. Bridge PCR has two steps: initial priming and extension of the template. The genomic DNA fragments with adaptors at both ends will be immobilized and bent over to form a bridge. Subsequently, the DNA molecules will be amplified to form clusters.(Metzker, 2010) Despite the great success, the amplification procedure is time consuming and complicated. Moreover, AT or GC-rich sequences may be biased during the amplification. (Metzker, 2010) Therefore, singlemolecule templates technique which involves the immobilization of primer, template, or polymerase has become popular.(Metzker, 2010) The readers interested in this can obtain

Fig. 3. The illustration of the emulsion PCR and bridge PCR. For emulsion PCR, the aqueous droplets can be created by emulsion in the oil water mixture. Then the template can be amplified with the primers within the bead. In the end, thousands of DNA fragments containing identical sequences to the template will be available within one bead for each aqueous droplet. For bridge

Sequencing and imaging step follows the above amplification step. The four colour reversible termination method by illumina is introduced here. Right after the template clusters are obtained, the four nucleotides labelled with distinct fluorescence dye will be incorporated according to the template sequence and the elongation step halts upon the addition of fluorescence labelled nucleotide. Upon total internal reflection fluorescence imaging, TCEP (tris(2-carboxyethyl)phosphine) will be used to cleave the fluorescence dye and 3'-inhibitor to allow the next cycle of elongation. This process is iterated until the identities of all the nucleotides are known.(Metzker, 2010) The sequencing process for Roche/454 is called pyrosequencing, which uses a different mechanism for sequencing: following the emulsified PCR, the DNA-amplified beads are loaded into PTP (PicoTiterPlates) wells. Subsequently, this method allows the polymerase to add only one particular type nucleotide with the release of pyrophosphate. The pyrophosphate will then be converted with the emission of light by a

PCR, the template is immobilized and bridge amplified to form a cluster.

more details about this technique in Metzker 2010.

series of reactions.(Ronaghi et al., 1998) This is further recorded by a charge-coupled device camera. The order of the light emission will be used to induce the sequence. This mechanism is totally different from the reversible termination method and it does not require the use of modified dNTP to halt the elongation process.(Metzker, 2010)

Fig. 4. The illustration of the reverse terminator sequencing and pyrosequencing. For reverse terminator sequencing, different dNTP is labelled with different fluorescence dye. Upon addition of each dNTP, the reaction halts and the fluorescence is the recorded. Then the terminator and fluorescence dye of dNTP is cleaved. Subsequently, the next dNTP is incorporated and the whole process is iterated. For pyrosequencing, one single dNTP flows through with the addition of the nucleotide in the corresponding position. The release of the pyrophosphate will undergo enzymatic reaction to produce light. Therefore, the camera will record which of the fragments has this dNTP at its current position.

The next generation sequencing reads will subsequently need to be aligned to the reference genome or assembled *de novo*.(Chaisson et al., 2009; Pop & Salzberg, 2008; Trapnell & Salzberg, 2009) There are several challenges for the genome assembly and alignment besides cost and effort: First, some reads may not be aligned to reference genome due to the structural variant (e.g. deletion or insertion).(Metzker, 2010) Second, some reads are difficult to align to the highly repetitive regions.(Metzker, 2010) *de novo* assembly will be complicated for large genome although some successes are reported.(Butler *et al*, 2008; Hernandez et al., 2008; Zerbino & Birney, 2008)

Although there are so many challenges, this field is still undergoing rapid development and will play a main role in the personal genome era and personalized medicine field. The gigantic information from the next generation sequencing studies will require the collaboration between biologists, bioinformaticians, and biostatisticians. What we envision are more and more big breakthroughs in the field of life science.

#### **3. Conclusion**

In summary, we presented a detailed overview of microarray studies. We introduced the mechanism, the associated statistical analysis, and the potential substitution for microarraynext generation sequencing. Several examples of microarray studies to identify biomarkers are also presented. We hope this chapter can serve as a guide for beginners in the field of biomarker identification and drug discovery.

#### **4. Acknowledgment**

The authors thank NCI NIH HHS for the support of Grant R01 CA095568/ and NSF for the support of Grant DMS-0540693. The authors also thank the editor and reviewers for their constructive comments.

#### **5. References**


The next generation sequencing reads will subsequently need to be aligned to the reference genome or assembled *de novo*.(Chaisson et al., 2009; Pop & Salzberg, 2008; Trapnell & Salzberg, 2009) There are several challenges for the genome assembly and alignment besides cost and effort: First, some reads may not be aligned to reference genome due to the structural variant (e.g. deletion or insertion).(Metzker, 2010) Second, some reads are difficult to align to the highly repetitive regions.(Metzker, 2010) *de novo* assembly will be complicated for large genome although some successes are reported.(Butler *et al*, 2008; Hernandez et al.,

Although there are so many challenges, this field is still undergoing rapid development and will play a main role in the personal genome era and personalized medicine field. The gigantic information from the next generation sequencing studies will require the collaboration between biologists, bioinformaticians, and biostatisticians. What we envision

In summary, we presented a detailed overview of microarray studies. We introduced the mechanism, the associated statistical analysis, and the potential substitution for microarraynext generation sequencing. Several examples of microarray studies to identify biomarkers are also presented. We hope this chapter can serve as a guide for beginners in the field of

The authors thank NCI NIH HHS for the support of Grant R01 CA095568/ and NSF for the support of Grant DMS-0540693. The authors also thank the editor and reviewers for their

Aarøe, J.;Lindahl, T.; Dumeaux, V.; Sæbø, S.; Tobin, D.; Hagen, N.; Skaane, P.; Lönneborg,

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. *Second International Symposium on Information Theory*, pp. 267-281. Bar-Joseph, Z.; Gerber, G.; & Gifford, D. K. (2003). Continuous representations of time-series gene expression data. *Journal of Computational Biology,* Vol. 10, pp. 341-356 Benjamini, Y.; & Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and

Boulesteix, A.; & Strimmer, K. (2006). Partial Least Squares: a Versatile Tool for the Analysis of High-dimensional Genomic Data. *Briefings in Bioinformatics,* Vol.8, pp. 32-44. Box, P. (1954). Some theorems on quadratic forms applied in the study of analysis of

A.; Sharma, P.; & Børresen-Dale, A-L. (2010). Gene Expression Profiling of Peripheral Blood Cells for Early Detection of Breast Cancer. *Breast Cancer Research,* 

Powerful Approach to Multiple Testing. *Journal of The Royal Statistical Society,* Vol.

variance problems: effect of inequality of variance in one way classification. *Ann.* 

are more and more big breakthroughs in the field of life science.

biomarker identification and drug discovery.

2008; Zerbino & Birney, 2008)

**3. Conclusion** 

**4. Acknowledgment** 

constructive comments.

Vol. 12, R7.

Ser B 57, pp. 289-300.

*Math.Stat.,* Vol. 25, pp. 290-302.

**5. References** 


http://statistics.stanford.edu/~ckirby/techreports/GEN/2000/2000-37B.pdf


Gentleman, R.; Carey V.J.; Huber, W.; Irizarry, R.A.; & Dudoit, S. (2005). *Bioinformatics and* 

George, E.I.; & Foster, D.P. (2000). Calibration and Emprical Bayes Variable Selection.

Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.;

Guo, X.; Qi, H.; Verfaillie, C.M.; & Pan, W. (2003). Statistical significance analysis of longitudinal gene expression data. *Bioinformatics,* Vol. 19, pp. 1628-1635. Hastie, T.; Tibshirani, R.; & Friedman, J. (2001). *The Elements of Statistical Learning* (1st ed.).

Hernandez, D; Francois, P.; Farinelli, L.; Osteras, M.; & Schrenzel, J. (2008). *De novo* bacterial

Heller, R.A.; Schena, M.; Chai, A.; Shalon, D.; Bedilion, T.; Gilmore, J.; Woolley, D. E.; &

Hong, F.; & Li, H. (2006). Functional Hierarchical Models for Identifying Genes with Different Time-Course Expression Profiles. *Biometrics,* Vol. 62, 534-544. Huang, X.; Pan, W.; Park, S.; Han, X.; Miller L. W.; & Hall, J. (2004). Modeling the

Ideker, T.; Thorsson, V.; Siegel, A.F.; & Hood L.E. (2000). Testing for Differentially-

Ildiko, E. F.; & Friedman, J. H. (1993). A statistical view of some chemometrics regression

Irizarry,R. A.;. Bolstad, B. M; Collin, F.; Cope, L. M.; Hobbs, B.; & Speed, T. P. (2003).

Ko, D.; Xu, W.; & Windle, B. (2005). Gene Function Classification Using NCI-60 Cell Line

Li, C.; & Wong, W. H. (2001). Model-based analysis of oligonucleotides arrays: model

Liu, Y.; Verducci, J.; Nagler, J.; Schultz, I.; Hook, S.; Cracium, G.; Sundling K.; & Hayton, W.

Liu, Y. (2011). *Properties of the SCOOP method of selecting gene sets. OhioLink*, Ph.D. dissertation, Dept. of Statistics,The Ohio State University, Columbus.

using cDNA microarrays. *Proc. Natl. Acad. Sci.*, Vol. 94, pp.2150-2155. Hoerl, A.E.; & Kennard, R. (1970). Ridge Regression:Biased Estimation for Nonorthogonal

Expression Monitoring. *Science*, Vol. 286, pp. 531-537.

computer. *Genome Res,* Vol. 18, pp. 802-809.

Problems. *Technometrics,* Vol. 12, pp. 55-67.

*Computational Biology,* Vol. 7, pp. 805-817.

tools. *Technometrics,* Vol. 35, pp. 109-135.

pp. e15 doi:10.1093/nar/gng015

research 0032.

pp. 192-208

York.

894.

*Biometrica,* Vol.87, pp. 731-747.

Springer, New York.

*Computational Biology Solutions Using R and Bioconductor*:(1st ed.) Springer, New

Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; Bloomfield, C. D.;& Lander E. S.(1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene

genome sequencing: millinos of very short reads assembled on a desktop

Davis, R. W. (1997) Discovery and analysis of inflammatory disease-related genes

Relationship between LVAD Support Time and Gene Expression Changes in Human Heart by Penalized Partial Least Squares. *Bioinformatics,* Vol. 20, pp. 888-

Expressed Genes by Maximum-Likelihood Analysis of Microarray Data. *Journal of* 

Summaries of Affymetrix GeneChip probe level data. *Nucleic Acids Research,* Vol.31

Gene Expression Profiles. *Computational Biology and Chemistry,* Vol. 29, pp. 412-419.

validation, design issues and standard error application. *Genome Biology,* Vol. 2, pp.

(2009). Time Course Analysis of Microarray Data for the Pathway of Reproductive Development in Female Rainbow Trout *Statistical Analysis and Datamining*, Vol. 2,


 *http://www.ds.unifi.it/StatGen2005/works/day4/speed\_latest.pdf*.

