**2.2. Classical versus robust (resistant) statistics**

One of the main assumptions in analyzing HT screening data is that the data is normally distributed, or at least complies with the central limit theorem, whereby the means of sampled values converge to a normal distribution unless there are systematic errors associated with the screen (Coma et al. 2009). Therefore, log transformations are often applied to the data in the pre-processing stage to make the distribution more symmetric around the mean, as in a normal distribution; to represent the relationships between variables more linearly, which is especially useful for cell growth assays; and to make efficient use of the assay quality assessment parameters (Sui and Wu 2007).
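As an illustrative sketch (the readout values and the choice of log base are hypothetical, not from the chapter), a log transformation can pull a multiplicative, right-skewed readout toward the symmetry assumed above; here symmetry is checked with Pearson's second skewness coefficient, 3 × (mean − median) / std:

```python
import math
import statistics as stats

def pearson_skew(xs):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev.
    Zero for a symmetric distribution; positive for a right-skewed one."""
    return 3 * (stats.mean(xs) - stats.median(xs)) / stats.stdev(xs)

# Hypothetical multiplicative (fold-change-like) readouts from a cell growth assay.
raw = [10, 20, 20, 40, 40, 40, 80, 80, 160]
logged = [math.log10(x) for x in raw]

print(f"raw   skew = {pearson_skew(raw):+.2f}")     # strongly right-skewed (about +0.93)
print(f"log10 skew = {pearson_skew(logged):+.2f}")  # about 0 after the log transform
```

After the transform the mean and median coincide, which is the behavior the quality-assessment parameters discussed above implicitly rely on.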

In HT screening practice, the presence of outliers - data points that do not fall within the range of the rest of the data - is common. Distortions that outliers cause to the normal distribution of the data negatively impact the results. An HT data set with outliers therefore needs to be analyzed carefully to avoid an unreliable and inefficient "hit" selection process. Although outliers in control wells can be identified easily, outliers in the test samples may be misinterpreted as real "hits" instead of random errors.
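How strongly a single outlier drags the classical parameters, and how little it moves their robust counterparts, can be sketched with hypothetical well readouts (the `mad` helper and its 1.4826 scale factor are standard conventions, not taken from the chapter):

```python
import statistics as stats

def mad(values, scale=1.4826):
    """Median absolute deviation. The ~1.4826 scale factor makes MAD
    comparable to the standard deviation for normally distributed data."""
    med = stats.median(values)
    return scale * stats.median([abs(v - med) for v in values])

# Hypothetical readouts from eight wells; in the contaminated set the last
# well (400) is a dispensing artifact rather than real biology.
clean = [100, 98, 103, 101, 99, 102, 97, 100]
contaminated = clean[:-1] + [400]

for label, xs in [("clean", clean), ("one outlier", contaminated)]:
    print(f"{label:>12}: mean={stats.mean(xs):6.1f} std={stats.stdev(xs):6.1f} "
          f"median={stats.median(xs):6.1f} MAD={mad(xs):4.1f}")
# The mean and std explode (100.0 -> 137.5, 2.0 -> 106.1),
# while the median and MAD barely move (100.0 -> 100.5, 2.2 -> 3.0).
```

One rogue well is enough to shift a mean-based cut-off across the whole plate, which is why the robust alternatives discussed below are attractive.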


Data Analysis Approaches in High Throughput Screening

http://dx.doi.org/10.5772/52508


There are two approaches for the statistical analysis of data sets with outliers: classical and robust. One can choose to replace or remove outliers based on the truncated mean or similar approaches and continue the analysis with classical methods. However, robust statistical approaches have gained popularity in HT screening data analysis in recent decades. In robust statistics, the median and the median absolute deviation (MAD) are used as statistical parameters in place of the mean and the standard deviation (std), respectively, to diminish the effect of outliers on the final analysis results. Although there are numerous statistical methods to detect and remove or replace outliers (Hund et al. 2002; Iglewicz and Hoaglin 1993; Singh 1996), robust statistics is preferred for its insensitivity to outliers (Huber 1981). The robustness of an analysis technique can be assessed by two main approaches, influence functions (Hampel et al. 1986) and the breakdown point (Hampel 1971); the latter is the more intuitive in the context of HT screening. The breakdown point of a sample series is defined as the fraction of outlying data points that can be tolerated before the statistical parameters take on drastically different values that no longer represent the distribution of the original data set. In a demonstrated example on a five-sample data set, robust parameters were shown to perform better than the classical parameters after the data set was contaminated with outliers (Rousseeuw 1991). It was also emphasized that the median and MAD have a breakdown point of 50%, while the mean and std have a breakdown point of 0%, indicating that sample sets with up to 50% outlier density can still be handled successfully with robust statistics.

#### **2.3. False discovery rates**

As mentioned previously, depending on the specificity and sensitivity of an HT assay, erroneous assessment of "hits" and "non-hits" is likely. Especially in genome-wide siRNA screens, false positive and false negative results may mislead scientists in the confirmatory studies. While false discoveries may stem from indirect biological regulation of the gene of interest through other pathways that are not in the scope of the experiment, they may also stem from random errors experienced in the screening process. Although the latter can be easily resolved in the follow-up screens, the former may require a better assay design (Stone et al. 2007). Lower false discovery rates can also be achieved by careful selection of assay reagents to avoid inconsistent measurements (outliers) during screening. The biological interference effects of the reagents in RNAi screens fall into two categories: sequence-dependent and sequence-independent (Echeverri et al. 2006; Mohr and Perrimon 2012); accordingly, off-target effects and low transfection efficiencies are the main challenges to be overcome in these screens. Moreover, selection of the appropriate controls for either small-molecule or RNAi screens is crucial for screen quality assessment as well as for "hit" selection, so that the false discovery rates can be inherently reduced. Positive controls are often chosen from small-molecule compounds or gene-silencing agents that are known to have the desired effect on the target of interest; however, this may be a difficult task if very little is known about the biological process (Zhang et al. 2008a). On the other hand, selection of negative controls from non-targeting reagents is more challenging in RNAi screens, due to the higher potential of biological off-target effects compared to the negative controls used in small-molecule screens (Birmingham et al. 2009).

Another factor that might interfere with the biological process in an HT screening assay is bioactive contaminants that may be released from the consumables used in the screening campaign, such as plastic tips and microplates (McDonald et al. 2008; Watson et al. 2009). Altered assay conditions caused by leached materials may produce unreliable and misleading screening results and inflate the false discovery rates. Hence, the effects of laboratory consumables on the assay readout should be carefully examined during assay development.

The false discovery rates are also highly dependent on the analysis methods used for "hit" selection, and they can be statistically controlled. The false discovery rate is defined as the ratio of false discoveries to the total number of discoveries. A *t*-test and the associated *p* value are often used for hypothesis testing in a single experiment and can be interpreted as the false positive discovery rate (Chen et al. 2010). However, the challenge arises when multiple hypothesis testing is needed or when the comparison of results across multiple experiments is required. For HT applications, a Bayesian approach was developed to enable plate-wise and experiment-wise comparison of results in a single process while still controlling the false discovery rates (Zhang et al. 2008b). Another method, based on the strictly standardized mean difference (SSMD) parameter, was shown to control the false discovery and false non-discovery rates in RNAi screens (Zhang 2007a; Zhang 2010b; Zhang et al. 2010). By taking the data variability into account, the SSMD method is capable of determining "hits" with higher assurance than the Z-score and *t*-test methods.

**3. Normalization and systematic error corrections**

Despite meticulous assay optimization efforts that consider all the factors mentioned previously, variances in the raw data across plates are expected even within the same experiment. Here, we treat these variances as "random" assay variability, separate from the systematic errors that can be linked to a known cause, such as the failure of an instrument. Uneven assay performance may occur unpredictably at any given time during screening. Hence, normalization of the data within each plate is necessary to make results comparable across plates and experiments, allowing a single cut-off for the selection of "hits".

**3.1. Normalization for assay variability**

When normalizing HT screening data, two main approaches can be followed: controls-based and non-controls-based. In controls-based approaches, the assay-specific in-plate positive and negative controls are used as the upper (100%) and lower (0%) bounds of the assay activity, and the activities of the test samples are calculated with respect to these values. Al‐
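The controls-based normalization described above can be sketched as follows; the well values and the `percent_activity` helper are hypothetical, not from the chapter:

```python
import statistics as stats

def percent_activity(raw, neg_controls, pos_controls):
    """Controls-based normalization: map the in-plate negative-control mean
    to 0% activity and the positive-control mean to 100%."""
    lo = stats.mean(neg_controls)   # 0% bound
    hi = stats.mean(pos_controls)   # 100% bound
    return 100.0 * (raw - lo) / (hi - lo)

# Hypothetical raw readouts from one plate (arbitrary fluorescence units).
neg = [102, 98, 100, 101]    # in-plate negative controls
pos = [498, 505, 500, 497]   # in-plate positive controls

for well in [105, 300, 480]:
    print(f"raw {well:>3} -> {percent_activity(well, neg, pos):5.1f}% activity")
```

Because test wells can read below the negative or above the positive controls, values outside 0-100% are possible; once every plate is expressed on this common scale, a single activity cut-off can be applied across plates and experiments.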
