#### **5.1. "Hit" selection in primary screen**

Although RNAi and small molecule assays differ in many ways, a common aim is to classify test samples with relatively higher or lower activities than the reference wells as "hits". This requires selecting an activity cut-off: test samples with values above or below the cut-off are identified as "hits". It is crucial to select a sensible cut-off value that differs enough from the noise level to reduce the false positive rate. Depending on the specific goals of the project, the cut-off might also need to be a value that leads to a manageable number of "hits" for follow-up studies. To guide scientists in this process, numerous "hit" selection methods have been developed for HT screens, as presented below.


**•** Quartile-based method: Similar to the previous approaches, the quartile-based "hit" selection method is based on the idea of treating the true "hits" as outliers and identifying them by setting upper and lower cut-off boundaries based on the quartiles and interquartile range of the data. The major advantage of the quartile-based method over median +/- k MAD is its more effective cut-off calculation for non-symmetrical data, where the upper and lower cut-offs can be determined independently. In a comparison of the three "hit" selection criteria presented so far, the quartile-based method outperformed the other two methods in detecting true "hits" with moderate effects (Zhang et al. 2006).

**•** SSMD and Robust SSMD: This parameter has become a widely used method for RNAi screening data analysis, mainly due to its ability to quantify RNAi effects with a statistical basis and its better control of false negative and false positive rates (Zhang 2007a; Zhang 2007b; Zhang 2009; Zhang 2010a; Zhang 2010b; Zhang 2011b; Zhang et al. 2010). SSMD is a robust parameter for capturing the magnitude of RNAi effects with various sample sizes. This scoring method also allows values to be compared across screens. The mean and standard deviation in the standard SSMD formula are replaced with the median and MAD in the robust version. The SSMD parameter used for primary screens without replicates holds a linear relationship with the Z-score method.

**•** Bayesian method: This method combines both plate-wise and experiment-wise information within a single "hit" selection calculation based on Bayesian hypothesis testing (Zhang et al. 2008b). Bayesian statistics incorporates a prior data distribution and a likelihood function to generate a posterior distribution function. In HT screening data analysis using this method, the experiment-wise and plate-wise information is incorporated into the prior and likelihood functions, respectively. With the availability of various prior distribution models, the Bayesian method can be applied either with positive and negative controls or with test sample wells. As this method enables the control of false discovery rates, it is a more powerful "hit" selection measure than the median +/- k MAD when the sample data is used to generate the prior distribution.

Drug Discovery

Data Analysis Approaches in High Throughput Screening

http://dx.doi.org/10.5772/52508

#### **5.2. "Hit" selection in confirmatory screen**

Different strategies are pursued for the confirmation of "hits" from RNAi and small molecule primary screens. While dose-response screens are commonly used to test compound activities in a dose-dependent manner in small molecule screens, this approach is not applicable to RNAi screens. Here, we present the "hit" selection methods for screens with replicates in two categories: dose-response analysis and others.

#### *5.2.1. Dose-response analysis*

After running a primary screen, in which a single concentration of compound is used, a subset of compounds is selected for a more quantitative assessment. These molecules are tested at various concentrations, and the assay response is plotted against the corresponding concentration. These types of curves are commonly referred to as "dose-response" or "concentration-response" curves, and they are generally defined by four parameters: top asymptote (maximal response), bottom asymptote (baseline response), slope (Hill slope or Hill coefficient), and the EC50 value.

A plot of signal as a function of concentration results in a rectangular hyperbola when the Hill coefficient is 1 (Fig. 2A). Because the concentration range covers several orders of magnitude, the x-axis is normally displayed on a logarithmic scale, resulting in a sigmoidal curve (Fig. 2B), which is generally fitted with the Hill equation:

$$\text{signal} = B + \frac{T - B}{1 + \left(\frac{\text{EC}_{50}}{x}\right)^{h}} \tag{20}$$

The most accepted benchmark for drug potency is the EC50 value, which corresponds to the concentration of compound (x) that generates a signal midway between the top (T) and bottom (B) asymptotes (Fig. 2B). The steepness of the curve is indicated by the Hill slope (h), also known as the Hill coefficient or the slope factor (Fig. 2C).
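As a minimal sketch of Eq. 20, the Python snippet below evaluates the Hill equation and checks that the signal at x = EC50 lies midway between the asymptotes. The function name `hill_signal` and all numerical values are illustrative assumptions, not part of the original analysis.

```python
def hill_signal(x, bottom, top, ec50, h):
    """Eq. 20: signal at concentration x (same units as ec50, x > 0)."""
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** h)


# Hypothetical parameters, chosen only for illustration.
bottom, top, ec50, h = 0.0, 100.0, 1.0, 1.0

midpoint = hill_signal(ec50, bottom, top, ec50, h)      # signal at x = EC50
low = hill_signal(ec50 / 1000.0, bottom, top, ec50, h)  # far below EC50
high = hill_signal(ec50 * 1000.0, bottom, top, ec50, h) # far above EC50
print(midpoint, low, high)
```

At x = EC50 the term (EC50/x)^h equals 1 for any h, so the signal is B + (T - B)/2 regardless of the slope; far below and far above the EC50 the signal approaches B and T, respectively.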

It is preferable to fit the Hill equation with concentrations on a logarithmic scale (Eq. 21), because the error associated with the EC50 in its log form follows a Gaussian distribution (Motulsky and Neubig 2010). In Eq. 21, the x values represent log[compound].

$$\text{signal} = B + \frac{T - B}{1 + \left(\frac{10^{\log \text{EC}_{50}}}{10^{x}}\right)^{h}} \tag{21}$$
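The log form can be checked numerically against the linear form. The sketch below assumes base-10 logarithms and uses hypothetical function names and values; it only illustrates that Eq. 21 is a re-parameterization of Eq. 20 and returns the same signal.

```python
import math


def hill_signal(x, bottom, top, ec50, h):
    """Eq. 20 (linear concentration), repeated so the snippet is self-contained."""
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** h)


def hill_signal_log(log_x, bottom, top, log_ec50, h):
    """Eq. 21: signal as a function of log10(concentration) and log10(EC50)."""
    return bottom + (top - bottom) / (1.0 + (10.0 ** log_ec50 / 10.0 ** log_x) ** h)


# Hypothetical values: EC50 = 2 uM, compound at 0.5 uM, Hill slope of 1.5.
ec50, x, h = 2e-6, 0.5e-6, 1.5
linear = hill_signal(x, 0.0, 100.0, ec50, h)
logged = hill_signal_log(math.log10(x), 0.0, 100.0, math.log10(ec50), h)
print(linear, logged)  # the two forms agree up to floating-point rounding
```

In a fit, the practical difference is that log EC50 rather than EC50 is the floated parameter, so its uncertainty is estimated on the scale where, according to the text, the error is Gaussian.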

In biochemical experiments, a Hill coefficient of 1 is indicative of a 1:1 stoichiometry of enzyme-inhibitor or protein-ligand complexes. Under such conditions, an increase from 10% to 90% response requires an 81-fold change in compound concentration. Hill coefficient values that deviate from unity could reflect mechanistic implications (such as cooperativity or multiple binding sites) or non-ideal behavior of the compound (acting as a protein denaturant or causing micelle formation) (Copeland 2005).
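The 81-fold figure follows from inverting Eq. 20. With B = 0 and T = 1, the concentration giving a fractional response f is x = EC50 × (f / (1 − f))^(1/h), so the ratio of the 90% and 10% concentrations is 81^(1/h). The sketch below, with hypothetical function names and assuming h > 0, evaluates this ratio for several slopes.

```python
def concentration_at_fraction(fraction, ec50, h):
    """Invert Eq. 20 (B = 0, T = 1): concentration giving `fraction` of the full response."""
    return ec50 * (fraction / (1.0 - fraction)) ** (1.0 / h)


def fold_change_10_to_90(h):
    """Concentration ratio between 90% and 10% response; EC50 cancels out."""
    return concentration_at_fraction(0.9, 1.0, h) / concentration_at_fraction(0.1, 1.0, h)


for h in (0.5, 1.0, 2.0, 3.0):
    print(h, fold_change_10_to_90(h))
```

The ratio is about 81 for h = 1 and about 9 for h = 2: steeper curves span the 10% to 90% range over a narrower concentration window, which is one reason unusually steep slopes are worth a closer look.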

For symmetrical curves, the inflection point corresponds to the relative EC50 value, which lies halfway between the asymptotes. This relative EC50 may differ from the actual EC50 if the top and bottom plateaus do not accurately represent 0% and 100% response. For instance, in Fig. 2D, the midpoint in the black curve dictates a value of 60% based on the positive and negative controls. When using the relative EC50, careful analysis of the data fitting is necessary to avoid deceptive results, as exemplified by the green curve in Fig. 2D. Curve fitting would provide a relative EC50 value of 1 for both the green and black curves, but based on the controls, the compound associated with the green curve would inhibit the assay by only 20%. Therefore, it is argued that the best approach is to use a two-parameter curve fit, in which only the EC50 and Hill coefficient values are allowed to float while the top and bottom boundaries are fixed, as presented in Fig. 2E (Copeland 2005).
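The gap between relative and actual EC50 can be made concrete by asking where a fitted curve crosses 50% on the control-defined scale. The sketch below solves Eq. 20 for that concentration; the function name `absolute_ec50` and the plateau values are hypothetical and are not read from Fig. 2D.

```python
def absolute_ec50(bottom, top, relative_ec50, h, target=50.0):
    """Concentration at which the fitted curve (Eq. 20) crosses `target`,
    with bottom/top expressed on the control-defined 0-100% scale.
    Returns None when the fitted curve never reaches the target."""
    if not (min(bottom, top) < target < max(bottom, top)):
        return None
    ratio = (top - bottom) / (target - bottom) - 1.0
    return relative_ec50 / ratio ** (1.0 / h)


# Two hypothetical fits sharing a relative EC50 of 1 and a Hill slope of 1.
partial_curve = absolute_ec50(bottom=0.0, top=60.0, relative_ec50=1.0, h=1.0)
weak_curve = absolute_ec50(bottom=0.0, top=20.0, relative_ec50=1.0, h=1.0)
print(partial_curve, weak_curve)
```

Under these assumed plateaus, the first curve reaches 50% of the control range only at roughly five times its relative EC50, and the second never reaches it, even though a four-parameter fit reports the same relative EC50 for both.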

Although EC50 is normally the main criterion to categorize compounds for downstream analysis, the value is highly dependent on assay conditions, such as cell number and enzyme/substrate amount (Copeland 2003). For enzymatic assays, a more attractive approach is to consider relative affinities. Cheng and Prusoff formulated a way to convert EC50 values to dissociation constants, thus reducing the burden of performing the multiple titrations associated with standard enzyme kinetics (Cheng and Prusoff 1973). The caveat of this convenient alternative is that the inhibitory modality of the compound must be known (Copeland 2005): competitive (Eq. 22), non-competitive (Eq. 23) or uncompetitive (Eq. 24).


**Figure 2.** Dose-response curves. A) Response vs. compound concentration resulting in a rectangular hyperbola curve. B) Response vs. logarithm of compound concentration resulting in a sigmoidal curve. The dashed lines indicate the concentration corresponding to half-maximal signal. C) Curves at different Hill slopes: 0.5 (black, closed circles), 1 (red, open circles), 2 (blue, closed squares), 3 (green, open squares) and -1 (pink, closed triangles). D) Relative (blue dashed lines) and actual (red dashed lines) EC50 values for a curve with a different top boundary from that of the control (black curve). The green and black curves have the same relative EC50. E) The red curve fits the data points (black circles) allowing 2 parameters (EC50, Hill coefficient) to float, while the blue curve fits the data refining all 4 parameters (EC50, Hill coefficient, top and bottom asymptotes). F) Curves corresponding to a full agonist (red), partial agonist (black), antagonist (green) and inverse agonist (blue).


$$\text{EC}_{50} = K_i \left(1 + \frac{S}{K_M}\right) \tag{22}$$

$$\text{EC}_{50} = \frac{S + K_M}{\frac{K_M}{K_i} + \frac{S}{\alpha \times K_i}} \tag{23}$$

$$\text{EC}_{50} = \alpha \times K_i \left(1 + \frac{K_M}{S}\right) \tag{24}$$

The dissociation constant of a reversible compound (Ki) can be calculated based on a single substrate concentration (S) and the Michaelis constant (KM). The constant α delineates the effect of inhibitor binding on the affinity of the substrate for the enzyme. It becomes evident that EC50 and Ki are roughly the same when the substrate concentration is much lower than KM (Eq. 22) or when α=1 (Eq. 23).
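Solving Eqs. 22-24 for Ki gives three small conversion functions. The sketch below is an illustration only: the function names and numerical values are assumptions, and S, KM and EC50 are taken to be in the same concentration units.

```python
def ki_competitive(ec50, s, km):
    """Eq. 22 solved for Ki: EC50 = Ki * (1 + S/KM)."""
    return ec50 / (1.0 + s / km)


def ki_noncompetitive(ec50, s, km, alpha):
    """Eq. 23 solved for Ki: EC50 = (S + KM) / (KM/Ki + S/(alpha*Ki))."""
    return ec50 * (km + s / alpha) / (s + km)


def ki_uncompetitive(ec50, s, km, alpha):
    """Eq. 24 solved for Ki: EC50 = alpha * Ki * (1 + KM/S)."""
    return ec50 / (alpha * (1.0 + km / s))


# Hypothetical assay: EC50 = 1.0 and KM = 10.0 (same units).
print(ki_competitive(1.0, s=0.01, km=10.0))                  # S << KM
print(ki_competitive(1.0, s=10.0, km=10.0))                  # S = KM
print(ki_noncompetitive(1.0, s=10.0, km=10.0, alpha=1.0))    # alpha = 1
```

With S much lower than KM, the competitive Ki is almost equal to the EC50, and with α = 1 the non-competitive Ki equals the EC50 at any substrate concentration, matching the two limiting cases stated above. At S = KM, a competitive inhibitor's Ki is half of its EC50.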

Dose-response curves can follow various patterns, depending on the biological system under investigation. For assays with a certain basal level, increasing concentrations of a full agonist trigger a maximal response of the system (Fig. 2F, red curve). A partial agonist displays a reduced response (efficacy) relative to a full agonist (Fig. 2F, black curve), even though, in this example, they both exhibit the same potency (i.e. the same EC50 values). An antagonist might have a certain affinity or potency, but it does not change the basal activity because its efficacy is zero (Fig. 2F, green curve). However, an antagonist reverses the actions of an agonist. In pharmacological terms, the effects of a competitive antagonist can be overcome by increasing the amount of agonist, whereas such an increase does not overcome the effects of non-competitive antagonists. Inverse agonists reduce the basal response of systems with constitutive activity (Fig. 2F, blue curve).

#### *5.2.2. Other methods*


In "hit" selection for confirmatory screens with a single concentration of compound or siRNA, hypothesis testing is commonly used to incorporate the variability of each sample, estimated from its replicates. Confirmatory screens (and some primary screens) are therefore performed in replicates so that the significance of the sample activity relative to a negative reference group can be calculated statistically. Because the previously listed Z- and robust Z-score methods assume that the variability of the test samples equals that of the negative controls or references, they are not reliable measures for confirmatory screens with replicates, where the variability of each sample can be calculated individually. The most common methods to analyze screening data with replicates are listed below.

**•** *t*-test: For "hit" selection in confirmatory screens, the *t* statistic and the associated *p* value are used to determine whether a sample compound or siRNA behaves significantly differently from the majority of the test samples or controls. A *t*-test determines whether the null hypothesis, namely that the mean of a test sample equals the mean of the negative reference group, is rejected or not. A paired *t*-test (first pairing the test sample and reference value within each plate, then calculating the *t* statistic on the paired values) is often preferred to avoid distortion of the results due to inter-plate variability, whereas an unpaired *t*-test is used for global comparison of the sample replicates with all reference values in the experiment (Zhang 2011a). The *p* value calculated from the *t* statistic is then used to determine the significance of the sample activity compared to the reference. An alternative to the standard *t*-test, the randomized variance model (RVM) *t*-test (Wright and Simon 2003), was found to be more advantageous for detecting relatively weak "hits" in screens with few replicates (Malo et al. 2010).
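The paired and unpaired statistics described above can be sketched with the Python standard library. Several choices here are assumptions rather than the chapter's prescription: the unpaired version uses Welch's unequal-variance form, the function names are hypothetical, and the replicate values are invented. Converting *t* to a *p* value additionally requires the cumulative *t* distribution, which the standard library does not provide, so that step is left out.

```python
import math
import statistics


def paired_t(sample, reference):
    """Paired t statistic: sample and reference values matched plate by plate."""
    diffs = [s - r for s, r in zip(sample, reference)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # statistic, degrees of freedom


def welch_t(sample, reference):
    """Unpaired t statistic without assuming equal variances (Welch's form)."""
    n1, n2 = len(sample), len(reference)
    v1, v2 = statistics.variance(sample), statistics.variance(reference)
    se2 = v1 / n1 + v2 / n2
    t = (statistics.mean(sample) - statistics.mean(reference)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Invented replicate values for one test sample on three plates,
# with the per-plate negative reference value for the paired test.
t_paired, df_paired = paired_t([12.0, 13.0, 14.0], [10.0, 10.0, 10.0])

# Same kind of sample compared globally with all reference wells.
t_welch, df_welch = welch_t([10.0, 12.0, 14.0], [5.0, 6.0, 7.0, 8.0, 9.0])
print(t_paired, df_paired, t_welch, df_welch)
```

The paired version removes plate-to-plate offsets by working on within-plate differences, which is the motivation given above for preferring it when inter-plate variability is a concern.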

| Software | Features | Programming Language |
|---|---|---|
| **NEXT-RNAi** | Library design and evaluation tools for RNAi screens (Horn et al. 2010) | Perl |
| **Screensaver** | Web-based laboratory information management system for management of library and screen information (Tolopko et al. 2010) | Java |
| **MScreen** | Web-based compound library and siRNA plate management, QC and dose-response fitting tools (Jacob et al. 2012) | PHP, Oracle/MySQL |
| **K-Screen** | Analysis, visualization, management and mining of HT screening data including dose-response curve fitting (Tai et al. 2011) | R, PHP, MySQL |
| **HTS-Corrector** | Statistical analysis, visualization and correction of systematic errors for all HT screens (Makarenkov et al. 2006) | C# |
| **web cellHTS2** | Web-based analysis toolbox for normalization, QC, "hit" selection and annotation for RNAi screens (Boutros et al. 2006; Pelz et al. 2010) | R/Bioconductor project |
| **RNAither** | Automated pipeline for normalization, QC, "hit" selection and pathway generation for RNAi screens (Rieber et al. 2009) | R/Bioconductor project |
| **HTSanalyzeR** | Gene set enrichment, network and gene set comparison analysis for post-processing of RNAi screening data (Wang et al. 2011) | R/Bioconductor project |

**Table 3.** Examples of open-access software packages for library management and statistical analysis of HT screening data.

**Acknowledgements**

This work was supported by the American Lebanese Syrian Associated Charities (ALSAC), St. Jude Children's Research Hospital, and National Cancer Institute grant P30CA027165.

