Studies indicate that plasma is likely a better substrate for proteome analysis than serum because the high proportion (>40%) of clot-related proteins and peptides in serum can obscure results (Haab et al., 2005; Rai et al., 2005; Tammen, 2005). Less invasive samples that are also amenable to protein biomarker discovery include urine, saliva and tear fluid. Although putative biomarkers can come from discovery work, candidates can also come from literature searches and from genomic or transcriptomic mining. All candidates, however, must undergo subsequent verification and validation (Pepe et al., 2008).

#### **2.4 Discovery strategies**

Discovery strategies allow many analytes to be measured simultaneously (*i.e.,* multiplexed analysis). The objective is to identify qualitative and/or quantitative differences across distinct clinical phenotypes that are reproducible and can then be adopted in a clinical setting. As discussed previously, however, what is observed in a discovery setting could be an artifact of statistical chance or experimental bias, and any findings must be rigorously validated. Discovery is typically costly, slow (low-throughput) and labor-intensive. Further, because the methods are not optimized for any single analyte, their performance characteristics are compromised (*i.e.,* limited sensitivity, selectivity and precision). The methods are therefore suitable only for surveying; they are not suited to efficient, precise and accurate quantification. When proteins are the targets of the discovery process, two orthogonal strategies are adopted: peptide-centric and protein-centric.

*Peptide-centric (bottom-up or shotgun) strategies*: This approach begins with proteolytic digestion of proteins to peptides; the 'digest' is then subjected to fractionation (HPLC) and tandem mass spectrometry (Duncan et al., 2010; Aebersold & Mann, 2003; Chait, 2006). The tandem mass spectra are converted to peptide sequences and the precursor proteins are then "assumed" (inferred) by computational approaches. Refinements of this approach sometimes incorporate fractionation prior to digestion and/or multiple stages of fractionation after digestion (Washburn et al., 2001; Wolters et al., 2001). The principal assumption of a bottom-up strategy is that the identity of intact proteins can be ascertained from their constituent peptide fragments. As discussed elsewhere, this assumption is frequently invalid (Duncan et al., 2010).

*Protein-centric (top-down) strategies*: With protein-centric approaches, intact proteins are first separated, typically by 2D gel electrophoresis, and the proteins are then isolated and identified by mass spectrometry. Typically, identification involves enzymatic cleavage of each individual protein to peptides and then either: (a) the masses of the peptide products of each pure protein are determined (*via* single-stage mass spectrometry); or (b) the tandem mass spectrum (fragmentation pattern) of one (or more) of the peptides is determined (*via* tandem mass spectrometry). One or both of these data sets are then used to interrogate a database and identify the protein. Relative protein amounts can be determined from the gel by staining. Because a top-down approach retains the intact protein, modifications and sequence variations can be investigated.

As we will illustrate, the discovery findings should be considered a set of leads that require meticulous validation, especially with respect to the utility of the biomarker(s) in a routine clinical setting.

#### **2.5 Biostatistical considerations**

Because proteomic studies of clinical samples can generate cumbersome data sets, bioinformaticians are frequently involved in study design (*e.g.,* patient selection and study size calculation) and in the hunt for significant and reproducible patterns in the data. Their objective is to find reproducible differences that correlate with a defined clinical outcome and that are independent of the influence of experimental bias, over-fitting and statistical chance.

The incorporation of a randomization strategy in sample analysis reduces bias by accounting for day-to-day variations in the analytical technique. Similarly, it is prudent to calibrate the instruments used in the analyses and to record their performance characteristics. In mass spectrometry, for example, calibration entails setting the mass scale against a standard mixture of purified proteins or peptides of known mass. Routine calibration of sensitive instruments that are subject to measurement 'drift' over time should be part of good laboratory practice.

Further, in the discovery phase, the objective is to have sufficient sample numbers to provide confidence that the list of protein candidates is worthy of follow-up during the validation phase. Typically in this phase of biomarker development the sample size is small, because of the cost and time of analysis and sometimes because of the difficulty associated with obtaining samples, whereas the number of proteins (independent variables) measured in each sample is typically very large. This ratio of samples to variables is contrary to the traditional application of multivariate statistics and leads to some unique considerations that have been discussed by others (Dowsey et al., 2009; Karp & Lilley, 2007).

Conversely, in the validation phase this relationship is inverted: patient cohorts are much larger (typically hundreds to thousands) and the number of biomarker candidates carried over from discovery is reduced according to the strength of their relationship to the clinical outcome or measure being assessed. The costs incurred by the validation phase therefore sit in the multi-million dollar range, far exceeding the costs of discovery. The financial implications alone may account for the relative dearth of publications on this phase: the main players, large pharmaceutical or diagnostic corporations, having invested large amounts of time and money, likely strive to protect the resultant intellectual property prior to further clinical testing and pre-market approval.
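
By way of illustration, the short sketch below randomizes the acquisition order of case and control samples across hypothetical instrument batches so that disease status is not confounded with run day; the sample names, batch size and seed are invented for the example.

```python
import random

# Hypothetical sample identifiers: 12 cases and 12 controls from a discovery cohort
cases = [f"case_{i:02d}" for i in range(1, 13)]
controls = [f"ctrl_{i:02d}" for i in range(1, 13)]

samples = cases + controls
random.seed(42)          # fixed seed so the randomized run order can be reproduced
random.shuffle(samples)  # randomize acquisition order across the whole cohort

runs_per_day = 8         # assumed instrument throughput per day (one batch)
for day, start in enumerate(range(0, len(samples), runs_per_day), start=1):
    print(f"Day {day}: {samples[start:start + runs_per_day]}")
```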

Bias, or any discrimination occurring due to a non-biological signal, can potentially confound discovery. For example, spurious results may arise because of differences in how patient samples are collected (*e.g.,* the type of blood collection tube or the time taken to freeze the sample) or because of the order in which the samples are analyzed. Over-fitting can occur when regression analysis tools are used to 'fit' (too) many variables to a limited set of outcomes; the discriminating 'pattern' or 'signature' then becomes an artifact of the patient cohorts. To resolve issues of bias, statistical analyses must consider the biology of the system being analyzed and take into account the assumptions and limitations of the methods (Ransohoff, 2009). Statistical tests capable of gauging the level of false positives across multiple comparisons include Student's unpaired t-test (for two-group comparisons), ANOVA (for comparisons of three or more groups) and linear regression (for quantitative or correlative studies) (Dowsey et al., 2009). Alternatively, if the data are not normally distributed, the non-parametric Mann-Whitney and Kruskal-Wallis tests should be substituted (Karp & Lilley, 2007). These methods can be used to analyze the features one at a time and then to compile a ranked list based on a combination of p values and effect sizes. As noted earlier, a longitudinal design can minimize the potential for bias relative to a typical case-control study. Nevertheless, false biomarker leads are common, and rigorous validation is therefore essential.
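
As a minimal sketch of this one-feature-at-a-time testing and ranking (assuming NumPy and SciPy are available), the example below applies a Mann-Whitney test to each protein in a simulated two-group data set and ranks candidates by p value and effect size; the data, group sizes and effect-size measure (log2 fold change of group medians) are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated intensities: 20 case and 20 control samples across 500 proteins
n_case, n_ctrl, n_prot = 20, 20, 500
case = rng.lognormal(mean=10, sigma=1, size=(n_case, n_prot))
ctrl = rng.lognormal(mean=10, sigma=1, size=(n_ctrl, n_prot))
case[:, :10] *= 2.0          # spike a 2-fold increase into the first 10 proteins

results = []
for j in range(n_prot):
    # Non-parametric two-group comparison, one protein at a time
    u, p = stats.mannwhitneyu(case[:, j], ctrl[:, j], alternative="two-sided")
    log2fc = np.log2(np.median(case[:, j]) / np.median(ctrl[:, j]))  # effect size
    results.append((j, p, log2fc))

# Rank candidates by p value, then by the magnitude of the effect size
ranked = sorted(results, key=lambda r: (r[1], -abs(r[2])))
for j, p, fc in ranked[:10]:
    print(f"protein_{j:03d}  p={p:.2e}  log2FC={fc:+.2f}")
```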

The false discovery rate (FDR) can also be calculated (Benjamini et al., 2003; Storey, 2003; Strimmer, 2008). By setting the FDR threshold, it is possible to limit the risk of false positive identifications among the proteins reported as differentially expressed; for example, at an FDR of 0.05, we expect only ~5% of the reported proteins to be false positives. However, by doing so, the process of discovery may be compromised by overly stringent criteria. Although proteins displaying the most dramatic changes may appear to be useful biomarkers, it is important to attempt to relate their changes to the underlying pathology. For example, acute phase proteins are frequently identified in plasma- or serum-based studies as 'specific' biomarkers of a wide range of chronic disorders, including arthritis and cancer, but clearly they are not specific to any one disease (Addona et al., 2009).
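
A minimal sketch of the Benjamini-Hochberg step-up procedure, one common way to control the FDR across many protein-level comparisons, is shown below; the p values are invented for illustration and NumPy is assumed.

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Return a boolean mask of p values declared significant at the given FDR."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # sort p values ascending
    thresholds = fdr * (np.arange(1, m + 1) / m)   # BH step-up thresholds (i/m) * q
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank i with p_(i) <= (i/m) * q
        significant[order[: k + 1]] = True         # reject H0 for the k smallest p values
    return significant

# Illustrative p values from 10 hypothetical protein comparisons
pvals = [0.0002, 0.009, 0.013, 0.021, 0.048, 0.11, 0.35, 0.42, 0.68, 0.91]
mask = benjamini_hochberg(pvals, fdr=0.05)
for p, sig in zip(pvals, mask):
    print(f"p={p:.4f}  significant at FDR 0.05: {sig}")
```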

Cross-validation procedures can be used to reduce false positives. In this instance, one data set is used to build the model (training) and a second data set, generated from an independent patient cohort, is used to assess the predictive accuracy of the model (testing). Another commonly used strategy is K-fold cross-validation, in which the analysis is repeated over multiple splits of the data: the data are divided into K subsets, a predictive model is built on all but one subset, the remaining subset is used to test predictive accuracy, and the process is repeated until each subset has been held out once. Although useful in the period immediately after discovery, validation based on splitting a single data set is of limited use because confounding factors can introduce systematic biases into both the training and test splits.
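
As an illustration of K-fold cross-validation (subject to the caveat above about splits of a single data set), the sketch below evaluates a simple classifier on simulated discovery data; scikit-learn is assumed as the modelling library, and the data, model and number of folds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Simulated discovery data: 60 patients x 200 protein features, binary outcome
X = rng.normal(size=(60, 200))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :5] += 1.0                     # modest signal in the first 5 proteins

# Standardize inside the pipeline so scaling is learned on the training folds only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("per-fold AUC:", np.round(scores, 2), " mean AUC:", np.round(scores.mean(), 2))
```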

Given the issues noted above, it is advisable to validate initial 'discoveries' on independent sample sets, perhaps incorporating analysis by orthogonal methods that are more amenable to the requirements of clinical throughput and precision (Dupuy & Simon, 2007). Re-analysis or meta-analysis using raw data from other research groups is another possibility, although data standards, such as the 'minimum information about a proteomics experiment' (MIAPE) (Taylor et al., 2007), often do not extend into the initial design of clinical studies. Consequently, detailed clinical data may not be captured and reported consistently for clinical proteomics experiments, limiting the ability of investigators to independently verify, combine or correlate data from multiple experiments.

For thorough validation, the number of patient samples required should be determined through the use of statistical tools that take into account the imprecision of the analytical method, inter-patient variability and the acceptable threshold of difference that is deemed significant for a given biomarker application (Ye et al., 2009). Patient numbers (biological replicates) and other statistical considerations of power have also been discussed in detail elsewhere (Cairns et al., 2009).
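
As a rough illustration of such a calculation, the sketch below estimates the number of patients per group for a two-group comparison using the standard normal-approximation formula, combining an assumed assay CV, an assumed inter-patient (biological) CV and a hypothetical minimum difference of interest; SciPy is assumed.

```python
from math import ceil, sqrt
from scipy.stats import norm

def samples_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample comparison of means
    (normal approximation): n = 2 * (z_{1-a/2} + z_{power})**2 * (sd / delta)**2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Hypothetical numbers: total variability combines assay imprecision (CV 10%)
# and inter-patient (biological) variability (CV 30%), treated on a relative scale.
cv_assay, cv_biological = 0.10, 0.30
sd_total = sqrt(cv_assay**2 + cv_biological**2)   # combined CV, as a fraction
delta = 0.25                                      # smallest relative difference worth detecting

print("patients per group:", samples_per_group(delta, sd_total))
```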

#### **2.6 Feature selection and classifier assessment**

Several multivariate analysis tools are available for the analysis of large multidimensional data sets, and some of these have been assembled into commercial software packages. Visual tools, including principal component analysis, hierarchical cluster analysis and heat maps, which display variance, relatedness and patterns in the data (respectively), are also available and are useful preliminary aids in data analysis. These analyses strive to represent variance graphically and give, for example, an overall view of protein expression prevalence within outcome groups (in the case of heat maps) or of the 'relatedness' of expression levels between different proteins (with hierarchical trees) (Marengo et al., 2006; Marengo et al., 2008).

Emphasis, however, should be placed on supervised or semi-supervised methods, such as distribution-free learning (kernel-based or Bayesian analysis) or support vector machines (SVMs), which allow for advanced categorization and classification of multidimensional proteomic data with respect to clinical data. This process, known as 'feature selection', leads ultimately to the creation of a 'classifier', a biomarker-driven algorithm specific to the disease and outcome being measured (Liu et al., 2009; Zhu et al., 2009). Since most protein expression profiles will likely not be correlated with a specific outcome, supervised methods screen out uninformative proteins and select protein combinations from which to develop a 'classifier'. The SVM principle has recently been applied to guide feature selection of exhaled peptides as potential biomarkers of asthma (Bloemen et al., 2011).
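
A minimal sketch of SVM-guided feature selection is given below, using recursive feature elimination with a linear SVM on simulated data; scikit-learn is assumed, and the data and parameters are illustrative rather than a reproduction of any cited study.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# Simulated profiles: 50 patients x 300 proteins, with signal in proteins 0-4
X = rng.normal(size=(50, 300))
y = np.array([0] * 25 + [1] * 25)
X[y == 1, :5] += 1.5

# A linear SVM ranks features by the magnitude of their weights; RFE repeatedly
# drops the weakest features until only the requested number remains.
svm = LinearSVC(C=0.1, max_iter=10000)
selector = RFE(estimator=svm, n_features_to_select=10, step=0.1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)
print("proteins retained for the classifier:", selected)
```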

Depending on whether proteome studies are focused on biomarkers for (i) diagnosis (class discovery), (ii) prognosis (outcome-related) or (iii) prediction (supervised prediction), different rationales should be employed to generate and assess the reliability of a classifier. Class discovery methods are best suited to grouping proteins into subsets that elucidate pathways with similar expression profiles across patient subgroups. In outcome-related studies, the goal is to identify which proteins have expression levels that correlate with outcomes grouped into discrete classes: for example, arthritis patients with a good versus a bad prognosis. When prediction of patient outcome is the aim, supervised prediction methods use a selected proteome profile to generate an algorithm based on individual profiles. In supervised class prediction studies, a totally independent cohort should be used for cross-validation purposes when rigorous testing of a predictive model is desired (Dupuy & Simon, 2007).

The statistical significance of a selected proteome 'classifier' gives an incomplete estimate of its predictive ability and potential clinical utility. The numbers of true and false positives and negatives should be presented, allowing the calculation of sensitivity and specificity. This reveals clinically relevant information on how the classifier performs in each outcome category. Lists of statistical tools and recommendations for their application have been reported (Dupuy & Simon, 2007; Karp & Lilley, 2007; Marengo et al., 2006). Depending on the clinical question, however, there may be multiple outcome measures that are not amenable to a simple binary classification system. Statistical evidence of prevalence and the analytical limits of detection of a specific group of isoforms can then direct the study towards validation of candidate biomarkers in much larger, multi-center patient populations.
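
To make the reporting of true and false positives and negatives concrete, the short sketch below computes sensitivity, specificity and predictive values from illustrative confusion-matrix counts; the numbers are invented for the example.

```python
# Illustrative counts for a binary classifier evaluated on a validation cohort
tp, fp, tn, fn = 85, 12, 160, 15   # hypothetical true/false positives and negatives

sensitivity = tp / (tp + fn)       # proportion of true cases detected
specificity = tn / (tn + fp)       # proportion of controls correctly called negative
ppv = tp / (tp + fp)               # positive predictive value
npv = tn / (tn + fn)               # negative predictive value

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```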

**3. The biomarker development process**

Three distinct phases can be delineated within a typical development pipeline: discovery, verification and validation. These can be further subdivided so that there is a reduced number of candidates at each stage, each with an increased probability of utility 51. In the subsequent sections we aim to clearly segregate these phases in the biomarker 'pipeline' and further expand on the vastly different requirements of each (Figure 1). This process is prefaced by a brief overview of pre-analytical factors which can introduce unwanted bias or variation.

**3.1 The discovery phase**

In the discovery phase, proteomic platforms are unsupervised and are used to highlight qualitative and/or quantitative differences in multiple proteins across distinct clinical phenotypes. The process of discovery is focused on assessing many candidates, while minimizing the probability of false positives and negatives.

Discovery by definition requires an analytical approach which does not preempt the identity of the biomarker candidates. Generally speaking, as most discovery methods prioritise the