**1.5 Biomarker-based classifiers**

An example of a state-specific marker is shown in Figure 2. Each "+" represents, for example, the blood concentration of a particular biochemical. The individuals in the left column are in a specific disease state, while those in the right column are not and are therefore considered to be in a healthy state, at least with respect to this disease. Individuals in each state have different blood concentrations of this biochemical due to genetic and environmental differences between individuals and any experimental uncertainty in the measurement. What is clear is that the range of concentrations for individuals in the disease state is significantly higher than for those not in this state. Such a marker can be used to classify the individuals into three groups: they are in the disease state if the blood concentration is above an upper threshold, they are not in this disease state if the concentration is below a lower threshold, and they are undetermined if the blood concentration is between these thresholds.

A Comparison of Biomarker and Fingerprint-Based Classifiers of Disease 185

Fig. 2. Values of a state-specific marker for individuals in a disease state (left) and in a healthy state (right).
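The three-way rule for a single state-specific marker can be sketched in a few lines. This is only an illustration: the function name, the threshold values, and the sample concentrations are hypothetical and are not taken from the study.

```python
def classify_by_marker(concentration: float, lower: float, upper: float) -> str:
    """Three-way call from one state-specific marker.

    Above `upper` -> disease, below `lower` -> healthy, otherwise undetermined.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if concentration > upper:
        return "disease"
    if concentration < lower:
        return "healthy"
    return "undetermined"


# Hypothetical blood concentrations in arbitrary units (not study data).
samples = [9.1, 6.4, 3.2, 7.8, 4.9]
calls = [classify_by_marker(c, lower=4.0, upper=7.0) for c in samples]
print(calls)
```

Values that fall exactly on a threshold are treated here as undetermined; that boundary convention is a choice made for the sketch, since the text does not specify one.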

While blood concentrations of biochemicals are one possible means of examination, they are not the only one. Concentrations of biochemicals in the urine are another, and this can be extended to tears, mucus, or virtually any biofluid. Instead of directly measuring the concentration of specific compounds, mass spectra (with or without pre-fractionation) and 2D NMR of these biofluids can also be used to measure abundance. The difference with these spectral methods is that the abundance of a compound can be compared from one individual to the next without knowing the identity of this compound. Therefore, examining the intensity or area of spectral peaks is called an undirected search, since a list of compounds to examine was not created beforehand, while direct measurements of concentrations or intensity measurements from microarray experiments are directed searches, since the search is over a set of pre-defined compounds.

In general, the set of biochemicals whose concentration is directly measured or examined by microarray analysis, as well as the set of peaks present in various spectra, are known as features. For each individual, each of these features has a corresponding value. This value can be the concentration, the logarithm of the relative fluorescence intensity, or the intensity or area of the spectral peak. The search for a putative biomarker is over the set of *N* available features, and each individual is represented by an array of *N* numbers representing the values of these features.


Since the values of many features are known for each individual, it is possible to construct classifiers using two or more features. An algorithm would search through sets of two or more features to find a set that optimally classified a given set of individuals, which is known as a training set. The goal is to maximize the number of correctly classified individuals, so if two features are used and one is that shown in Figure 2, the second feature would try to correctly resolve those in the undetermined region without upsetting the correct classification of the other individuals. Therefore, the action of this second feature in the classifier is to specifically act on those individuals in the undetermined region. The first feature (Figure 2) therefore is a state-specific marker since its intensity is largely controlled by the state of the individual, while the second feature is individual-specific since it only acts on those individuals who have an intensity of the first feature in the undetermined region.
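A two-feature classifier of this kind can be sketched as follows, assuming the first feature uses a lower and an upper threshold and the second feature is consulted only inside the undetermined band. The data, thresholds, and function names are hypothetical, and the simple scan over candidate cuts stands in for whatever search algorithm is actually used.

```python
def classify_two_features(f1, f2, lower, upper, cut2):
    """Feature 1 decides outside the undetermined band; feature 2 acts only inside it."""
    if f1 > upper:
        return "disease"
    if f1 < lower:
        return "healthy"
    return "disease" if f2 > cut2 else "healthy"


def training_accuracy(rows, lower, upper, cut2):
    """Fraction of (f1, f2, label) rows that are classified correctly."""
    hits = sum(classify_two_features(f1, f2, lower, upper, cut2) == label
               for f1, f2, label in rows)
    return hits / len(rows)


# Hypothetical training set: (feature 1, feature 2, true state).
rows = [
    (9.0, 1.0, "disease"), (8.2, 5.0, "disease"),
    (6.1, 4.2, "disease"), (5.5, 3.9, "disease"),   # undetermined by feature 1
    (5.8, 1.1, "healthy"), (4.6, 0.7, "healthy"),   # undetermined by feature 1
    (3.0, 4.8, "healthy"), (2.2, 0.9, "healthy"),
]
LOWER, UPPER = 4.0, 7.0

# Try every observed value of feature 2 as a candidate cut and keep the best.
candidates = sorted({f2 for _, f2, _ in rows})
best_cut = max(candidates, key=lambda c: training_accuracy(rows, LOWER, UPPER, c))
print(best_cut, training_accuracy(rows, LOWER, UPPER, best_cut))
```

In this toy data the second feature takes low values in one diseased individual and high values in one healthy individual, yet it does no harm, because its value is ignored whenever the first feature is decisive.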

Fig. 3. (a) Scatter plot for diseased (red) and healthy (green) individuals using the values of a pair of correlated markers; (b) Values for these markers assuming one state for each category; and (c) Values for these markers assuming two disease states and one healthy state.



This argument suggests that statistical methods which find features that are significantly different in magnitude depending upon an individual's state are all that is needed to find any state-specific markers. Such independent markers can be found if the healthy and diseased categories are each represented by a single state. Figure 3 displays a situation where the diseased category (shown in red) is actually composed of two states (D1 and D2). This can only be seen through the action of a concerted pair of markers, Marker 1 and Marker 2. State D1 has a high intensity in Marker 1 and State D2 has a high intensity in Marker 2, while the healthy individuals have a low intensity in both features. Figure 3b shows intensity plots for these two markers under the assumption that there is a single diseased state and a single healthy state. It is questionable whether a given statistical method would find the difference in the intensities of these features significant. Only by correctly distinguishing the state of each individual can one see that Marker 1 is a good classifier for State D1 and Marker 2 for State D2 (Figure 3c).

#### **1.6 Bias, chance and generalizability**

Ransohoff [2005a, 2005b] has presented three factors that must be explored in any classification study: bias, chance and generalizability. Until now, any marker that clearly distinguishes individuals in different states, either alone or in a concerted action with another feature, has been denoted as a putative biomarker. Before it can become a true biomarker, one has to ensure that the marker is not due to an underlying bias. For example, if all individuals in the disease state are being given a particular drug, there is no way to determine if the change in the feature value is due to the disease or the drug. There is no way to remove this bias, and such situations should be excluded in the initial study design. As a second example, the individuals in the disease state may be significantly older than those in the healthy state. Many diseases are more prevalent in older individuals, and it may be very difficult to find age-matched patients who are disease free or are not on a regular drug treatment. If a random collection of age-matched individuals without signs of the particular disease state is taken to be the healthy category, it is likely that this category will be composed of a number of states due to other diseases or drug responses. Markers separating each of these "healthy" states from the disease state would have to be found. Finding all required biomarkers would be very difficult within a single set of features. In addition, if the number of individuals in a particular healthy state was small, the significance of any biomarker may be suspect (see below). In this case, the effect of age can be examined. If there is no correlation between the feature value and the age of the individual in either the disease or healthy state, one can conclude that age is not the source of the difference in feature values.
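One simple way to examine the age effect is to compute the correlation between age and feature value separately within each state. The sketch below uses a plain Pearson coefficient; the choice of Pearson correlation, the function name, and the (age, value) pairs are assumptions made for illustration, not the procedure used in the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (age, feature value) pairs for each state.
groups = {
    "disease": [(62, 8.1), (70, 7.4), (55, 8.6), (48, 7.9), (66, 8.3)],
    "healthy": [(41, 3.2), (35, 3.9), (52, 3.1), (29, 3.6), (47, 3.8)],
}
for state, pairs in groups.items():
    ages, values = zip(*pairs)
    print(state, round(pearson_r(ages, values), 2))
```

With groups this small, a weak correlation is only suggestive; a larger sample or a formal test would be needed before ruling age out as a source of bias.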

If the available individuals in the disease and healthy states are divided into a training set and a testing set, it is theoretically possible to construct one or more classifiers using the training set that can accurately classify the individuals in the testing set without using a state-based marker. Such a classifier is a chance fit to the available data, and we have shown that accurate results can be obtained for certain classifiers without any state-specific marker being present in the set of available features. Therefore, simply constructing a good classifier is not sufficient to demonstrate the presence of a state-specific marker.

The basic assumption is that if a classifier is able to accurately classify both a training set and a testing set of data, then this classifier will be useful for all individuals in the population from which these individuals were taken. In other words, any classifier that accurately classifies a sufficient sample from a population should be generalizable to the entire population. We assert that this assumption may be true only if the classifier is strictly composed of state-specific markers. Any classifier that is a chance fit to the available data will not be generalizable to the entire population.
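The chance-fit argument can be illustrated with a small simulation. It assumes a single-feature cut classifier, uniformly random intensities, and arbitrary cohort sizes; none of these choices come from the study, and they are far simpler than the classifiers it examines. A cut is fitted to each of 300 noise features on the training set, the feature that scores best on the testing set is selected, and that classifier is then applied to fresh individuals drawn from the same noise.

```python
import random

N_FEATURES = 300            # same feature count as the proposed study
rng = random.Random(2008)   # fixed seed so the sketch is repeatable


def accuracy(values, labels, cut, sign):
    """Fraction correct when 'diseased' is predicted for sign * (value - cut) > 0."""
    hits = sum((sign * (v - cut) > 0) == lab for v, lab in zip(values, labels))
    return hits / len(labels)


def fit_stump(values, labels):
    """Return (training accuracy, cut, sign) for the best cut on one feature."""
    best = (0.0, None, 1)
    for cut in values:
        for sign in (1, -1):
            acc = accuracy(values, labels, cut, sign)
            if acc > best[0]:
                best = (acc, cut, sign)
    return best


def random_cohort(n_per_class):
    """Feature columns of pure noise plus balanced labels (True = diseased)."""
    labels = [True] * n_per_class + [False] * n_per_class
    columns = [[rng.random() for _ in labels] for _ in range(N_FEATURES)]
    return columns, labels


train, train_labels = random_cohort(20)
test, test_labels = random_cohort(10)

results = []
for f in range(N_FEATURES):
    train_acc, cut, sign = fit_stump(train[f], train_labels)
    test_acc = accuracy(test[f], test_labels, cut, sign)
    results.append((test_acc, train_acc, f, cut, sign))

# Selecting on testing accuracy makes the testing set part of the training process.
test_acc, train_acc, f, cut, sign = max(results)

# Fresh individuals from the same population: noise again for the chosen feature.
fresh_labels = [True] * 500 + [False] * 500
fresh_values = [rng.random() for _ in fresh_labels]
fresh_acc = accuracy(fresh_values, fresh_labels, cut, sign)

print(f"feature {f}: train {train_acc:.2f}, test {test_acc:.2f}, fresh {fresh_acc:.2f}")
```

Because the features carry no information, the accuracy on fresh individuals should stay near 50% however well the selected classifier scores on the training and testing sets.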

#### **1.7 Coverage, uniqueness, and significance**

186 Biomarker


The simplest example of fingerprinting is a straightforward decision tree, like the one shown in Figure 4. Assuming that the entire dataset is composed of 60 diseased and 60 healthy individuals, the intensity of Feature 1 splits the dataset into two groups: 40 diseased and 20 healthy individuals if the intensity of this feature is below Cut-1, and 20 diseased and 40 healthy individuals if its intensity is above Cut-1. The left branch is further divided using Feature 2 into a diseased node (D1) that contains 38 diseased and 3 healthy individuals and a healthy node (H1) that contains 2 diseased and 17 healthy individuals. The right branch is divided using Feature 3 into a healthy (H2) and a diseased (D2) terminal node.

Fig. 4. (a) Hypothetical decision tree using all available data and (b) the corresponding tree when one-third of the samples are removed as testing data.

Overall, this decision tree would yield a sensitivity and a specificity of 90%, but the general procedure is to divide the data into a training set and a testing set and construct the classifier using only the training data. If one-third of the data were removed to form the testing data, the situation in Figure 4b could be produced. In this example, 16 of the 20 healthy testing samples happened to come from H1 and 16 of the 20 diseased testing samples from D1. This training distribution would make the use of Feature 2 unnecessary and may result in different features being used at each node. If only Features 1 and 3 were used, the training set would have a sensitivity of 90% and a specificity of 82.5%, while the testing data would have a sensitivity of 100% but a specificity of only 20%. The basic reason for this large drop in specificity is that the fingerprint needed to describe the healthy subjects in Group H1 is no longer present in the training data.
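The percentages quoted for Figure 4 can be checked with a short calculation. The text gives the counts for nodes D1 and H1 only, so the counts for D2 and H2, and the origin of the remaining four testing samples in each category, are assumptions chosen here because they reproduce the reported values.

```python
# Node counts as (diseased, healthy). D1 and H1 come from the text;
# D2 and H2 are ASSUMED values consistent with 90% sensitivity and specificity.
full = {"D1": (38, 3), "H1": (2, 17), "D2": (16, 3), "H2": (4, 37)}

# Samples removed as testing data. 16 diseased from D1 and 16 healthy from H1
# come from the text; the other 4 + 4 are ASSUMED to come from D2 and H2.
test = {"D1": (16, 0), "H1": (0, 16), "D2": (4, 0), "H2": (0, 4)}
train = {node: (full[node][0] - test[node][0], full[node][1] - test[node][1])
         for node in full}


def sens_spec(counts, called_diseased):
    """Sensitivity and specificity when the listed nodes are called diseased."""
    tp = sum(d for node, (d, h) in counts.items() if node in called_diseased)
    fp = sum(h for node, (d, h) in counts.items() if node in called_diseased)
    fn = sum(d for node, (d, h) in counts.items() if node not in called_diseased)
    tn = sum(h for node, (d, h) in counts.items() if node not in called_diseased)
    return tp / (tp + fn), tn / (tn + fp)


three_feature_tree = {"D1", "D2"}        # Features 1, 2 and 3
two_feature_tree = {"D1", "H1", "D2"}    # no Feature 2: whole left branch called diseased

print("all data, three features:", sens_spec(full, three_feature_tree))
print("training, two features:  ", sens_spec(train, two_feature_tree))
print("testing, two features:   ", sens_spec(test, two_feature_tree))
```

Under these assumed counts, the healthy individuals in H1 are almost absent from the training data, so a tree without Feature 2 labels the whole left branch as diseased and misclassifies the 16 healthy testing samples drawn from H1.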

Therefore, the first requirement of a fingerprinting method is that there must be a complete *coverage* of all required fingerprints in the training data. If a required fingerprint or proteomic pattern is missing from the training data (Figure 4), the quality of predictions for the testing data will either be greatly reduced or there will be a significant number of testing individuals that will receive an "undetermined" classification.

If a fingerprinting classifier is found that performs extremely well on classifying the training data, but classifies the testing data poorly, one can either state that the classifier is insufficient and therefore not biologically relevant, or that there was an incorrect separation of training and validation data so that effective coverage of all important fingerprints was not present in the training data. Since the discriminating fingerprints are not known, proper coverage cannot be known, and therefore proper selection of the training data cannot be known. In addition, since the quality of classifying the testing set is the metric used to determine biological relevance, the testing set is used in the process of constructing the classifier and is therefore part of the training process.

With these points in mind, an effective way to construct classifiers based on fingerprints is to include all data in the search for fingerprinting classifiers and then to selectively remove samples for the testing set in a way that preserves the coverage of the fingerprint in the training data. This statement does not suggest, in any way, that this procedure is used by other research groups who present fingerprinting classifiers; it simply states that this method is an effective way to ensure complete coverage in the training data and to effectively test for *uniqueness*. If Figure 4 were used as the basic classifier, all other possible three-node decision trees would have to be constructed and compared to a sensitivity and specificity of 90%. If no other three-node decision tree is found to have this overall accuracy, then the uniqueness of this classifier is established. Otherwise, each decision tree would have to be presented as a possible solution; since the important fingerprints are not known, the selection of the training set cannot be determined, and two different decision trees that imply different separations of training and validation data are therefore equally valid.

Finally, the *significance* of a fingerprinting classifier needs to be established. Permutation testing is often used to test significance, but it can be used in three different ways. In the Random Forest algorithm [Breiman, 2001], the intensities of a given feature are scrambled among all data in each testing set (i.e. the out-of-bag samples) to determine the importance of that feature. The phenotypes of the samples can also be scrambled a large number of times to determine the probability that the accuracy of a given classifier occurred by chance. In this application, the phenotypes will be scrambled amongst all data to determine if a new classifier of the same form (e.g. a three-node decision tree) can be constructed with comparable accuracy. The probability that random phenotypes can be classified to a given accuracy determines the significance of a given model.

#### **1.8 Proposed study**

To test the classification ability of different algorithms, this study will attempt to build classifiers from sets of 300 possible features. In each case, the intensities of the features will be determined using a random number generator. In other words, each classifier will attempt to distinguish healthy samples from diseased ones using data that contains no information. Results using DT and MCA classifiers have been previously presented [Luke & Collins, 2008], but these will be included here to compare to SVM, LDA, and biomarker-based classifiers. Exhaustive searches using DT, MCA, SVM and LDA are not computationally feasible, so the results presented here represent a lower bound to the accuracy that can be obtained from these methods with data that contains no information. It should also be stressed that 300 features is a very small number by current methods of analysis of biological samples, and the accuracy of all methods will not decrease as the number of features increases.

**2. Methods**

**2.1 Decision tree**

For the symmetric, 7-node decision tree shown in Figure 1, a modified Evolutionary Programming (mEP) procedure is used. Each putative decision tree classifier is represented by two 7-element arrays; the first contains the feature used at each node and the second contains the cut values. Both arrays assume the node ordering listed in Figure 1. The only caveats are that all seven features must be different and that this ordered septet of features cannot be the same as any other putative solution in either the parent or offspring populations. When a new putative decision tree is formed, a local search is used to find optimum cut points for this septet of features.

The mEP procedure starts by randomly generating 2000 unique decision trees. To generate an offspring, each decision tree has one or two of its features removed and replaced by newly selected features, again requiring that the final septet is unique. The local search first tries to find optimum cut points for the new features that were added, and then the search is over all seven cut points. The best set of cut points is combined with the septet of features to represent an offspring classifier. The score is the sum of the sensitivity and specificity for the training individuals over the eight terminal nodes. When the entire set of initial, or parent, decision trees has generated unique offspring, all 4000 scores are compared and the 2000 decision trees with the best score become parents for the next generation. This process is repeated for a total of 4000 generations, and the best classifiers in the final population are examined.

**2.2 Medoid classification algorithm**

While the algorithm described by Petricoin and Liotta [Browers et al., 2005; Conrads et al., 2004; Ornstein et al., 2004; Petricoin et al., 2004; Srinivasan et al., 2006; Stone et al., 2005] used a genetic algorithm driver to search for an optimum set of features, allowing for different putative solutions to use different numbers of features (5-20 features), our algorithm uses an mEP feature selection algorithm and all putative solutions have the same number of features *n*. For a given value of *n*, *n* features were selected and the intensities of these features were rescaled for each individual using the following formula [Browers et al., 2005; Conrads et al., 2004; Ornstein et al., 2004; Petricoin et al., 2004; Srinivasan et al., 2006; Stone et al., 2005]:

$$I' = \frac{I - I_{\min}}{I_{\max} - I_{\min}} \qquad (1)$$

In this equation, $I$ is a feature's original intensity, $I'$ is its scaled intensity, and $I_{\min}$ and $I_{\max}$ are the minimum and maximum intensities found for the individual among the *n* selected features.
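The rescaling in Equation (1) is applied within each individual, across that individual's *n* selected features. A minimal sketch is given below; the function name and the intensity values are hypothetical, and the handling of the case where all intensities are equal is an assumption, since the text does not address it.

```python
def rescale_individual(intensities):
    """Apply Equation (1): I' = (I - Imin) / (Imax - Imin) for one individual.

    `intensities` holds that individual's values for the n selected features.
    """
    i_min, i_max = min(intensities), max(intensities)
    if i_max == i_min:
        # ASSUMPTION: the source does not say what to do when all n values are equal.
        raise ValueError("cannot rescale: all intensities are identical")
    return [(i - i_min) / (i_max - i_min) for i in intensities]


# Hypothetical raw intensities of n = 4 selected features for one individual.
individual = [120.0, 30.0, 75.0, 210.0]
scaled = rescale_individual(individual)
print(scaled)
```

After rescaling, each individual's smallest selected intensity maps to 0 and the largest to 1, so individuals are compared on the relative pattern of the selected features and not on absolute signal.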
