#### **2.1 Decision tree**

For the symmetric, 7-node decision tree shown in Figure 1, a modified Evolutionary Programming (mEP) procedure is used. Each putative decision tree classifier is represented by two 7-element arrays; the first contains the feature used at each node and the second contains the corresponding cut values. Both arrays assume the node ordering listed in Figure 1. The only caveats are that all seven features must be different and that this ordered septet of features cannot be the same as that of any other putative solution in either the parent or offspring populations. When a new putative decision tree is formed, a local search is used to find optimum cut points for this septet of features.

The mEP procedure starts by randomly generating 2000 unique decision trees. To generate an offspring, one or two of the features in each decision tree are removed and replaced with randomly selected features, again requiring that the final septet is unique. The local search first finds optimum cut points for the newly added features and then searches over all seven cut points. The best set of cut points is combined with the septet of features to represent an offspring classifier. The score is the sum of the sensitivity and specificity for the training individuals over the eight terminal nodes. When the entire set of initial, or parent, decision trees has generated unique offspring, all 4000 scores are compared and the 2000 decision trees with the best scores become parents for the next generation. This process is repeated for a total of 4000 generations, and the best classifiers in the final population are examined.
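To make the representation concrete, the following is a minimal Python sketch of one way to encode and score such a classifier. The breadth-first node numbering, the 0/1 class labels, and the `leaf_labels` array (which assigns a healthy or diseased class to each of the eight terminal nodes, as the chapter's scoring implies) are illustrative assumptions, not details taken from the source.

```python
import numpy as np

def classify(tree_features, tree_cuts, x):
    """Route one sample through the symmetric 7-node tree.

    Nodes are stored breadth-first: node 0 is the root, nodes 1-2 its
    children, nodes 3-6 the grandchildren; leaving a grandchild yields
    one of the eight terminal nodes (numbered 0..7).
    """
    node = 0
    for _ in range(3):                       # three levels of internal nodes
        go_right = x[tree_features[node]] > tree_cuts[node]
        node = 2 * node + 1 + int(go_right)  # breadth-first child index
    return node - 7                          # terminal node 0..7

def tree_score(tree_features, tree_cuts, X, y, leaf_labels):
    """Score a tree as sensitivity + specificity over the terminal nodes."""
    y = np.asarray(y)
    pred = np.array([leaf_labels[classify(tree_features, tree_cuts, x)]
                     for x in X])
    sens = np.mean(pred[y == 1] == 1)   # diseased samples called diseased
    spec = np.mean(pred[y == 0] == 0)   # healthy samples called healthy
    return sens + spec
```

The local search over cut values described above would repeatedly call `tree_score` while varying `tree_cuts` for a fixed feature septet.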

#### **2.2 Medoid classification algorithm**

While the algorithm described by Petricoin and Liotta [Browers et al., 2005; Conrads et al., 2004; Ornstein et al., 2004; Petricoin et al., 2004; Srinivasan et al., 2006; Stone et al., 2005] used a genetic algorithm driver to search for an optimum set of features, allowing different putative solutions to use different numbers of features (5-20), our algorithm uses an mEP feature selection algorithm and all putative solutions have the same number of features, *n*. For a given value of *n*, *n* features are selected and the intensities of these features are rescaled for each individual using the following formula [Browers et al., 2005; Conrads et al., 2004; Ornstein et al., 2004; Petricoin et al., 2004; Srinivasan et al., 2006; Stone et al., 2005]:

$$I' = \frac{I - I_{\min}}{I_{\max} - I_{\min}}\tag{1}$$

In this equation, I is a feature's original intensity, I' is its scaled intensity, and Imin and Imax are the minimum and maximum intensities found for the individual among the *n* selected features, respectively. If Imin and Imax were from the same features in all samples, a baseline intensity would be subtracted and the remaining values scaled so that the largest intensity was 1.0. Each individual would then be represented as a point in an (*n*-2)-dimensional unit cube. As designed, and as found in practice, Imin and Imax do not represent the same features from one individual to the next, so this interpretation does not hold. Therefore, each individual represents a point in an *n*-dimensional unit cube.
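As a concrete illustration, here is a minimal Python sketch of the per-individual rescaling in Eq. (1); the function name and the array-based layout are assumptions for illustration, not taken from the chapter.

```python
import numpy as np

def rescale(intensities):
    """Eq. (1): I' = (I - Imin) / (Imax - Imin) for one individual.

    `intensities` holds the raw intensities of the n selected features for
    a single individual; Imin and Imax are taken over those n values, so
    each individual is mapped into the n-dimensional unit cube, with its
    own minimum landing on 0.0 and its own maximum on 1.0.
    """
    intensities = np.asarray(intensities, dtype=float)
    return (intensities - intensities.min()) / (intensities.max() - intensities.min())
```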


As stated in the Background, the first training sample becomes the medoid of the first cell, with this cell being classified as the category of this sample. Each cell has a constant trust radius *r*, which is set to 0.1√*n*, or ten percent of the maximum theoretical separation in this unit hypercube. If the second sample is within *r* of the first, it is placed in the first cell; otherwise it becomes the medoid of the second cell and that cell is characterized by the second sample's category. This iteration continues until all training samples are processed. Each cell is then examined and the categories of all samples in the cell are compared to the cell's classification. This calculation allows a sensitivity and specificity to be determined for the training data, and their sum represents the score for this set of *n* features.
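The cell construction can be sketched as follows; assigning a sample to the *nearest* medoid within *r* is an assumption (the chapter only states that a sample within *r* is placed in the cell and does not give a tie-breaking rule).

```python
import numpy as np

def build_cells(X, y, r):
    """Greedy construction of medoid cells over the rescaled training data.

    X: samples as rows (points in the unit n-cube); y: their categories;
    r: the fixed trust radius 0.1 * sqrt(n). A sample joins the nearest
    existing cell whose medoid lies within r; otherwise it seeds a new
    cell that takes the sample's own category.
    """
    medoids, cell_class, assignment = [], [], []
    for x, label in zip(X, y):
        if medoids:
            dists = np.linalg.norm(np.array(medoids) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= r:
                assignment.append(nearest)   # join an existing cell
                continue
        medoids.append(x)                    # seed a new cell
        cell_class.append(label)
        assignment.append(len(medoids) - 1)
    return medoids, cell_class, assignment
```

Sensitivity and specificity then follow from comparing each training label `y[i]` against `cell_class[assignment[i]]`.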

The mEP algorithm initially selects 2000 sets of *n* randomly selected features. The only caveat is that each set of *n* features must be different from all previously selected sets. The medoid classification algorithm then determines the score for each set of features. Again, each parent set of features generates an offspring set by randomly removing one or two of the features and replacing them with randomly selected features, requiring that this set be different from all feature sets in the parent population and from all offspring generated so far. The score of this feature set is determined, and the score and feature set are stored in the offspring population. After all 2000 offspring have been generated, the parent and offspring populations are combined. The 2000 feature sets with the best scores are retained and become the parents for the next generation.
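A minimal sketch of this generational loop follows. The `score_fn` argument would wrap the medoid classifier above; the function names, the global uniqueness ledger `seen`, and the default parameter values are illustrative assumptions.

```python
import random

def mutate(parent, all_features, seen):
    """Replace one or two features, keeping the resulting set unique."""
    while True:
        child = set(parent)
        for f in random.sample(sorted(child), random.choice((1, 2))):
            child.remove(f)
            child.add(random.choice(
                [g for g in all_features if g not in child and g != f]))
        key = frozenset(child)
        if key not in seen:       # must differ from every set generated so far
            seen.add(key)
            return child

def evolve(all_features, n, score_fn, pop_size=2000, generations=100):
    """(mu + lambda) selection: keep the best pop_size of parents + offspring."""
    seen, parents = set(), []
    while len(parents) < pop_size:            # unique random initial sets
        cand = frozenset(random.sample(all_features, n))
        if cand not in seen:
            seen.add(cand)
            parents.append((score_fn(cand), set(cand)))
    for _ in range(generations):
        offspring = [(score_fn(c), c)
                     for c in (mutate(p, all_features, seen) for _, p in parents)]
        parents = sorted(parents + offspring, key=lambda t: -t[0])[:pop_size]
    return parents
```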

It should be noted that for a set of *n* features, the number of unique cells that can be generated is on the order of 10^*n*, since the trust radius permits roughly ten distinguishable cells along each dimension of the unit hypercube. Because no training set is ever this large (*n* is 5 or more, so there are at least 10^5 possible cells), only a small fraction of the possible cells will be populated and classified. As will be shown in the next section, this limitation causes a significant number of the testing samples to be placed in an unclassified cell, though none of the publications that used this method [Browers et al., 2005; Conrads et al., 2004; Ornstein et al., 2004; Petricoin et al., 2004; Srinivasan et al., 2006; Stone et al., 2005] reported an undetermined classification for any of the testing samples. Instead of searching through a large number of solutions that classified the training samples to a significant extent and finding those that minimized the number of unclassified testing samples, we decided to use all samples and limit the number of cells. All samples were placed in the training set and the algorithm was run with the added requirement that any set of *n* features that produced more than a selected number of cells was given a score of zero. If the number of healthy and disease medoids is sufficiently small, all other samples could then be divided to place the required number in the testing set, with the remainder forming the training set.

#### **2.3 Support Vector Machine**

A Support Vector Machine (SVM) [Boser et al., 1992; Vapnik, 1998] is a kernel-based learning system. An SVM searches for the optimal hyperplane that maximizes the margin of separation between the hyperplane and the closest data points on both sides of the hyperplane.


Many features in a genomic or proteomic dataset are irrelevant or redundant and are likely to hinder the performance of a classifier, so it is essential to select informative features when building one. A new selection criterion is presented here; its performance is found to be better than or comparable to other criteria, and it is applied with LDA and a linear SVM as the classification method.

Support Vector Machines are becoming increasingly popular in biological problems [Noble, 2004]. Instead of minimizing the empirical error risk, the parameters of an SVM are determined on the basis of structural risk minimization, so SVMs tend to overcome the overfitting problem. SVMs have been used successfully with a recursive procedure for selecting important features for cancer prediction [Guyon et al., 2002; Tang et al., 2007].

The decision functions (used to determine the class of a sample) of SVM and LDA can be expressed as a linear combination of features; the two methods differ in how the weights are determined. The weights (coefficients in the decision function), which reflect the significance of the features for classification, can serve as a feature ranking criterion. This criterion corresponds to removing the feature whose elimination changes the objective function least [LeCun et al., 1990], and it has been used with a recursive feature elimination scheme [Guyon et al., 2002], as described before.
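A minimal sketch of this weight-based ranking for a linear SVM follows; the chapter does not name a library, so the scikit-learn API is an assumption, as is the function name.

```python
import numpy as np
from sklearn.svm import SVC

def rank_features(X, y):
    """Rank features by the magnitude of their linear-SVM weights.

    The decision function is f(x) = w . x + b, so |w_j| reflects feature
    j's influence on the classification; the feature with the smallest
    |w_j| is the first candidate for elimination.
    """
    clf = SVC(kernel="linear").fit(X, y)
    weights = np.abs(clf.coef_).ravel()   # one weight per feature (two classes)
    return np.argsort(weights)            # ascending: least important first
```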

Instead of judging a feature by its contribution to the classification on the full dataset, this study uses leave-one-out cross-validation to evaluate a feature's contribution to an ensemble of classifiers. In other words, a classifier is re-trained on a new dataset formed by removing one sample from the original dataset, yielding a weight for every feature. If a feature is important in differentiating samples, it should remain so when any single sample is removed from the dataset. This can be indicated by the coefficient of variation of the weight value for each feature, defined as the ratio of the standard deviation to the mean. A small coefficient of variation indicates smaller variation and a more consistent contribution of a feature to the sample classification. There are two ways to incorporate this criterion into the recursive selection process. One is to pre-select the number of iterations and the number of features at each iteration. This can be implemented by determining the coefficient of variation for each feature in the current feature set and selecting the *k* features with the smallest coefficients of variation, where *k* is the predefined number of features for this iteration. In the second implementation, the number of iterations and the number of features at each iteration are determined during the selection process. It starts with all the features and can be described as follows:


Step 1. Compute the coefficient of variation for each feature in the current feature set. In every selection cycle, the procedure initially eliminates at least a certain number of features: in this study, 10% of the current features, or 1, whichever is larger, with the largest coefficients of variation are eliminated.

Step 2. Let *cmin* denote the minimum coefficient of variation and *cmax* denote the maximum coefficient of variation among the remaining features.

Step 3. Select *k* coefficients of variation, *c1, c2, …, ck*, such that *cmin < c1 < c2 < … < ck = cmax* and *c1, c2, …, ck* divide the interval [*cmin*, *cmax*] into *k* subintervals of equal length, except possibly the interval [*ck-1*, *cmax*]. In this study, we choose *k* = 8.




The selection process is repeated until only one feature remains.
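A minimal sketch of the leave-one-out weight computation and the elimination loop follows. The subinterval refinement of Steps 2 and 3 decides adaptively how many features to drop; since the chapter's elimination rule over the subintervals is not fully reproduced here, the sketch falls back on Step 1's baseline of dropping 10% of the current features (or at least one) per cycle. The linear SVM as the weight source and the scikit-learn API are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def loo_weight_cv(X, y):
    """Leave-one-out coefficient of variation of each feature's SVM weight.

    Retrain once per held-out sample, collect the weight vectors, and
    return |std / mean| of each feature's weight across the retrainings.
    """
    n_samples = X.shape[0]
    W = []
    for i in range(n_samples):
        keep = np.arange(n_samples) != i
        clf = SVC(kernel="linear").fit(X[keep], y[keep])
        W.append(clf.coef_.ravel())
    W = np.array(W)
    return np.abs(W.std(axis=0) / W.mean(axis=0))

def recursive_selection(X, y):
    """Repeat Step 1's baseline rule until only one feature remains."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > 1:
        cv = loo_weight_cv(X[:, remaining], y)
        n_drop = max(1, len(remaining) // 10)  # 10% or 1, whichever is larger
        worst = set(np.argsort(cv)[-n_drop:])  # largest coefficients of variation
        remaining = [f for j, f in enumerate(remaining) if j not in worst]
    return remaining
```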

#### **2.4 Linear Discriminant Analysis**

Linear Discriminant Analysis (LDA) [Fukunaga, 1990] is a supervised learning algorithm that finds the linear combination of features that maximizes the between-class scatter and simultaneously minimizes the within-class scatter, achieving maximum discrimination in a dataset. The within-class scatter matrix may become singular if the sample size is smaller than the dimensionality of the search space (the number of features). To overcome this singularity problem, the pseudo-inverse [Golub & Van Loan, 1983] of the within-class scatter matrix is computed in this study.

The computation of a pseudo-inverse in LDA may be demanding if the dimension of the within-class scatter matrix is too large. In this study diagonal LDA is used, which is the same as LDA except that the covariance matrices are assumed to be diagonal. Diagonal LDA has been reported to perform remarkably well compared to more sophisticated methods [Dudoit et al., 2002]. A leave-one-out procedure is again used to determine the coefficient of variation for all remaining features, and the procedure outlined above is used to reduce the feature set.
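For two classes, diagonal LDA reduces to a per-feature computation, as the following sketch shows; the equal-priors threshold at the midpoint of the class means is an assumption for illustration.

```python
import numpy as np

def diagonal_lda(X, y):
    """Weight vector and threshold point of two-class diagonal LDA.

    With diagonal covariance, feature j's weight is
    (mean1_j - mean0_j) / pooled_var_j; the diagonal assumption avoids
    inverting (or pseudo-inverting) the full within-class scatter matrix.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled within-class variance per feature (diagonal of the scatter)
    var = (((X0 - m0) ** 2).sum(axis=0) + ((X1 - m1) ** 2).sum(axis=0))
    var /= (len(X0) + len(X1) - 2)
    return (m1 - m0) / var, (m0 + m1) / 2

def predict(X, w, midpoint):
    """Class 1 if the discriminant is positive, else class 0 (equal priors)."""
    return (np.dot(X - midpoint, w) > 0).astype(int)
```

The leave-one-out coefficient-of-variation procedure described above applies to these LDA weights exactly as it does to the SVM weights.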

#### **2.5 Biomarker Discovery Kit**

The BioMarker Discovery Kit (BMDK) is a suite of programs whose eventual goal is the construction of one or more biomarker-based classifiers. Each biomarker is a particular feature that is associated with a particular disease state represented by a subset of the available individuals. BMDK uses 10 different methods of analysis to identify putative biomarkers; these methods determine how well each feature distinguishes some or all of the individuals in a given histology. Descriptions of each filtering method are given elsewhere [http://isp.ncifcrf.gov/abcc/abcc-groups/simulation-and-modeling/biomarkerdiscovery-kit/]. The union of all features that have one of the top five scores for each of the 10 methods produces the set of putative biomarkers.
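The union step can be sketched as follows; the dictionary-based data layout is an assumption (the 10 BMDK filters themselves are described at the URL above), and the sketch takes the five top-scoring features per method as its reading of "top five scores."

```python
def putative_biomarkers(scores_by_method, top=5):
    """Union of the top-scoring features across all filtering methods.

    `scores_by_method` maps each method name to a {feature: score} dict;
    any feature holding one of the `top` scores for any method is kept.
    """
    selected = set()
    for scores in scores_by_method.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:top])       # top-scoring features this method
    return selected
```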

A single biocompound may produce more than one putative biomarker if the features are obtained from a mass spectrometric investigation; for example, separate peaks for the +1 and +2 ions, or for the biocompound alone and complexed with another compound, are possible. Therefore, Pearson's correlation coefficient between all pairs of putative biomarkers across all samples is used to combine the putative biomarkers into groups. All other features in the dataset are then compared to the putative biomarkers within each group and are selected for examination if the correlation coefficient is 0.70 or higher. Each group is then represented by the single feature with the largest maximum value; all other features are discarded.
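The grouping step might look like the following sketch. Joining a group when a biomarker correlates at 0.70 or higher with *any* existing member (single linkage) is an assumption; the chapter does not state the linkage rule.

```python
import numpy as np

def group_biomarkers(X, biomarkers, threshold=0.70):
    """Group putative biomarkers by Pearson correlation across samples.

    X is a samples-by-features matrix; `biomarkers` lists the column
    indices of the putative biomarkers. A biomarker joins the first group
    containing a member with which it correlates at >= threshold;
    otherwise it starts a new group.
    """
    groups = []
    for b in biomarkers:
        placed = False
        for g in groups:
            if any(np.corrcoef(X[:, b], X[:, m])[0, 1] >= threshold
                   for m in g):
                g.append(b)
                placed = True
                break
        if not placed:
            groups.append([b])
    return groups
```

Selecting each group's representative then amounts to keeping the member feature with the largest maximum value, as the text describes.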

The final classifier is based on a distance-dependent K-nearest neighbor (DD-KNN) algorithm. In this classifier the un-normalized probability that an unknown sample belongs to the same group as a neighbor is given by the inverse of their distance. To account for the situation where an unknown sample has no near neighbors, the classifier also carries a probability that the sample's group is unknown. This probability increases linearly from 0.0 to 0.1 as the probability of being in a neighbor's group decreases from 1.0 to 0.8; for smaller probabilities of belonging to the neighbor's group, the probability of being unknown stays constant at 0.1. These probabilities are summed over all neighbors and scaled to a total probability of 1.0. Therefore, each unknown sample is described by a probability of belonging to Group-1 (e.g. healthy), Group-2 (e.g. diseased), or Undetermined. The final classification is given by a probability of membership of at least 0.5, or Undetermined if the probabilities of belonging to either group are less than 0.5.
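The chapter does not spell out how the inverse-distance weights become per-neighbor probabilities in [0, 1] before the ramp is applied; the sketch below normalizes them across the k neighbors, which is only one possible reading, and all names are illustrative.

```python
import numpy as np

def unknown_prob(p):
    """Ramp from 0.0 at p = 1.0 to 0.1 at p = 0.8; constant 0.1 below 0.8."""
    return 0.1 if p <= 0.8 else 0.5 * (1.0 - p)

def dd_knn(x, X_train, y_train, k=6):
    """DD-KNN call for one unknown sample: Group-1, Group-2, or Undetermined."""
    d = np.maximum(np.linalg.norm(X_train - x, axis=1), 1e-12)
    idx = np.argsort(d)[:k]                 # the k nearest neighbors
    w = 1.0 / d[idx]                        # inverse-distance group weights
    p = w / w.sum()                         # per-neighbor membership probability
    probs = {"group1": 0.0, "group2": 0.0, "undetermined": 0.0}
    for pi, i in zip(p, idx):
        probs["group2" if y_train[i] == 1 else "group1"] += pi
        probs["undetermined"] += unknown_prob(pi)
    total = sum(probs.values())
    probs = {key: v / total for key, v in probs.items()}   # scale to 1.0
    for key in ("group1", "group2"):
        if probs[key] >= 0.5:               # membership needs at least 0.5
            return key, probs
    return "undetermined", probs
```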

All of the putative biomarkers are individually used to find the best 1-feature DD-KNN algorithm, and this is followed by an exhaustive search over all sets of two and three putative biomarkers. In practice, six nearest neighbors are generally used, but this number can be increased if there are a large number of samples; the number of neighbors should not decrease below six. The quality of the classifier is determined using a leave-one-out procedure since this method preserves the coverage (range of intensities) for the samples to the greatest extent. Each time an optimum classifier is found, the distribution of samples in feature-space is plotted to determine the number of disease states present for each category.

#### **3. Results**

#### **3.1 DT and MCA classifiers**

Since the datasets examined in this investigation are produced using a random number generator, the goal is simply to determine a lower bound on the accuracy that can be obtained for 300 features for different numbers of Cases and Controls. Since these labels really have no meaning, the accuracy of a classifier will be given by the sum of the sensitivity and specificity. These are lower bounds since only five different datasets are examined for each Case/Control combination, and for all methods but BMDK only a small fraction of all possible feature combinations are explored.

For the DT and MCA classifiers, it is assumed that the dataset is divided such that two-thirds of the samples in each group are used in the training set and one-third is used in the testing set. The accuracy of the classifier is the sum of the sensitivity and specificity of the testing set. For the DT algorithm, all samples are used in the construction of the decision tree. After the best decision tree is constructed from the evolutionary programming search over ordered sets of seven features, one-third of the samples are removed to build the testing set. This is done in a way that does not change the description of each terminal node (i.e. it stays as either a healthy or diseased node) and keeps the sensitivity and specificity of the training and testing sets approximately equal. This may appear to be cheating, but the goal of this investigation is to determine the minimum accuracy that could be obtained from data that contains no information.

The best quality from the five datasets for each number of Cases and Controls is given in Table 1. For 30 Cases and 30 Controls, a 7-node decision tree was able to correctly