194 Biomarker

classify all 60 samples. In other words, each of the eight terminal nodes contained samples from only a single group. For 60 Cases and 60 Controls, a decision tree correctly classifies over 89% of the testing samples, and for 90 Cases and 90 Controls, over 83% of the testing samples were correctly classified. It is only when the number of samples is as large as or larger than the number of features that the DT classifier drops below 80% accuracy for the testing data.
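The ease with which a tree separates uninformative data can be seen in a short simulation. The following is a minimal sketch (hypothetical code, not the EPDT implementation referenced in the chapter): it grows a greedy Gini-impurity tree to purity on 30 Cases and 30 Controls described by 300 random features. Because there are far more features than samples, the tree always fits the training data perfectly.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(X, y):
    """Greedily split on the (feature, threshold) pair with the lowest
    weighted Gini impurity until every terminal node is pure."""
    if len(set(y)) == 1:
        return ("leaf", y[0])
    best = None  # (score, feature, threshold)
    for f in range(len(X[0])):
        vals = sorted(set(row[f] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (gini(left) * len(left) + gini(right) * len(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    _, f, t = best
    L = [i for i, row in enumerate(X) if row[f] <= t]
    R = [i for i, row in enumerate(X) if row[f] > t]
    return ("node", f, t,
            grow_tree([X[i] for i in L], [y[i] for i in L]),
            grow_tree([X[i] for i in R], [y[i] for i in R]))

def predict(tree, row):
    while tree[0] == "node":
        _, f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree[1]

random.seed(1)
X = [[random.random() for _ in range(300)] for _ in range(60)]  # pure noise
y = [0] * 30 + [1] * 30  # 30 Controls, 30 Cases
tree = grow_tree(X, y)
train_acc = sum(predict(tree, r) == lab for r, lab in zip(X, y)) / len(y)
```

Because every terminal node is grown to purity, the training accuracy is 1.0 regardless of the fact that the features carry no information; only held-out testing data reveals the overfit.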

| Cases and Controls | DT | MCA 5-Feat | MCA 6-Feat | MCA 7-Feat |
|---:|---:|---:|---:|---:|
| 30 | 200.0 | 200.0 | 200.0 | 200.0 |
| 45 | 190.5 | 197.6 | 197.6 | 197.6 |
| 60 | 178.3 | 193.3 | 193.3 | 195.0 |
| 90 | 166.7 | 187.8 | 188.9 | 191.1 |
| 150 | 155.3 | 183.3 | 185.3 | 187.3 |
| 300 | 138.3 | 170.3 | 179.0 | 180.3 |

Table 1. Highest quality obtained using absolute differences in un-scaled peak intensities from a decision tree (EPDT) and the medoid classifier algorithm (MCA) with five, six, or seven features. *Note:* Taken from [Luke & Collins, 2008]; the quality is the sum of the sensitivity and specificity.

The classifier constructed by the MCA is order dependent in that the first training sample automatically becomes the medoid of the first region. Therefore, this analysis uses all samples to construct a classifier, with the requirement that the samples used as medoids cannot exceed two-thirds of either the Cases or the Controls. One-third of the samples from each group are then selected as testing samples, chosen such that the accuracy of the training set is at least as high as that of the testing set.

The results in Table 1 show that an MCA classifier performs excellently on datasets that contain no information. If only seven of the 300 features are used, one can find a classifier with an average sensitivity and specificity of over 90% even when there are 300 Cases and 300 Controls (200 of each in the training set and 100 of each in the testing set). If the number of features is reduced to five or six, the accuracy stays above 90% for all but the largest datasets.

**3.2 SVM and LDA classifiers**

The results for the SVM and LDA classifiers are shown in Table 2. As described above, all samples are used to determine which features will be used in the classifier, but the accuracy of the final classification uses the sum of the sensitivity and specificity of the testing set for 10-fold cross-validation, averaged over 100 runs where the order of the samples is scrambled before each run. SVM has an average classification accuracy over 97% when there are 90 or fewer Cases and Controls in the dataset. For LDA, the average accuracy stays above 99.7%.

For the larger datasets, the accuracy decreases significantly. When there are 150 Cases and 150 Controls, the SVM and LDA classifiers have an average accuracy of about 87% and 84%, respectively.
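The optimism that comes from selecting features on the full dataset before cross-validating can be reproduced in a few lines. The sketch below is hypothetical code that substitutes a simple nearest-centroid rule for SVM or LDA: the seven most group-separating of 300 pure-noise features are chosen using all 60 samples, and the subsequent leave-one-out accuracy lands comfortably above the 50% expected for uninformative data.

```python
import random

random.seed(7)
n_cases, n_ctrls, n_feat, k = 30, 30, 300, 7
X = [[random.gauss(0.0, 1.0) for _ in range(n_feat)]
     for _ in range(n_cases + n_ctrls)]
y = [1] * n_cases + [0] * n_ctrls  # 30 Cases, then 30 Controls: pure noise

def mean(v):
    return sum(v) / len(v)

# Rank features by |mean(Cases) - mean(Controls)| computed on ALL samples.
# Using the full dataset here is the information leak being demonstrated.
def gap(f):
    return abs(mean([X[i][f] for i in range(n_cases)]) -
               mean([X[i][f] for i in range(n_cases, n_cases + n_ctrls)]))

feats = sorted(range(n_feat), key=gap, reverse=True)[:k]

# Leave-one-out CV with a nearest-centroid rule on the pre-selected features.
correct = 0
for i in range(len(X)):
    train = [j for j in range(len(X)) if j != i]
    cent = {}
    for c in (0, 1):
        rows = [X[j] for j in train if y[j] == c]
        cent[c] = [mean([r[f] for r in rows]) for f in feats]
    dist = {c: sum((X[i][f] - m) ** 2 for f, m in zip(feats, cent[c]))
            for c in (0, 1)}
    correct += int((1 if dist[1] < dist[0] else 0) == y[i])
acc = correct / len(X)
```

In a sound protocol the ranking step would be repeated inside the cross-validation loop using only the 59 training samples of each fold, which returns the accuracy of noise data to roughly chance.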

**3.3 BMDK Classifier**

The accuracy of the BMDK classifier is shown in Table 3. This accuracy is determined from a leave-one-out cross-validation, a procedure that is known to exaggerate the accuracy of a classifier. After each sample is classified, the overall accuracy is the sum of the sensitivity and specificity minus the percentage of samples that were classified as unknown. For the smallest dataset (30 Cases and 30 Controls), a 3-feature DD-KNN classifier correctly classified 78.9% of the samples. In general, the accuracy decreased as the number of samples increased.


Table 3. Highest quality obtained from the BioMarker Development Kit (BMDK) using absolute differences in un-scaled peak intensities. *Note:* The quality is the sum of the sensitivity and specificity minus the percent of samples undetermined using a leave-one-out cross-validation of the entire set of data.
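The quality score used in these tables can be computed directly from per-sample calls. A minimal sketch follows (hypothetical code; the counts 47/60 and 44/60 in the usage example are inferred from the 78.3% and 73.3% reported in Fig. 5, not taken from the original data):

```python
def quality(y_true, y_pred):
    """BMDK-style quality: sensitivity% + specificity% minus the percentage
    of samples left undetermined.
    y_pred entries: 1 = Case, 0 = Control, None = undetermined."""
    case_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    ctrl_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    sens = 100.0 * case_preds.count(1) / len(case_preds)
    spec = 100.0 * ctrl_preds.count(0) / len(ctrl_preds)
    undet = 100.0 * y_pred.count(None) / len(y_pred)
    return sens + spec - undet

# Reproduce the Fig. 5 score: assume 47 of 60 cases and 44 of 60 controls
# were called correctly, with no sample left undetermined.
y_true = [1] * 60 + [0] * 60
y_pred = [1] * 47 + [0] * 13 + [0] * 44 + [1] * 16
q = quality(y_true, y_pred)
```

With these counts the score is 78.3 + 73.3 − 0.0 ≈ 151.7, matching the value quoted for the anomalous random peak.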

A Comparison of Biomarker and Fingerprint-Based Classifiers of Disease 197

The only exception was for one dataset with 60 Cases and 60 Controls. This increased accuracy was due to an unusual pattern in one of the randomly generated features. The intensities for this feature are shown in Figure 5, where the "+" marks in the left column are the intensities of the 60 samples in Group-1 and the marks in the right column are for the 60 samples in Group-2. While there is no overall difference between these columns, a closer examination shows a clumping of intensities in one group at values that have gaps in the other group.

Fig. 5. Intensities for the 60 cases (left column) and 60 controls (right column) for the peak that yielded a quality score of 151.7 (sensitivity=78.3%, specificity=73.3%, undetermined=0.0%) in the dataset of random peak intensities.

In many cases the accuracy of a 3-feature classifier is not significantly better than that of a 2-feature classifier. This is because, as the dimensionality of the classification space increases, the separation between the samples becomes larger, causing more samples to be classified as Undetermined. For this reason, no 4-feature classifier did better than the 3-feature classifier in any of the 30 datasets.

**4. Discussion**

The results presented in the tables above show that very good results can be obtained from DT, MCA, SVM, and LDA classifiers for datasets that contain no information. It can be argued that the procedures used here were selected to obtain the maximum possible accuracy, and that is exactly the point. If a 7-node decision tree used 40 Cases and 40 Controls in the training set and 20 Cases and 20 Controls in the testing set and obtained an accuracy of 87.5% for the testing samples, one could propose that the set of seven features denotes a fingerprint that accurately classifies the samples. The results in Table 1 show that this accuracy can be obtained from a dataset with only 300 randomly generated feature values for each sample. A 7-feature MCA classifier is able to achieve an average accuracy of over 90% when the dataset contains 300 Cases, 300 Controls, and only 300 non-informative features. This should draw into question the results of any study that uses this classification method.

SVM and LDA classifiers have testing set accuracies above 97.4% and 99.7%, respectively, for all but the largest datasets. It is only when the number of samples is at least as large as the number of features that these methods break down. Current methods for obtaining information from biological samples generate many more features than the 300 used here.

The BMDK classifier did not achieve an average accuracy above 80% for even the smallest dataset. This result is not unexpected. Since the datasets do not contain any information, there are no biomarkers, and a biomarker-based classifier should not perform well. Fortuitous results can still be obtained, and a closer examination of the putative biomarkers should be performed (Figure 5).

For the DT and MCA methods there is some selection of which samples should be placed in the training and testing sets, but this is basically what is required because of the coverage problem. If a given terminal node in a DT classifier contains 7 Cases and 4 Controls, and 4 of the Cases were moved to the testing set, this terminal node would change from a Case-node to a Control-node and the classification accuracy of the testing data would be decreased. The MCA classifier is based on the premise of a fingerprint that associates a sample in the testing set with a sample in the training set. If that sample were removed from the training set, the association could not be made and the accuracy of the classifier would be decreased.

**5. Conclusions**

The results presented here show that very good classification results can be obtained from DT, MCA, SVM, and LDA classifiers, even if the dataset contains no information. Studies using any of these methods should carefully examine whether the results are due to some underlying biology or are just fortuitous. Performing comparable examinations on randomly generated feature values, or performing analysis of the same data after the
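The label-scrambling check recommended in the conclusions is easy to automate. The sketch below is hypothetical code: it re-scores a dataset under random permutations of the Case/Control labels and reports the fraction of scrambled labelings that match or beat the observed score; a large fraction indicates the observed result could easily be fortuitous. The `mean_gap` scorer is a stand-in for whatever classifier quality measure a study actually uses.

```python
import random

def permutation_pvalue(evaluate, X, y, n_perm=200, seed=0):
    """Fraction of label-scrambled datasets whose score matches or beats the
    score on the true labels (small values suggest real signal)."""
    rng = random.Random(seed)
    observed = evaluate(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # scramble Case/Control labels
        hits += evaluate(X, y_perm) >= observed
    return (hits + 1) / (n_perm + 1)

def mean_gap(X, y):
    """Toy score: |difference of group means| on the first feature."""
    g0 = [row[0] for row, lab in zip(X, y) if lab == 0]
    g1 = [row[0] for row, lab in zip(X, y) if lab == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

random.seed(3)
X = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(40)]
y = [0] * 20 + [1] * 20  # uninformative labels on random data
p = permutation_pvalue(mean_gap, X, y)
```

On random data such as this, scrambled labels do about as well as the real ones, so the returned fraction is typically large; a genuinely informative dataset should yield a small value.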
