Step 4. Estimate, for each *ci*, *i* = 1, 2, …, *k*, the performance of a classifier that uses all the features whose coefficient of variation is less than or equal to *ci*, using cross-validation such as leave-one-out cross-validation.

Step 5. Find the smallest *ci* from *c1, c2,…,ck* which gives the best performance in Step 3.

Step 6. Choose all the features from the current feature set whose coefficients of variation are less than or equal to *ci* as the feature subset for this selection cycle.

The selection process is repeated until only one feature remains.

**2.4 Linear Discriminant Analysis**

Linear Discriminant Analysis (LDA) [Fukunaga, 1990] is a supervised learning algorithm that finds the linear combination of features that maximizes the between-class scatter and simultaneously minimizes the within-class scatter to achieve maximum discrimination in a dataset. The within-class scatter matrix may become singular if the sample size is smaller than the dimensionality of the search space (number of features). To overcome the singularity problem, the pseudo-inverse [Golub & Van Loan, 1983] of the within-class scatter matrix is computed in this study.

The computation of a pseudo-inverse in LDA may be demanding if the dimension of the within-class scatter matrix is too large. In this study diagonal LDA is used, which is the same as LDA except that the covariance matrices are assumed to be diagonal. Diagonal LDA has been reported to perform remarkably well compared to more sophisticated methods [Dudoit et al., 2002]. A leave-one-out procedure is again used to determine the coefficient of variation for all remaining features, and the procedure outlined above is used to reduce the feature set.

**2.5 Biomarker Discovery Kit**

The BioMarker Discovery Kit (BMDK) represents a suite of programs with the eventual goal of constructing one or more biomarker-based classifiers. Each biomarker represents a particular feature that is associated with a particular disease state represented by a subset of the available individuals. BMDK uses 10 different methods of analysis to identify putative biomarkers. These methods determine how well each feature distinguishes some or all of the individuals in a given histology. Descriptions of each filtering method are given elsewhere [http://isp.ncifcrf.gov/abcc/abcc-groups/simulation-and-modeling/biomarkerdiscovery-kit/]. The union of all features that have one of the top five scores for each of the 10 methods produces the set of putative biomarkers.

A single biocompound may produce more than one putative biomarker if the features are obtained from a mass spectroscopic investigation. For example, separate peaks for the +1 and +2 ion, or for the biocompound alone and complexed with another compound, are possible. Therefore, Pearson's correlation coefficient between all pairs of putative biomarkers across all samples is used to combine the putative biomarkers into groups. All other features in the dataset are then compared to the putative biomarkers within each group and are selected for examination if the correlation coefficient is 0.70 or higher. Each group is then represented by the single feature with the largest maximum value; all other features are discarded.

**3. Results**

Since the datasets examined in this investigation are produced using a random number generator, the goal is simply to determine a lower bound on the accuracy that can be obtained for 300 features for different numbers of Cases and Controls. Since these labels really have no meaning, the accuracy of a classifier will be given by the sum of the sensitivity and specificity. These are lower bounds since only five different datasets are examined for each Case/Control combination, and for all methods but BMDK only a small fraction of all possible feature combinations is explored.
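
As an illustration of the accuracy measure described above, the following minimal Python sketch computes the sum of sensitivity and specificity from true and predicted labels. The function name `sensitivity_plus_specificity` and the encoding (1 for Case, 0 for Control) are assumptions made for this example and are not taken from the study's software.

```python
def sensitivity_plus_specificity(y_true, y_pred):
    """Return sensitivity + specificity for binary labels.

    Assumed encoding (illustrative only): 1 = Case, 0 = Control.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both Cases and Controls must be present")

    sensitivity = tp / (tp + fn)  # fraction of Cases correctly identified
    specificity = tn / (tn + fp)  # fraction of Controls correctly identified
    return sensitivity + specificity


# Hypothetical labels: 4 Cases followed by 6 Controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

perfect = sensitivity_plus_specificity(labels, labels)
all_control = sensitivity_plus_specificity(labels, [0] * len(labels))
```

With this measure a perfect classifier scores 2.0, while a classifier that assigns every sample to a single class scores 1.0 regardless of how many Cases and Controls are present.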
