**1. Introduction**

Binary classification, compared with multi-class classification, has a wide range of real-world applications across many areas of human endeavor, such as criminal justice, education, medicine, email analysis, human resources management, pattern recognition, energy and environmental management, financial data analysis and economics, production systems management and technical diagnosis, and marketing, among others. When the classification criterion comprises one or several predictor variables along with a categorical criterion, such a prediction requires the use of predictive discriminant analysis (PDA). PDA remains the optimal method when the costs of misclassifying groups are clearly different and when there is greater interest in the accuracy of classifying the separate groups. In most cases, evaluating the proportion of correct classification of a predictive discriminant function (PDF) in all subpopulations is equivalent to estimating the actual hit rate, *P*(*a*) [1, 2]. That is, *P*(*a*) is the expected proportion of correct classification when a PDF built from a given training sample is validated on samples from the same population. In PDA, to improve or optimize the classification accuracy (actual hit rate), researchers often rely on feature selection methods. The aim of feature selection in PDA is to choose the best subset of important predictor variables, thereby reducing the complexity of the PDF, facilitating interpretation, enhancing or optimizing classification accuracy, and reducing training time. Nevertheless, the promise of optimizing classification accuracy through variable selection is almost always unfulfilled, because the derived PDF is often obtained from a training sample that does not meet near-optimal conditions [1, 3–6]. The actual hit rate of a PDF may be considered statistically optimal only if the assumptions of normality and/or homogeneity of variances are taken into account [5, 7]. This means that having a better subset of predictors is no guarantee of achieving statistically optimal classification accuracy.
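To make the notion of the actual hit rate concrete, the following sketch builds a PDF on a training sample and validates it on a hold-out sample drawn from the same population. It uses simulated two-group normal data and scikit-learn's `LinearDiscriminantAnalysis`; all sample sizes, group means, and variable names are illustrative assumptions, not values from the studies cited above.

```python
# Illustrative sketch: estimating the hit rate of a predictive
# discriminant function (PDF) on a hold-out validation sample.
# The two-group normal data below are simulated for demonstration only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two multivariate-normal groups with different means (assumed setup)
n_per_group, n_predictors = 200, 4
X0 = rng.normal(loc=0.0, scale=1.0, size=(n_per_group, n_predictors))
X1 = rng.normal(loc=1.0, scale=1.0, size=(n_per_group, n_predictors))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_group + [1] * n_per_group)

# Build the PDF on a training sample, validate on a hold-out sample
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

pdf = LinearDiscriminantAnalysis().fit(X_train, y_train)

# The hold-out proportion of correct classification approximates P(a)
hit_rate = pdf.score(X_valid, y_valid)
print(f"estimated hit rate: {hit_rate:.3f}")
```

Because the hold-out sample comes from the same population as the training sample, its proportion of correct classification serves as an estimate of *P*(*a*) for this PDF.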

In general, the task of enhancing classification accuracy has been examined in two ways. First, several researchers use feature or variable selection techniques to select the best subset of predictors with which to construct a classification model. In addition to conventional feature selection techniques such as the stepwise and all-possible-subset methods [4, 8, 9], some widely used methods include principal component analysis (PCA), used to obtain a set of low-dimensional features from a large set of features [10, 11]; the branch and bound technique, which uses a greedy procedure to obtain the best subset [12]; the genetic search algorithm [13, 14]; shrinkage methods [10, 15]; the particle swarm optimization (PSO) approach, a meta-heuristic technique used to enhance classification accuracy [16]; representative methods based on dictionary learning (DL) for classification [17–19]; support vector machines (SVMs) [20]; the hyperparameter tuning approach [21, 22]; the sequential analysis approach [23]; heteroscedastic discriminant analysis merged with feature selection [24]; and the modified leave-one-out cross-validation (LOOCV) method used as an alternative to the all-possible-subset method [25]. A PDF's classification accuracy is statistically optimal only if each group sample is normally distributed with different group means and each predictor's variance is similar across the groups [7]. None of the above methods considers these basic assumptions regarding the validity and reliability of the PDF. To address these gaps in feature selection techniques, other investigators seek robust alternatives to classical PDA by replacing conventional estimators with robust estimators.
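The all-possible-subset idea mentioned above can be sketched as follows: enumerate every non-empty subset of predictors, score each with a cross-validated discriminant-function hit rate, and keep the best. The simulated data, the choice of 5-fold cross-validation, and the use of LDA as the classifier are illustrative assumptions; the cited studies apply their own data and selection criteria.

```python
# Illustrative sketch: all-possible-subset predictor selection for a
# linear discriminant function, scored by cross-validated hit rate.
# Data and scoring choices are for demonstration only.
from itertools import combinations

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

n, p = 150, 5
X = rng.normal(size=(n, p))
# Only predictors 0 and 2 carry group information (assumed structure)
y = (X[:, 0] + X[:, 2] + rng.normal(scale=0.8, size=n) > 0).astype(int)

best_subset, best_score = None, -np.inf
for k in range(1, p + 1):
    for subset in combinations(range(p), k):
        # Mean 5-fold cross-validated hit rate for this subset
        score = cross_val_score(
            LinearDiscriminantAnalysis(), X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = subset, score

print(f"best subset: {best_subset}, CV hit rate: {best_score:.3f}")
```

Note that this exhaustive search evaluates 2^p − 1 subsets, which is why greedy alternatives such as stepwise selection and branch and bound are attractive when p is large; and, as the text emphasizes, even the best subset found this way is not statistically optimal unless the normality and variance-homogeneity assumptions hold.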
Some variants of these alternative methods include dimensionality reduction/feature extraction for outlier detection (DROUT) [26], the minimum covariance determinant (MCD) [27], *S*-estimators [28], the one-step *M*-estimator (MOM), and the winsorized one-step *M*-estimator (WMOM) [29]. These methods concentrate on building a PDF that is robust to deviations caused by the presence of outliers in the training sample. Besides the presence of outliers in most training samples, there are also hidden
