*4.2.1 Algorithm for the modified winsorization with graphical diagnostic (MW-GD)*

Let $D_N^t = [x_1, x_2, \dots, x_N] \in (\Re^P)^N$ be a training sample matrix comprising a set of $N$ observations $\{x_i, y_i\}_{i=1}^{n}$, obtained from a real dataset using any of the conventional feature selection techniques, where $x_i \in \{1, 2, \dots, P\}$ denotes the corresponding predictor variable label, $y_i \in \{1, 2, \dots, K\}$ denotes the corresponding group label, $P$ is the number of predictor variables, and $K$ is the number of groups.

*Step 1: Identification of the predictor variables with legitimate contaminants.*

For the training sample $D_N^t$, the scores of the predictor variables $X_1, \dots, X_P$ are first arranged in ascending order so that the extreme scores at both ends of each distribution can be identified. For each predictor variable in the training sample, a pair of scores (the most extreme score at each end of the distribution) is deleted and replaced with the median value, before the mean of the remaining scores is calculated. The median is used in order to satisfy the assumption of independence of all cases: the median is a positional average independent of all other cases, whereas the mean depends on all other cases. Substituting the identified influential observations with the median therefore respects the assumption that all observations must be independent. This process is repeated for the next lower pairs of extreme values, and stops when five modified winsorized means have been obtained for each predictor variable. When the modified winsorized means are plotted, a predictor variable whose bar shapes are not similar between groups in the 2-D area plot is identified as a predictor variable with legitimate contaminants.
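The Step 1 procedure for a single predictor variable can be sketched as follows. This is only an illustration of the description above: the function name, and the use of the overall sample median as the replacement value, are assumptions rather than the authors' exact specification.

```python
import numpy as np

def modified_winsorized_means(scores, levels=5):
    """Modified winsorized means for one predictor variable.

    For each level k = 1..levels, the k most extreme scores at each
    end of the ordered distribution are replaced with the sample
    median before the mean is taken.
    """
    x = np.sort(np.asarray(scores, dtype=float))
    med = np.median(x)
    means = []
    for k in range(1, levels + 1):
        w = x.copy()
        w[:k] = med    # k smallest scores -> median
        w[-k:] = med   # k largest scores -> median
        means.append(float(w.mean()))
    return means
```

Plotting these five means per group side by side then allows the bar-shape comparison in the 2-D area plot described above.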

*Step 2: Removing legitimate contaminants from the identified predictor variables.*

To determine what percentage of winsorization is required to eliminate the legitimate contaminants, the modified winsorization process is repeated only for the predictor variable(s) identified as having legitimate contaminants until the highest hit rate is attained, thus obtaining a near-optimal training sample given as:

$$D_{N(optimal)}^t = [x_1, x_2, \dots, x_P] \tag{1}$$
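This search over winsorization levels can be sketched as below, assuming a caller-supplied `hit_rate` function (a hypothetical name; in the chapter this would be the classification hit rate obtained from the fitted discriminant function) that scores each candidate training sample:

```python
import numpy as np

def winsorize_at(x, k):
    """Replace the k most extreme scores at each end with the median."""
    w = np.sort(np.asarray(x, dtype=float))
    med = np.median(w)
    w[:k] = med
    w[-k:] = med
    return w

def best_winsorization_level(x, hit_rate, max_level=5):
    """Return the winsorization level giving the highest hit rate.

    `hit_rate` is any callable mapping a (winsorized) sample of one
    predictor's scores to a classification hit rate; both names are
    assumptions for this sketch, not the authors' notation.
    """
    best_k, best_rate = 0, hit_rate(np.asarray(x, dtype=float))
    for k in range(1, max_level + 1):
        r = hit_rate(winsorize_at(x, k))
        if r > best_rate:   # keep the first level that improves the rate
            best_k, best_rate = k, r
    return best_k, best_rate
```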

*Step 3: Obtaining a statistically optimal hit rate.*

The optimal training sample of Eq. (1) is then used to build the optimized PDF, $Z_{(optimal)}$, given as:

$$\begin{split} Z\_{(optimal)} &= u\_1 X\_1 + u\_2 X\_2 + \cdots + u\_P X\_P \\ &= \eta \left( D\_{N(optimal)}^t \right) \end{split} \tag{2}$$

where $Z_{(optimal)}$ is the optimized PDF, the $u_i$ are the discriminant weights, the $X_i$ are the predictor variables, and $\eta\left(D_{N(optimal)}^t\right)$ indicates that the PDF is constructed with an optimal training sample.
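Eq. (2) is simply a weighted sum of predictor scores. As a small numerical illustration (the weights and the observation below are made-up values, not weights estimated from any dataset):

```python
import numpy as np

# Eq. (2): the discriminant score Z is a weighted sum of the predictor
# variables.  The weights u_i and the observation x are illustrative
# values only, not fitted discriminant weights.
u = np.array([0.6, -0.3, 0.8])   # discriminant weights u_1, ..., u_P (P = 3)
x = np.array([1.2, 0.5, 2.0])    # one observation's scores on X_1, ..., X_P
Z = float(u @ x)                 # Z = u_1*X_1 + u_2*X_2 + u_3*X_3
```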

To get a statistically optimal hit rate, let:

$$d_j = \begin{cases} 1 & \text{if } \hat{Z}_j = Z_j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $\hat{Z}_j$ is the predicted response for the $j$th observation in the optimized training sample and $Z_j$ is the actual value for the $j$th observation in the optimized training sample. Therefore, a statistically optimal hit rate for the optimized PDF in Eq. (2) is given as:

$$P^{(a)} = \frac{1}{N^t} \left(\sum\_{j=1}^{N^t} d\_j\right) \times 100\tag{4}$$

where $N^t$ is the total number of cases over all groups in the optimized training sample. If we redefine $d_j$ as:

$$d_j = \begin{cases} 1 & \text{if } \hat{Z}^{-j} = Z_j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $\hat{Z}^{-j}$ is the predicted response for the $j$th observation, computed with the $j$th observation removed from the training sample, the leave-one-out cross-validation (LOOCV) estimate of the optimized hit rate of Eq. (4) is given by:

$$\hat{P}_{LOOCV}^{(a)} = \frac{1}{N_v} \left(\sum_{j=1}^{N_v} d_j\right) \times 100 \tag{6}$$

where $N_v$ is the number of validation samples.
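The apparent hit rate of Eq. (4) and its LOOCV estimate of Eq. (6) can be sketched as follows. Since the chapter's fitted PDF is not reproduced here, a nearest-centroid rule stands in for the classification rule; all function names are assumptions for this sketch.

```python
import numpy as np

def hit_rate(Z_hat, Z):
    """Apparent hit rate, Eq. (4): percentage of observations whose
    predicted group label matches the actual group label."""
    Z_hat, Z = np.asarray(Z_hat), np.asarray(Z)
    return 100.0 * np.mean(Z_hat == Z)

def nearest_centroid(X_train, y_train, x_new):
    """Stand-in classification rule: assign x_new to the group whose
    centroid is closest in Euclidean distance."""
    labels = np.unique(y_train)
    cents = [X_train[y_train == g].mean(axis=0) for g in labels]
    dists = [np.linalg.norm(x_new - c) for c in cents]
    return labels[int(np.argmin(dists))]

def loocv_hit_rate(X, y, classify):
    """LOOCV estimate of the hit rate, Eq. (6): each observation is
    predicted with itself removed from the training sample."""
    n = len(y)
    hits = 0
    for j in range(n):
        mask = np.arange(n) != j
        if classify(X[mask], y[mask], X[j]) == y[j]:
            hits += 1
    return 100.0 * hits / n
```

On well-separated groups the two estimates agree; the LOOCV version guards against the optimistic bias of scoring the rule on the same cases used to build it.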
