**5.1 Using the modified winsorization with graphical diagnostic (MW-GD) method**

To evaluate the effectiveness of the modified winsorization with graphical diagnostic (MW-GD) method, two real samples were considered (see [6] for the two data sets), with the second used as a validation sample. The first training sample comes from a financial journal that is renowned among Japanese business leaders and is comparable to the Economist, Financial Times, and Business Week in Europe and the United States of America. This dataset contains 50 observations from each of the two groups of Japanese financial institutions, with each bank evaluated on the following seven performance indexes: (1) return on total assets (= total profits/average total assets), (2) labor profitability (= total profits/total employees), (3) equity to total assets (= total equity/average total assets), (4) total net working capital, (5) return on equity (= earnings available for common/average equity), (6) cost-profit ratio (= total operating expenditures/total profits), and (7) bad loan ratio (= total bad loans/total loans).

To take advantage of the beneficial effect of feature selection and outlier detection as preprocessing steps, a best subset and the critical values of the Mahalanobis distance were first obtained using the SPSS stepwise method and the SPSS COMPUTE command, respectively. The stepwise approach produced a best subset comprising return on total assets (*X*<sub>1</sub>), labor profitability (*X*<sub>2</sub>), equity to total assets (*X*<sub>3</sub>), and bad loan ratio (*X*<sub>7</sub>). The SPSS output for the PDF constructed from the training sample of four variables is given by:

$$Z = 0.005X_1 + 0.006X_2 + 0.004X_3 + 0.005X_7 \tag{10}$$

The classification accuracy (hit rate) of the PDF *Z* in Eq. (10) and its LOOCV estimate are given by Eqs. (11) and (12).

$$P^{(a)} = \frac{1}{N^t} \left(\sum_{j=1}^{N^t} d_j\right) \times 100 = \textbf{86.0\%}\tag{11}$$

$$\hat{P}_{LOOCV}^{(a)} = \frac{1}{N} \left(\sum_{j=1}^{n_v} d_j\right) \times 100 = 85\% \tag{12}$$
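The hit rate and its leave-one-out estimate are both percentages of correctly classified cases. The following Python sketch illustrates the two computations, assuming that *d<sub>j</sub>* equals 1 when case *j* is assigned to its true group and 0 otherwise. The nearest-centroid rule is only a hypothetical stand-in for the SPSS discriminant function, and the names `hit_rate`, `nearest_centroid_fit`, and `loocv_hit_rate`, as well as the toy data, are illustrative and not taken from the study.

```python
from statistics import mean


def hit_rate(actual, predicted):
    """Percentage of cases whose predicted group equals the actual group."""
    hits = [1 if a == p else 0 for a, p in zip(actual, predicted)]  # d_j
    return 100.0 * sum(hits) / len(hits)


def nearest_centroid_fit(rows, labels):
    """Stand-in classifier (an assumption, not the PDF): one centroid per group."""
    centroids = {}
    for g in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == g]
        centroids[g] = [mean(col) for col in zip(*members)]

    def predict(row):
        def sq_dist(c):
            return sum((x - m) ** 2 for x, m in zip(row, c))
        return min(centroids, key=lambda g: sq_dist(centroids[g]))

    return predict


def loocv_hit_rate(rows, labels, fit=nearest_centroid_fit):
    """Leave-one-out estimate: refit without case j, then classify case j."""
    predicted = []
    for j in range(len(rows)):
        train_rows = rows[:j] + rows[j + 1:]
        train_labels = labels[:j] + labels[j + 1:]
        predict = fit(train_rows, train_labels)
        predicted.append(predict(rows[j]))
    return hit_rate(labels, predicted)


# Toy data for illustration only (not the bank data set).
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [3.0, 4.0], [3.2, 3.9], [2.9, 4.2]]
labels = [1, 1, 1, 2, 2, 2]
predict = nearest_centroid_fit(rows, labels)
resub = hit_rate(labels, [predict(r) for r in rows])  # resubstitution hit rate
loo = loocv_hit_rate(rows, labels)                    # leave-one-out hit rate
```

The resubstitution hit rate reuses the training cases for evaluation, so it tends to be optimistic; the leave-one-out estimate classifies each case with a rule fitted without it, which is why the two values reported in the text differ.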

The two critical values from the SPSS output, obtained with probabilities 0.95 and 0.99 in the IDF.CHISQ function and four degrees of freedom (one per predictor variable), were:

$$\text{COMPUTE Critical} = \text{IDF.CHISQ}(0.95, 4) = 9.49$$

$$\text{COMPUTE Critical} = \text{IDF.CHISQ}(0.99, 4) = 13.28$$

The Mahalanobis distance values for all cases, as reported in the casewise statistics table, were all lower than the two critical values. This means that there are neither outliers nor hidden influential observations in the training sample. Next, the MW-GD algorithm was applied to the training sample, *D*<sup>*t*</sup><sub>*N*</sub>, consisting of four predictor variables.
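The two critical values above are upper quantiles of the chi-square distribution with degrees of freedom equal to the number of predictors. As a rough cross-check outside SPSS, the following Python sketch computes them with the standard library only. The names `chi2_cdf`, `chi2_quantile`, and `flag_outliers` are illustrative, and the sketch assumes that the values in the casewise statistics table are squared Mahalanobis distances.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert the CDF by bisection (plays the role of IDF.CHISQ(p, df))."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:  # expand the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def flag_outliers(d2_values, p, df):
    """Indices of cases whose squared Mahalanobis distance exceeds the cutoff."""
    cutoff = chi2_quantile(p, df)
    return [i for i, d2 in enumerate(d2_values) if d2 > cutoff]


crit_95 = round(chi2_quantile(0.95, 4), 2)  # 9.49
crit_99 = round(chi2_quantile(0.99, 4), 2)  # 13.28
```

An empty list returned by `flag_outliers` for both probabilities would correspond to the conclusion drawn in the text, namely that no case exceeds either critical value.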

At *step 1* of the MW-GD method, a total of five modified winsorized means were obtained for each of the four predictor variables. A summary of the modified winsorized mean values is presented in **Table 1**.
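As a minimal sketch of step 1, assume that a modified winsorization of *k* pairs replaces the *k* smallest and *k* largest values of a variable within a group by that group's median, computed from the original values (classical winsorization would use the nearest retained values instead). Under that assumption, the five modified winsorized means for one variable in one group could be computed as follows. The function names and toy values are illustrative only.

```python
from statistics import mean, median


def modified_winsorize(values, k):
    """Replace the k smallest and k largest values with the sample median.

    The original case order is preserved so that rows stay aligned
    with the other predictor variables.
    """
    n = len(values)
    if k < 0 or 2 * k > n:
        raise ValueError("k must satisfy 0 <= 2k <= len(values)")
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    med = median(values)
    out = list(values)
    for i in order[:k] + order[n - k:]:
        out[i] = med
    return out


def modified_winsorized_means(values, max_pairs=5):
    """Mean after replacing 1, 2, ..., max_pairs pairs of extreme values."""
    return {k: mean(modified_winsorize(values, k)) for k in range(1, max_pairs + 1)}


# Toy values for one variable in one group (illustration only).
x = [1, 4, 5, 6, 7, 8, 9, 10, 12, 40]
means = modified_winsorized_means(x)
```

Running the same function on each group separately gives one row of means per group, which is the kind of paired summary that the graphical diagnostic then compares.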

To provide a meaningful interpretation of **Table 1**, the modified winsorized means for both groups were plotted as a 2-D area plot in Excel. The values were entered into the spreadsheet in pairs of columns: the *X*<sub>1</sub> modified winsorized means for groups 1 and 2 occupy the first two columns (rows 1 to 6), those for *X*<sub>2</sub> occupy columns 3 and 4 (rows 8 to 13), and those for *X*<sub>3</sub> and *X*<sub>7</sub> follow in the same way, proceeding downward in steps. The graphical representation is presented in **Figure 1**.


#### **Table 1.**

*Modified winsorized means for up to five pairs of winsorized values.*

*On the Use of Modified Winsorization with Graphical Diagnostic for Obtaining… DOI: http://dx.doi.org/10.5772/intechopen.104539*

A cursory look at **Figure 1** shows that the winsorized average values for the four predictor variables, as represented by the 2-D area plot, have a similar shape in both groups (that is, similar variances within the groups), except for predictor variable *X*<sub>2</sub>. In **Figure 1**, the shape for *X*<sub>2</sub> in group 1 looks like a rectangle, whereas in group 2 it looks like a trapezium. This difference in shape indicates that *X*<sub>2</sub> does not have similar variances in the two groups and is therefore the only variable with legitimate contaminants. It follows that the training sample does not satisfy the assumption of homogeneity of variances. This finding was corroborated by the Box M test for equality of variance-covariance matrices, which was significant for this training sample. Excluding the manual replacement of the extreme values at both ends with the median value, the average computation time needed to generate each row of results for groups 1 and 2 in **Table 1**, for each percentage of winsorization using the R `aggregate()` function, was 2 seconds.

At *step 2* of the MW-GD method, the modified winsorization process was performed for variable *X*<sub>2</sub> only. For each percentage of winsorization, a PDF was constructed using all four predictor variables. The hit rate results for each percentage of winsorization are summarized in **Table 2**. **Table 2** shows that the highest hit rate of 97.0% was achieved when 5 data points at both ends of the data were replaced by the median value. This means that all the legitimate contaminants in variable *X*<sub>2</sub> were completely replaced at 20% winsorization. In other words, at 20% winsorization, the lack of homogeneity of variances observed in variable *X*<sub>2</sub> was taken into account, thus yielding a near optimal training sample, *D*<sub>*N*</sub><sup>(*opt*)</sup>.
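The search over winsorization levels in step 2 can be sketched as a simple loop. The Python sketch below rests on several assumptions that are not stated in the text: the flagged variable is winsorized separately within each group using that group's median, the hit rate is the resubstitution hit rate, and a nearest-centroid rule stands in for the SPSS discriminant function. All names and the toy data are hypothetical.

```python
from statistics import mean, median


def modified_winsorize(values, k):
    """Replace the k smallest and k largest values with the median (order kept)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    med = median(values)
    out = list(values)
    for i in order[:k] + order[n - k:]:
        out[i] = med
    return out


def centroid_hit_rate(rows, labels):
    """Resubstitution hit rate (%) of a nearest-centroid stand-in classifier."""
    groups = sorted(set(labels))
    centroids = {}
    for g in groups:
        members = [r for r, lab in zip(rows, labels) if lab == g]
        centroids[g] = [mean(col) for col in zip(*members)]
    hits = 0
    for row, lab in zip(rows, labels):
        pred = min(
            groups,
            key=lambda g: sum((x - m) ** 2 for x, m in zip(row, centroids[g])),
        )
        hits += pred == lab
    return 100.0 * hits / len(rows)


def hit_rates_by_level(rows, labels, column, max_pairs):
    """Winsorize one column within each group for k = 0..max_pairs; score each."""
    results = {}
    for k in range(max_pairs + 1):
        new_rows = [list(r) for r in rows]
        for g in set(labels):
            idx = [i for i, lab in enumerate(labels) if lab == g]
            replaced = modified_winsorize([rows[i][column] for i in idx], k)
            for i, v in zip(idx, replaced):
                new_rows[i][column] = v
        results[k] = centroid_hit_rate(new_rows, labels)
    return results


# Toy data (illustration only): column 1 has one extreme value in each group.
rows = [[1.0, 9.0], [1.1, 2.0], [0.9, 2.2], [1.2, 1.9],
        [2.0, 3.0], [2.1, 3.1], [1.9, 2.9], [2.2, -4.0]]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
rates = hit_rates_by_level(rows, labels, column=1, max_pairs=2)
best_k = max(rates, key=rates.get)  # level with the highest hit rate
```

Only the flagged column is modified, so the remaining predictors enter every fit unchanged, mirroring the statement that each PDF is constructed from all predictor variables.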

The optimized training sample, *D*<sub>*N*</sub><sup>(*opt*)</sup>, was then used to construct a PDF. The SPSS output for the obtained PDF is given as:

$$Z_{opt} = 0.003X_1 + 0.018X_2 + 0.001X_3 + 0.004X_7 \tag{13}$$

The hit rate of the PDF *Z*<sub>*opt*</sub> in Eq. (13) and its LOOCV estimate are given by Eqs. (14) and (15).

$$P^{(a)} = \frac{1}{N^t} \left(\sum_{j=1}^{N^t} d_j\right) \times 100 = 97.0\% \tag{14}$$

$$\hat{P}_{LOOCV}^{(a)} = \frac{1}{N} \left(\sum_{j=1}^{n_v} d_j\right) \times 100 = 91\% \tag{15}$$


#### **Table 2.**

*Summary of hit rate results for each percentage of modified winsorization for predictor variable X<sub>2</sub>.*

In addition to the dataset from Japanese banks, a second real dataset was used to validate the first set of results in Eqs. (11), (12), (14), and (15). This validation sample was obtained from the academic records of junior secondary school (JSS) 2, University Demonstration Secondary School (UDSS), University of Benin, Nigeria. The dataset contains 30 observations for both classes: Science and Art. It consists of the average scores over three consecutive terms in eleven (11) subjects: English Language (*X*<sub>1</sub>), Mathematics (*X*<sub>2</sub>), Integrated Science (*X*<sub>3</sub>), Social Studies (*X*<sub>4</sub>), Introductory Technology (*X*<sub>5</sub>), Business Studies (*X*<sub>6</sub>), Home Economics (*X*<sub>7</sub>), Agricultural Science (*X*<sub>8</sub>), Fine Art (*X*<sub>9</sub>), Physical and Health Education (*X*<sub>10</sub>), and Computer Studies (*X*<sub>11</sub>). Using the SPSS stepwise method, a subset of three variables comprising Introductory Technology (*X*<sub>5</sub>), Physical and Health Education (*X*<sub>10</sub>), and Computer Studies (*X*<sub>11</sub>) was obtained. The SPSS output for the PDF is given as:

$$Z = 0.135X_{5} - 0.102X_{10} + 0.058X_{11} \tag{16}$$

The hit rate of the PDF *Z* in Eq. (16) and its LOOCV estimate are given by Eqs. (17) and (18).

$$P^{(a)} = \frac{1}{N^t} \left(\sum_{j=1}^{N^t} d_j\right) \times 100 = \textbf{85.0\%} \tag{17}$$

$$\hat{P}_{LOOCV}^{(a)} = \frac{1}{N} \left(\sum_{j=1}^{n_v} d_j\right) \times 100 = \textbf{83\%}\tag{18}$$

The two critical values from the SPSS output, obtained with probabilities 0.95 and 0.99 in the IDF.CHISQ function and three degrees of freedom (one per predictor variable), were:

$$\text{COMPUTE Critical} = \text{IDF.CHISQ}(0.95, 3) = 7.81$$

$$\text{COMPUTE Critical} = \text{IDF.CHISQ}(0.99, 3) = 11.34$$

The Mahalanobis distance values for all cases, as reported in the casewise statistics table, were all lower than the two critical values. This means that there are neither outliers nor hidden influential observations in the training sample. Once again, the proposed algorithm was applied to this second training sample, *D*<sup>*t*</sup><sub>*N*</sub>, consisting of three predictor variables.

At *step 1* of the MW-GD method, a total of five modified winsorized means were obtained for each of the three predictor variables. A summary of the modified winsorized mean values is presented in **Table 3**.

Again, to interpret **Table 3**, the modified winsorized means for both groups, as shown in column three of **Table 3**, were plotted as a 2-D area plot in Excel. The graphical representation is presented in **Figure 2**.

A cursory look at **Figure 2** shows that the winsorized average values for the three predictor variables, as represented by the 2-D area plot, have a similar shape in both groups (that is, similar variances within the groups). The similar shape shown by the three variables in each group indicates that there are no legitimate contaminants in the training sample, *D*<sup>*t*</sup><sub>*N*</sub>. This implies that the fit between the training sample and the basic assumptions of PDA (in particular, the assumption of homoscedasticity) is sufficient to construct a PDF whose hit rate can be said to be statistically optimal. Therefore, for this dataset, the MW-GD algorithm ends at *step 1*. The initial training sample of three variables obtained from the second real dataset using the SPSS stepwise method is therefore an optimal training sample.

#### **Table 3.**

*Winsorized means for up to five pairs of winsorized values.*

**Figure 2.** *Graphical representation of winsorized mean values in Table 3.*
