**4.3 The proposed alternative statistical interpretation for the informative graphical diagnostic**

As described in Step 1 of Section 4.2.1, the graphical representation of the modified winsorized means makes it easy to identify the predictor variable(s) whose variance differs between the groups. If the variance of a predictor variable is not similar between groups, the bars representing the modified winsorized mean values for that variable in the 2-D area plot will not have a similar shape, and the variable is interpreted as containing legitimate contaminants. The more the shapes of the bars differ between the groups, the easier the 2-D area plot is to interpret. When it is difficult to distinguish the shape of a variable between the groups in the 2-D area plot, an alternative interpretation is needed. A simple numerical interpretation of the informative graphical diagnostic is therefore proposed for such cases. It consists of the following two steps:

*Step 1: Fitting the modified winsorized means values to a linear regression model.*

For each group, the modified winsorized means and their corresponding winsorized percentage values for each predictor variable are fitted to a linear regression model given as:

$$\begin{aligned} Y\_{11} &= a + b\_{11}X \\ &\vdots \\ Y\_{1P} &= a + b\_{1P}X \end{aligned} \tag{7}$$

and

$$\begin{aligned} Y\_{21} &= a + b\_{21}X \\ &\vdots \\ Y\_{2P} &= a + b\_{2P}X \end{aligned} \tag{8}$$

where *Y*<sub>11</sub>, ⋯, *Y*<sub>1*P*</sub> and *Y*<sub>21</sub>, ⋯, *Y*<sub>2*P*</sub> are the modified winsorized mean values for the *P* predictor variables in groups 1 and 2, respectively, and *X* is the corresponding winsorized percentage.
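The fitting in Step 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes ordinary least squares, and the function name `fit_slopes`, the array layout (one row per winsorized percentage, one column per predictor variable), and the example values are all hypothetical.

```python
import numpy as np


def fit_slopes(percents, wmeans):
    """Fit Y = a + b*X by ordinary least squares, one fit per predictor.

    percents : winsorized percentages X, shape (n,)
    wmeans   : modified winsorized means for one group, shape (n, P),
               one column per predictor variable (assumed layout)
    Returns (intercepts, slopes), each of shape (P,).
    """
    percents = np.asarray(percents, dtype=float)
    wmeans = np.asarray(wmeans, dtype=float)
    # With a 2-D y, np.polyfit fits every column separately and returns
    # the coefficients with the highest power first: row 0 = slope b,
    # row 1 = intercept a.
    slopes, intercepts = np.polyfit(percents, wmeans, deg=1)
    return intercepts, slopes


# Hypothetical data: four winsorized percentages, two predictor variables.
X = [5.0, 10.0, 15.0, 20.0]
group1_means = [[2.5, 4.0],
                [3.0, 3.5],
                [3.5, 3.0],
                [4.0, 2.5]]
a1, b1 = fit_slopes(X, group1_means)
print("group 1 slopes:", b1)
```

Calling `fit_slopes` once per group gives the slope vectors corresponding to the coefficients *b* in equations (7) and (8).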

*Step 2: Obtaining the absolute difference between the corresponding regression coefficients for the groups.*

The absolute difference between the regression coefficients (i.e., the slopes) obtained for groups 1 and 2 is computed for each predictor variable as:

$$\delta\_{\rm abs} = \left| b\_{1i} - b\_{2i} \right|, \quad i = 1, \cdots, P \tag{9}$$

where the subscripts 1 and 2 represent groups 1 and 2. A predictor variable with an absolute difference of 0.75 or greater is identified as a variable containing legitimate contaminants. In PDA, if the two samples are equal in size, there is a 50/50 chance of correct classification by chance alone. Most researchers would accept a classification accuracy 25% greater than that achieved by chance, hence the choice of 0.75 as the decision boundary.
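The decision rule in Step 2 can be illustrated with a short Python sketch. The function name `flag_contaminated` and the slope values are hypothetical and chosen only for illustration; the 0.75 threshold is the decision boundary stated above.

```python
import numpy as np


def flag_contaminated(slopes_g1, slopes_g2, threshold=0.75):
    """Absolute slope difference per predictor, plus the decision flag.

    slopes_g1, slopes_g2 : regression slopes for groups 1 and 2, shape (P,)
    threshold            : decision boundary (0.75 in the text)
    Returns (delta_abs, flags), where flags[i] is True when predictor i
    has an absolute slope difference at or above the threshold.
    """
    b1 = np.asarray(slopes_g1, dtype=float)
    b2 = np.asarray(slopes_g2, dtype=float)
    delta_abs = np.abs(b1 - b2)
    return delta_abs, delta_abs >= threshold


# Hypothetical slopes for three predictor variables in each group.
delta, flagged = flag_contaminated([0.10, 1.20, -0.30], [0.15, 0.20, 0.50])
for i, (d, f) in enumerate(zip(delta, flagged), start=1):
    print(f"predictor {i}: delta_abs = {d:.2f}, flagged = {f}")
```

With these illustrative values, the second and third predictors exceed the boundary and would be the variables identified as containing legitimate contaminants, while the first would not.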
