**2. Outliers and legitimate contaminants in PDA**

In PDA, an outlier is an observation that does not belong to its assigned group and often indicates an incorrect measurement or an incorrect allocation of the unit or observation. Such an outlying observation can cause severe problems that even the robustness of PDA may not overcome. Over the last two decades, many articles have been published on detecting outliers in discriminant analysis (DA) [31–38]. In PDA, a popular means of treating outliers is to construct multiple PDFs, with assumed outliers added and with assumed outliers removed [1]. The primary issue with this method is whether potential outliers should be removed one at a time, two at a time, or all at once. The SPSS DISCRIMINANT procedure uses the chi-squared distribution to establish a typicality probability for each unit, and these typicality probabilities can be used to identify potential outliers in the context of PDA. However, Huberty and Olejnik [1] pointed out that when the group covariance matrices are not equal, the unit typicality probabilities are difficult to interpret because different distance metrics are used in the calculation. A common distance index for detecting outliers or influential observations in the context of PDA is the Mahalanobis distance [39], which is also produced as a byproduct of the SPSS DISCRIMINANT procedure.
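To make the distance-based screening concrete, the following sketch (our illustration, not the SPSS implementation) computes squared Mahalanobis distances from a group centroid and converts them to chi-squared typicality probabilities; the `flag_outliers` helper and the 0.01 cutoff are illustrative assumptions, not fixed conventions.

```python
import numpy as np
from scipy import stats

def flag_outliers(X, alpha=0.01):
    """Flag units whose chi-squared typicality probability falls below alpha.

    X : (n, p) array of one group's training observations.
    """
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each unit from the group centroid
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    # Typicality probability: P(chi-squared with p d.f. >= d2); small = atypical
    typicality = stats.chi2.sf(d2, df=X.shape[1])
    return typicality < alpha

# Example: plant one gross outlier in a bivariate normal sample
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)
X[0] = [6.0, -6.0]
print(np.where(flag_outliers(X))[0])  # expected to include index 0
```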

However, there are also hidden influential observations (or legitimate contaminants) resulting either from an incorrect distributional assumption (i.e., when the data turn out to have a different structure than originally assumed) or from the inherent variability of the dataset; see Osborne [30] and Iglewicz and Hoaglin [40] for more details. Although hidden influential observations may legitimately belong to a training sample, if they are not randomly distributed they may reduce normality, which often leads to violations of the sphericity and multivariate normality assumptions in PDA. Hidden influential observations can also adversely affect the quality of the PDA solution and its generality. Yet how to identify and remove hidden influential observations before building a classification model (particularly in PDA) has not received significant attention in the literature from statisticians or methodologists, and therefore not from substantive researchers. Moreover, the SPSS typicality and Mahalanobis indices may not identify hidden influential observations because, unlike outliers, such units still appear to belong to a group.

Therefore, much emphasis should be placed on cleaning the training sample, by removing all legitimate contaminants, so that it approaches a near-optimum condition. This approach is similar to optimizing decision trees (in particular, classification trees), which consists of reducing the amount of impurity; see Myatt [41] for details. In the context of PDA, such cleaning will improve the similarity of each predictor variable's variance across groups, thereby improving the approximation of the true distributional shape. This will no doubt enhance the statistical optimality of the PDF solution and its classification accuracy.
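A minimal sketch of such a cleaning pass, under the assumption that between-group variance similarity is the criterion to optimize: greedily drop the unit whose removal most reduces the worst between-group variance ratio of the predictors. The `clean_training_sample` helper, the 1.5 target ratio, and the removal cap are hypothetical choices for illustration, not a published algorithm.

```python
import numpy as np

def worst_variance_ratio(X, y):
    """Largest max/min ratio of per-predictor variances across groups."""
    v = np.array([X[y == g].var(axis=0, ddof=1) for g in np.unique(y)])
    return (v.max(axis=0) / v.min(axis=0)).max()

def clean_training_sample(X, y, target=1.5, max_removals=10):
    """Greedily drop units until the worst variance ratio reaches target.

    Assumes each group retains at least two units throughout.
    """
    keep = np.ones(len(X), dtype=bool)
    for _ in range(max_removals):
        current = worst_variance_ratio(X[keep], y[keep])
        if current <= target:
            break
        best_i, best_r = None, current
        for i in np.where(keep)[0]:
            trial = keep.copy()
            trial[i] = False
            r = worst_variance_ratio(X[trial], y[trial])
            if r < best_r:
                best_i, best_r = i, r
        if best_i is None:          # no single removal helps; stop
            break
        keep[best_i] = False        # drop the most "impure" unit
    return keep                     # boolean mask of retained units
```

A classification-accuracy criterion (e.g., cross-validated hit rate) could be substituted for the variance ratio in the same greedy loop; the structure of the pass would be unchanged.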
