**3.2 Case 2**

Other examples also show the value of PCA for data visualization. Forensic investigations often require the comparison of very similar samples, and such comparisons invariably call for specific instrumentation. In one case, three paint samples (A, B and C) had to be compared in order to establish whether sample C was more similar to A or to B. At first glance, all three samples appeared to be very alike. The analytical method that might have answered this question is pyrolysis followed by GC, but the laboratory in question was not equipped with the necessary devices. Therefore, FT-IR (Fourier Transform Infrared Spectroscopy) analyses, which are both quick and relatively cheap, were carried out in transmission on 10 portions of each sample in order to characterize the variability of each sample: in other words, each sample was treated as a class of 10 samples. To obtain a first visualization of the data, PCA was applied to a data set of 30 samples, with variables obtained from the FT-IR transmittances at a data spacing of 64 cm⁻¹ (with a smooth of 11, corresponding to 21.213 cm⁻¹). The score plot of PC1 vs PC2 vs PC3, shown in figure 11, reveals a trend towards separation between the samples of classes A and B, while the C samples are more frequently close to A samples than to B samples. The similarity between classes C and A is therefore assumed to be greater than that between classes C and B.
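The way a score plot such as the one described here is obtained can be sketched in a few lines. The following is a minimal illustration only, not the software used in the case study: it computes principal components by power iteration on the covariance matrix, a simple technique that suits small examples but converges slowly when eigenvalues are close. The function names and the `toy_spectra` values are hypothetical and made up for illustration; they are not the paint data.

```python
import math

def center(rows):
    """Subtract the column mean from every variable (column)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (p x p) of column-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, iters=1000):
    """Power iteration: return (eigenvalue, unit loading vector)."""
    p = len(cov)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    eigval = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return eigval, v

def pca(rows, n_components):
    """Return (scores, loadings, explained-variance fractions)."""
    x = center(rows)
    cov = covariance(x)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))  # total variance (trace)
    loadings, fractions = [], []
    for _ in range(n_components):
        eigval, v = leading_component(cov)
        loadings.append(v)
        fractions.append(eigval / total)
        # Deflation: remove the component just found before the next search.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    # Scores are the projections of the centered samples onto the loadings.
    scores = [[sum(r[j] * v[j] for j in range(p)) for v in loadings] for r in x]
    return scores, loadings, fractions

# Made-up "transmittance" values: 6 samples x 4 variables (illustration only).
toy_spectra = [
    [0.82, 0.40, 0.61, 0.15],
    [0.80, 0.43, 0.60, 0.17],
    [0.79, 0.41, 0.63, 0.14],
    [0.60, 0.55, 0.45, 0.30],
    [0.62, 0.57, 0.44, 0.28],
    [0.59, 0.54, 0.47, 0.31],
]
scores, loadings, fractions = pca(toy_spectra, 3)
print("explained variance fractions:", [round(f, 4) for f in fractions])
print("PC1/PC2/PC3 scores of first sample:", [round(s, 4) for s in scores[0]])
```

Plotting the three columns of `scores` against each other gives the kind of three-dimensional score plot discussed in the text; samples of the same class are expected to cluster if the spectra differ systematically between classes.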

As the analytical problem required sample C to be assigned to one of two classes, A or B, a discriminant analysis tool was then applied, with discriminant functions calculated only for classes A and B, while the C samples were treated as unknowns. The aim of this analysis was to verify in which of the two classes (A or B) the C samples were more frequently classified. Discriminant analysis always assigns an unknown sample to one of the classes (even if it is an outlier or belongs to a class other than those modelled), because it calculates only a delimiter between the known classes. For the purpose of this case study, this tool was therefore preferable to a class modeling tool, which instead builds a defined mathematical model for each class. Discriminant analysis was performed by calculating canonical discriminant functions and using the leave-one-out method; the canonical approach is an extension of Linear Discriminant Analysis (LDA) that finds a number of variables reflecting as much as possible the difference between the groups.

The discriminant analysis gave a classification ability of 100% for both classes A and B and a prediction ability of 70% and 80% respectively. Seven C samples were classified in class A, against three in class B. In conclusion, these results perfectly reflected those achieved in a laboratory equipped with pyrolysis devices.
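The difference between classification ability (training samples re-assigned by the full model) and prediction ability (leave-one-out), together with the forced assignment of unknowns, can be illustrated with a small sketch. Note the assumption: for brevity it uses a nearest-class-mean rule as a stand-in for the canonical discriminant functions of the case study, so it shows the validation logic rather than the actual classifier. All function names and the toy coordinates are hypothetical.

```python
import math

def class_means(samples, labels):
    """Mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def classify(x, means):
    """Assign x to the class with the nearest mean; a class is ALWAYS returned."""
    return min(sorted(means), key=lambda y: math.dist(x, means[y]))

def resubstitution(samples, labels):
    """Classify each training sample with the model built on all samples."""
    means = class_means(samples, labels)
    return [classify(x, means) for x in samples]

def leave_one_out(samples, labels):
    """Classify each sample with a model built without that sample."""
    predicted = []
    for i, x in enumerate(samples):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        predicted.append(classify(x, class_means(rest_x, rest_y)))
    return predicted

def per_class_rate(predicted, labels):
    """Fraction of correctly assigned samples, reported per class."""
    rates = {}
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rates[cls] = sum(predicted[i] == cls for i in idx) / len(idx)
    return rates

def tally_unknowns(unknowns, samples, labels):
    """Count how many unknown samples fall into each known class."""
    means = class_means(samples, labels)
    counts = {y: 0 for y in means}
    for x in unknowns:
        counts[classify(x, means)] += 1
    return counts

# Made-up two-dimensional scores (illustration only, not the paint data).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
         (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
labels = ["A"] * 4 + ["B"] * 4
unknown = [(1.0, 2.0), (5.0, 4.0), (50.0, 50.0)]  # the last one is an outlier

print("classification ability:", per_class_rate(resubstitution(train, labels), labels))
print("prediction ability:", per_class_rate(leave_one_out(train, labels), labels))
print("unknowns per class:", tally_unknowns(unknown, train, labels))
```

The outlying unknown is still assigned to one of the two classes, which is the behaviour exploited in the case study: the question was only which class the C samples fall into more often, not whether they belong to either class at all.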


SIMCA therefore confirms the results obtained with PCA, in so far as the unknown sample was significantly different from the A and E samples. Regarding classes B, C and D, SIMCA leads to the conclusion that the unknown sample, although an outlier for all classes, is nevertheless closer to class B (figure 9) than to the others. Finally, it can be concluded either that the sample under investigation does not belong to any of the classes studied (for example, it comes from another refinery not included in the data matrix), or that it belongs to one of the classes studied (the most probable class is number 2, followed by class 4) but the variability within each class might not have been sufficiently represented in the data matrix used.

Fig. 11. Score plot of PC3 versus PC2 versus PC1 relative to FT-IR data for 30 paint samples.

A further observation needs to be made about how discriminant analysis was applied in this case. This chemometric tool was applied not to the original variables but to 5 principal components, which account for 97.48% of the total variance. This procedure was adopted because the original variables outnumbered the samples used to build the classification rule between classes A and B (20). PCA is therefore imperative in classification problems where the number of variables is greater than the number of samples. In such cases, applying discriminant analysis to the original variables would cause overfitting; in other words, the model obtained would be sound and specific only for the training set used to build it, and its application to real cases (like this one) would not prove very reliable. With overfitting, classification ability tends to increase while prediction ability tends to decrease. The best approach in these cases is to apply discriminant analysis to the PCs, using a number of PCs (obviously smaller than the number of original variables) that explains a fair proportion of the variance contained in the original data. DA provides reliable results if the ratio between the number of samples and the number of variables is greater than 3.
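The rule of thumb on the ratio between samples and variables can be checked with simple arithmetic. The sketch below uses the two figures given in the text (20 training samples, 5 PCs); the function names are hypothetical, and the threshold of 3 is the one stated above.

```python
def samples_to_variables_ratio(n_samples, n_variables):
    """Ratio between the number of samples and the number of variables."""
    return n_samples / n_variables

def max_variables(n_samples, min_ratio=3.0):
    """Largest number of variables keeping n_samples / n_variables > min_ratio."""
    k = 0
    while n_samples / (k + 1) > min_ratio:
        k += 1
    return k

n_training = 20  # samples of classes A and B used to build the rule
n_pcs = 5        # principal components retained in the case study

print(samples_to_variables_ratio(n_training, n_pcs))  # 4.0, above the threshold of 3
print(max_variables(n_training))                      # 6 variables at most for 20 samples
```

With 20 training samples, the condition is therefore met by the 5 PCs used, whereas any set of original variables larger than 6 would violate it.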
