**3.1 Case 1**

The gasoline data matrix has been used in real cases of arson to link a sample of unevaporated gasoline, found at a fire scene in an unburned can, to its brand or refinery. This helped to answer, for example, questions posed by a military body about the origin of an unevaporated gasoline sample taken from a suspected arsonist. The gasoline sample under investigation was analyzed with the same procedure and the same instruments as those adopted by Monfreda and Gregori (2011), at almost the same time as the 50 samples of that earlier work. Three independent portions of the sample were analyzed and, from the Total Ion Chromatogram (TIC) of each analysis, a semi-quantitative report of the peak areas of the same target compounds (TCs) used by Monfreda and Gregori was obtained. Areas were normalized to the area of the base peak (benzene, 1,2,3-trimethyl), set to 10000, as in the previous study. The average areas of the three portions for the aromatic compounds were appended to the data matrix of the 50 gasoline samples analyzed by Monfreda and Gregori, and PCA was then applied to the resulting data set of 51 samples and 16 variables.

Results are shown in the score plots of figures 5, 6 and 7. These plots show that the sample under investigation is significantly different from those of the A and E brands. As a consequence, these two refineries could be excluded from further investigations by the relevant authorities, because the unknown sample was less likely to belong to brand A or E than to the other classes. The score plot of PC2 versus PC1 (figure 5) places the unknown sample among classes B, C and D. In the score plot of PC3 versus PC1 (figure 6), the unknown sample is very close to the samples of class B and quite distant from class C; it falls, however, in an area where some samples of brand D are also present. Finally, in the three-dimensional score plot of PC3 versus PC2 versus PC1 (figure 7), the sample under investigation appears to fall between the B and D classes.
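The preprocessing described above (normalization to the base peak, averaging of the three portions, appending the result to the reference matrix) can be sketched as follows. This is only an illustration under stated assumptions: the peak areas are invented, three target compounds stand in for the 16 variables, and the function names are hypothetical rather than taken from the original study.

```python
SCALE = 10000.0  # value assigned to the base peak, as stated in the text


def normalize_to_base_peak(areas, base_index, scale=SCALE):
    """Rescale one chromatogram's peak areas so the base peak equals `scale`."""
    base = areas[base_index]
    if base <= 0:
        raise ValueError("base peak area must be positive")
    return [scale * a / base for a in areas]


def average_replicates(replicates):
    """Average normalized areas, compound by compound, across replicates."""
    n = len(replicates)
    return [sum(column) / n for column in zip(*replicates)]


# Hypothetical raw peak areas: three portions of the unknown sample,
# three target compounds each (the study used 16 variables).
raw_portions = [
    [52000.0, 130000.0, 26000.0],
    [50000.0, 125000.0, 27500.0],
    [54000.0, 135000.0, 25650.0],
]
BASE_INDEX = 1  # assumed position of the base peak in each row

normalized = [normalize_to_base_peak(p, BASE_INDEX) for p in raw_portions]
unknown_row = average_replicates(normalized)

# Hypothetical, already normalized reference samples (50 rows in the study).
reference_matrix = [
    [4100.0, 10000.0, 1900.0],
    [3900.0, 10000.0, 2300.0],
]
# The averaged unknown becomes the last row of the matrix submitted to PCA.
data_matrix = reference_matrix + [unknown_row]
```

Normalizing each portion before averaging keeps the three analyses on the same scale, so differences in injected amount do not weight one portion more than another.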


56 Principal Component Analysis


Fig. 5. Score plot of PC2 versus PC1 for 51 gasoline samples.

Fig. 6. Score plot of PC3 versus PC1 for 51 gasoline samples.

Fig. 7. Score plot of PC3 versus PC2 versus PC1 for 51 gasoline samples.


PCA was especially useful for an initial visualization of the data; however, the question posed by the military body also needed to be handled with supervised methods, that is, discriminant analysis or class modeling tools. With these methods, the system is forced to build boundaries between classes, and the unknown sample is then processed. For this kind of problem, class modeling tools are clearly preferable to discriminant analysis, because they build a model for each category rather than a simple delimiter between classes. The modeling rule discriminates between the studied category and the rest of the universe. As a consequence, each sample can be assigned to a single category, to more than one category (if more than one class is modeled) or, alternatively, considered an outlier if it falls outside every model. Discriminant analysis, by contrast, always assigns the unknown sample to one of the studied categories, even though it may not actually belong to any of them. In this case, the class modeling technique known as SIMCA (Soft Independent Models of Class Analogy) was applied to the data set under investigation. SIMCA builds a mathematical model of each category from its principal components, and a sample is accepted by a category if its distance to the model is not significantly different from the class residual standard deviation. This chemometric tool was applied with a 95% confidence level to define the class space and with the unweighted augmented distance (Wold & Sjostrom, 1977). A cross validation with 10 cancellation groups was then carried out, and 8 components were used to build the mathematical model of each class. The boundaries were forced to include all the objects of the training set in each class, which gave a sensitivity (the percentage of objects belonging to the category that are correctly identified by the mathematical model) of 100%. The specificity (the percentage of objects from other categories that are classified as foreign) was also 100%. Results are shown in the Coomans plots (figures 8, 9 and 10), where the classes are labeled with the numbers 1 to 5 instead of the letters A to E, respectively.
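The class modeling idea can be illustrated with a deliberately simplified sketch. It is not the procedure used in the study: each class is modeled with a single principal component (found by power iteration) instead of 8, the plain residual distance replaces the unweighted augmented distance, and no F-test at the 95% level is performed. The only element taken directly from the text is that the class boundary is forced to include every training object. All data and names are hypothetical.

```python
import math


def residual_distance(sample, means, loading):
    """Residual standard deviation of a sample after projection on the class PC."""
    centered = [x - mu for x, mu in zip(sample, means)]
    score = sum(x * p for x, p in zip(centered, loading))
    residual = [x - score * p for x, p in zip(centered, loading)]
    dof = len(sample) - 1  # variables minus the single component used here
    return math.sqrt(sum(e * e for e in residual) / dof)


def fit_class_model(rows, iterations=100):
    """Fit a one-component PCA model to the training rows of one class."""
    n, m = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(row, means)] for row in rows]
    # Power iteration on X^T X, started from the longest centered row.
    loading = max(centered, key=lambda r: sum(x * x for x in r))
    for _ in range(iterations):
        scores = [sum(x * p for x, p in zip(row, loading)) for row in centered]
        w = [sum(t * row[j] for t, row in zip(scores, centered)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        loading = [x / norm for x in w]
    # Boundary forced to include every training object (100% sensitivity).
    boundary = max(residual_distance(row, means, loading) for row in rows)
    return {"means": means, "loading": loading, "boundary": boundary}


def distance_to(model, sample):
    """Distance of a sample from a fitted class model."""
    return residual_distance(sample, model["means"], model["loading"])


def accepts(model, sample):
    """True if the sample falls inside the class boundary."""
    return distance_to(model, sample) <= model["boundary"]


# Hypothetical training sets: two classes, three variables each.
CLASS_A = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.0], [3.0, 5.9, -0.1],
           [4.0, 8.0, 0.1], [5.0, 10.1, 0.0]]
CLASS_B = [[10.0, 1.0, 3.0], [10.1, 2.0, 6.1], [9.9, 3.0, 8.9],
           [10.0, 4.0, 12.0], [10.1, 5.0, 15.1]]
UNKNOWN = [6.0, 6.0, 6.0]

model_a = fit_class_model(CLASS_A)
model_b = fit_class_model(CLASS_B)
for name, model in (("A", model_a), ("B", model_b)):
    print(name, round(distance_to(model, UNKNOWN), 2), accepts(model, UNKNOWN))
```

Because each class has its own model and boundary, a sample such as `UNKNOWN` can be rejected by every class, which is the behavior that distinguishes class modeling from discriminant analysis.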

Fig. 8. Coomans plot for the classes 3 (C) and 4 (D).


In the Coomans plot of classes 3 (C) and 4 (D) (figure 8), the unknown sample (red square) is an outlier but lies closer to class 4 than to class 3. In figure 9, where the distances from classes 2 (B) and 4 (D) are displayed, the sample under investigation remains an outlier, but its distance from class 2 is shorter than its distance from class 4. In figure 10, where the distances from classes 1 (A) and 5 (E) are plotted, the unknown sample does not appear because it lies too far from both classes.
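The way such a plot is read can be sketched in a few lines. A Coomans plot places each sample at its distance from two class models, and the two critical distances divide the plane into four regions. The numbers below are hypothetical, and comparing distances as fractions of each class's critical distance is an assumption made here for illustration, not a detail given in the text.

```python
def coomans_region(d1, d2, crit1, crit2):
    """Region of a Coomans plot for distances d1, d2 from two class models."""
    in_first = d1 <= crit1
    in_second = d2 <= crit2
    if in_first and in_second:
        return "both classes"
    if in_first:
        return "first class only"
    if in_second:
        return "second class only"
    return "outlier"


def nearer_class(d1, d2, crit1, crit2):
    """Class whose model is nearer, in units of its own critical distance."""
    return "first" if d1 / crit1 <= d2 / crit2 else "second"


# Hypothetical distances of an unknown sample from two class models.
d_first, d_second = 3.1, 1.8
crit_first, crit_second = 1.2, 1.1

print(coomans_region(d_first, d_second, crit_first, crit_second))
print(nearer_class(d_first, d_second, crit_first, crit_second))
```

A sample in the "outlier" region can still be described as nearer to one class, which is how the unknown gasoline sample is characterized in figures 8 and 9.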

Fig. 9. Coomans plot for the classes 2 (B) and 4 (D).

Fig. 10. Coomans plot for the classes 1 (A) and 5 (E).

SIMCA confirms, therefore, the results obtained with PCA, in that the unknown sample was significantly different from the A and E samples. Regarding classes B, C and D, SIMCA indicates that the unknown sample, although an outlier for all classes, is nevertheless closer to class B (figure 9) than to the others. Two conclusions are therefore possible: either the sample under investigation does not belong to any of the classes studied (for example, because it comes from another refinery, not included in the data matrix), or it belongs to one of the classes studied (most probably class 2, followed by class 4) but the variability within each class was not sufficiently represented in the data matrix used.

Fig. 11. Score plot of PC3 versus PC2 versus PC1 relative to FT-IR data for 30 paint samples.

A further observation needs to be made about how discriminant analysis was applied in this case. It was applied not to the original variables but to 5 principal components, which account for 97.48% of the total variance. This procedure was adopted because the original variables outnumbered the 20 samples used to build the classification rule between classes A and B. PCA is, therefore, imperative in classification problems where the number of variables is greater than the number of samples. In these cases, applying discriminant analysis to the original variables would cause overfitting; in other words, a sound and specific model would be obtained only for the training set used for its construction, and its application to real cases (like this one) would not prove very reliable. With overfitting, classification ability tends to increase while prediction ability tends to decrease. The best approach in these cases is to apply discriminant analysis to the PCs, using a number of PCs (obviously smaller than the number of original variables) that explains a fair share of the variance contained in the original data. DA provides reliable results if the ratio of the number of samples to the number of variables is greater than 3; with 20 training samples and 5 PCs, the ratio here is 4.

**3.3 Case 3**

Another forensic case involving very similar samples was the comparison of a piece of packing tape, used on a case containing drugs, with a roll of packing tape found during a house search, in order to establish whether the piece could have been torn from the roll. Finding such evidence would have been of utmost importance in building a strong case against the suspect. Both exhibits, analyzed by FTIR in transmission, revealed an adhesive part of polybutylacrylate and a support of
