**3. Chemometrics to vibrational spectral analysis**

The automation and computerization of laboratories have had several important consequences. One of them is the rapid acquisition of large amounts of data. However, it is well known that acquiring such large amounts of data is far from providing appropriate answers more quickly. Obtaining multivariate vibrational spectroscopy data is not synonymous with possessing vibrational information. The data must be interpreted and placed in context to convert them into useful information for the user. Chemometrics is the field of chemistry that provides the user with the tools required to do so.

A great many chemometrics tools have been developed and tested. The most widely used for identifying, quantifying and classifying data sets are those based on principal component analysis (PCA), partial least squares (PLS), discriminant analysis (DA) and their combined usage (PLS-DA), and hierarchical cluster analysis (HCA). PCA transforms a set of variables into fewer variables (called factors, ranks, dimensions or principal components) that contain most of the information (variance) of the initial data set [28-31].

Multivariate Analysis in Vibrational Spectroscopy of Highly Energetic Materials and Chemical Warfare Agents Simulants 169


**Figure 6.** Simplified scheme for a PCA analysis.

Principal component analysis is an important method for the qualitative analysis of spectral data: it investigates the variation within a multivariate data set. The PCA algorithm seeks to condense the information from a large number of variables into a small number of uncorrelated components, with minimal loss of information. The main reasons for performing a PCA are to reduce the number of variables to fewer dimensions that contain as much information as possible, and to obtain uncorrelated dimensions (used to avoid multicollinearity in multiple regressions, among other things). The first step in PCA is to subtract the average value or spectrum from the entire data set; this is called mean centering. The largest source of variation in the data set is called principal component 1 (PC1). The second largest source of variation in the data, which is independent of PC1, is called PC2. Principal components form a set of orthogonal vectors. For each data point, its projection onto the PC1 or PC2 vector is called a score value. Plots of sample score values for different principal components, typically PC1 versus PC2, are called score plots. Score plots provide important information about how different samples are related. Principal component plots, also called loading plots, provide information about how different variables are related to each other. In practical cases, PCA uses a single X matrix, which here is formed by the infrared spectra. PCA is a purely qualitative analysis for visualizing whether there is variability within a set of IR spectra; it does not give a quantitative value that establishes how different the spectra in a data set are. Because samples that differ from the rest stand apart in the score plots, PCA can also be used to detect the presence of outliers. Figure 6 shows a simplified PCA scheme [28-29].
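The PCA steps described above (mean centering, extraction of orthogonal components, projection of the samples to obtain scores) can be sketched in a few lines. This is only a minimal illustration, not the procedure used in this chapter: it assumes Python with NumPy, obtains the components from a singular value decomposition, and uses synthetic two-band "spectra". The function name `pca` and all variable names are hypothetical.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA of a samples-by-variables matrix via SVD of the mean-centered data."""
    X = np.asarray(X, dtype=float)
    mean_spectrum = X.mean(axis=0)
    Xc = X - mean_spectrum                     # mean centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]               # rows are PC1, PC2, ... (orthonormal)
    scores = Xc @ loadings.T                   # projection of each sample onto each PC
    explained = s ** 2 / np.sum(s ** 2)        # fraction of total variance per PC
    return scores, loadings, explained[:n_components]

# Synthetic data (an assumption for illustration, not data from this chapter):
# 12 samples, 50 variables, two non-overlapping Gaussian bands plus noise.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 50)
band_a = np.exp(-((axis - 0.3) / 0.05) ** 2)
band_b = np.exp(-((axis - 0.7) / 0.05) ** 2)
amounts = rng.uniform(0.0, 1.0, size=(12, 2))
X = amounts @ np.vstack([band_a, band_b]) + 0.01 * rng.normal(size=(12, 50))

scores, loadings, explained = pca(X, n_components=2)
```

Plotting `scores[:, 0]` against `scores[:, 1]` would give a score plot, and plotting each row of `loadings` against the variable axis would give the loading plots.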

Partial least squares (PLS) regression is a quantitative spectral decomposition technique that is closely related to PCA regression. Its importance lies in its use for designing and building robust calibration models for multivariate quantitative analysis. Unlike PCA, PLS uses the concentration information during the decomposition process. This causes spectra containing higher constituent concentrations to be weighted more heavily than those with lower concentrations. The main idea of PLS is to get as much concentration information as possible into the first few loading vectors (also called components, factors, ranks or principal components). PLS regression consists of two fundamental steps. The first is to transform the predictor matrix *X* (spectra) of order n × p (n = number of samples and p = number of variables, in cm⁻¹ or nm) into a matrix of uncorrelated components or latent variables, *T* = (T1, ..., Tp) of order n × p, called PLS components, using the response vector *Y* (concentrations) of order n × 1; this contrasts with principal component analysis, in which the components are obtained using only the predictor matrix *X*. The second is to estimate the regression model using the original response vector *Y* as the response and the PLS components as predictors. The dimensionality reduction can be applied directly to the components because they are orthogonal. The number of components required for the regression analysis must be much smaller than the number of predictors. There are a number of ways of expressing these steps; a convenient one is given by Equations 1 and 2 [29]:

$$\mathbf{X} = \mathbf{T}\mathbf{P} + \mathbf{E} \tag{1}$$

$$\mathbf{c} = \mathbf{T}\mathbf{q} + \mathbf{f} \tag{2}$$

Figure 7 illustrates a simplified scheme for PLS: *X* represents the experimental measurements (e.g. spectra) and *c* (or *Y*) the concentrations. The first equation above appears similar to that of PCA, but the scores matrix also models the concentrations, and the vector *q* has some analogy to a loadings vector. The matrix *T* is common to both equations. *E* is an error matrix for the *X* block and *f* an error vector for the *c* block. The scores are orthogonal, but the loadings (*P*) are not orthogonal, unlike in PCA, and usually they are not normalized.

**Figure 7.** Simplified scheme for a PLS transformation.
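To make Equations 1 and 2 concrete, the sketch below builds the scores *T*, loadings *P*, vector *q* and residuals *E* and *f* for a single concentration vector. It assumes the NIPALS form of PLS1, which is one common way of obtaining this decomposition; the source does not state which algorithm was used. It also assumes Python with NumPy and synthetic data, and the function name `pls1` is hypothetical.

```python
import numpy as np

def pls1(X, c, n_components):
    """NIPALS PLS1 on mean-centered data: X_centered = T P + E, c_centered = T q + f."""
    Xr = np.asarray(X, dtype=float) - np.mean(X, axis=0)   # centered X block
    cr = np.asarray(c, dtype=float) - np.mean(c)           # centered c block
    n, p = Xr.shape
    T = np.zeros((n, n_components))    # scores, one column per component
    P = np.zeros((n_components, p))    # X loadings, one row per component
    q = np.zeros(n_components)         # c loadings
    for a in range(n_components):
        w = Xr.T @ cr                  # weight vector: uses the concentrations
        w /= np.linalg.norm(w)
        t = Xr @ w                     # score vector of this component
        tt = t @ t
        P[a] = Xr.T @ t / tt
        q[a] = cr @ t / tt
        T[:, a] = t
        Xr = Xr - np.outer(t, P[a])    # deflate the X block
        cr = cr - q[a] * t             # deflate the c block
    return T, P, q, Xr, cr             # the deflated blocks are E and f

# Synthetic data (assumption): an analyte band plus an interfering band and noise.
rng = np.random.default_rng(1)
axis = np.linspace(0.0, 1.0, 40)
analyte = np.exp(-((axis - 0.3) / 0.05) ** 2)
interferent = np.exp(-((axis - 0.7) / 0.05) ** 2)
c = rng.uniform(0.0, 1.0, 15)
d = rng.uniform(0.0, 1.0, 15)
X = np.outer(c, analyte) + np.outer(d, interferent) + 0.01 * rng.normal(size=(15, 40))

T, P, q, E, f = pls1(X, c, n_components=2)
```

In this sketch the columns of `T` are mutually orthogonal, whereas the rows of `P` are in general neither orthogonal nor of unit length, in line with the remarks above.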

168 Multivariate Analysis in Management, Engineering and the Sciences


Chemometrics techniques have improved in recent years, saving time and computational resources without compromising the quality of the results. In 2000, Norgaard and co-workers [32,33] developed interval partial least squares (iPLS) and presented it for use on NIR spectral data. More recently, this graphically oriented local modeling procedure has been applied in many areas, such as the petrochemical, pharmaceutical and beverage industries [34-36].

The principle of iPLS is to optimize the predictive capability of PLS regression models and to support their interpretation. The algorithm develops local PLS models on equidistant subintervals of the full-spectrum region. Its major objective is to provide an overall perspective of the significant information in different spectral subdivisions, thereby focusing on important spectral regions and removing interferences from other regions. The sensitivity of the PLS algorithm to noisy variables is highlighted by the informative iPLS plots [32]. Synergy interval PLS (siPLS) follows the same basic principle as iPLS. First, the data set is split into a number of intervals (variable-wise). Next, PLS regression models are developed for all possible combinations of two, three or four intervals. The root mean square error of cross-validation (RMSECV) is then calculated for every combination of intervals, and the combination with the lowest RMSECV is selected.
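The interval procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the iToolbox implementation: it assumes Python with NumPy, a NIPALS PLS1 regression, leave-one-sample-out cross-validation with the usual root-mean-square definition of RMSECV, and synthetic data. All function names (`pls1_coefficients`, `rmsecv`, `ipls`, `sipls`) are hypothetical.

```python
from itertools import combinations
import numpy as np

def pls1_coefficients(Xc, cc, n_components):
    """Regression vector of a NIPALS PLS1 model fitted on centered data
    (assumes the data are not degenerate, i.e. no zero weight vector)."""
    Xr, cr = Xc.copy(), cc.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ cr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt
        q_a = cr @ t / tt
        Xr = Xr - np.outer(t, p)
        cr = cr - q_a * t
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)        # b = W (P'W)^-1 q

def rmsecv(X, c, n_components):
    """Leave-one-sample-out root mean square error of cross-validation."""
    n = len(c)
    squared_error = 0.0
    for i in range(n):
        keep = np.arange(n) != i                  # leave sample i out
        x_mean, c_mean = X[keep].mean(axis=0), c[keep].mean()
        b = pls1_coefficients(X[keep] - x_mean, c[keep] - c_mean, n_components)
        c_pred = (X[i] - x_mean) @ b + c_mean
        squared_error += (c[i] - c_pred) ** 2
    return float(np.sqrt(squared_error / n))

def ipls(X, c, n_intervals, n_components):
    """iPLS: one local PLS model per subinterval; returns (first, last, RMSECV)."""
    intervals = np.array_split(np.arange(X.shape[1]), n_intervals)
    return [(int(idx[0]), int(idx[-1]), rmsecv(X[:, idx], c, n_components))
            for idx in intervals]

def sipls(X, c, n_intervals, n_components, n_combined=2):
    """siPLS: best combination of n_combined intervals by lowest RMSECV."""
    intervals = np.array_split(np.arange(X.shape[1]), n_intervals)
    results = []
    for combo in combinations(range(n_intervals), n_combined):
        idx = np.concatenate([intervals[j] for j in combo])
        results.append((combo, rmsecv(X[:, idx], c, n_components)))
    return min(results, key=lambda r: r[1])

# Synthetic data (assumption): analyte band in variables 15-29, interferent in 45-59.
rng = np.random.default_rng(2)
axis = np.arange(60)
analyte = np.exp(-((axis - 22) / 3.0) ** 2)
interferent = np.exp(-((axis - 50) / 3.0) ** 2)
c = rng.uniform(0.0, 1.0, 20)
d = rng.uniform(0.0, 1.0, 20)
X = np.outer(c, analyte) + np.outer(d, interferent) + 0.01 * rng.normal(size=(20, 60))

results = ipls(X, c, n_intervals=4, n_components=2)
best_interval = min(results, key=lambda r: r[2])
best_combo, best_combo_rmsecv = sipls(X, c, n_intervals=4, n_components=2)
```

Plotting the RMSECV of each interval as a bar, with the RMSECV of the full-spectrum model drawn as a horizontal line, would reproduce the kind of iPLS plot referred to above.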

by using chemometrics tools, such as PLS, iPLS and siPLS. The PLS program used was from PLS-ToolBox™ (Eigenvector Research Inc.) for use with MatLab™. The iPLS and siPLS algorithms used in this work were run with iToolbox™ (downloaded from http://www.models.kvl.dk). The performance of the final PLS, iPLS and siPLS models was evaluated according to the root mean square error of cross-validation (RMSECV), using a leave-one-sample-out cross-validation method, and the predictive ability of the models was assessed by the root mean square error of prediction (RMSEP) and the correlation coefficient (R). In general, for all the PLS models the RMSECV was calculated as follows:

$$\mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n_{cal}} (c_i - c_p)^2}{n_{cal}}} \tag{3}$$

where *c*<sub>i</sub> and *c*<sub>p</sub> are the experimental and predicted concentrations, respectively, of the ith calibration sample when it is situated in a left-out segment, and *n*<sub>cal</sub> is the number of calibration samples in the training set. The number of PLS components included in the model is selected according to the lowest RMSECV. This procedure is repeated for each of the preprocessed spectra. For the test set, the root mean square error of prediction (RMSEP) is calculated as follows:

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n_{test}} (c_i - c_p)^2}{n_{test}}} \tag{4}$$

where *n*<sub>test</sub> is the number of samples in the test set.

The model with the overall lowest RMSECV is selected as the final model. Correlation coefficients between the predicted and the true concentrations are calculated for both the calibration and the test set according to Equation 5, where *c̄* is the mean of the experimental measurement results for all samples in the training and test sets and *n* is the number of samples in the set being evaluated:

$$R = \sqrt{1 - \frac{\sum_{i=1}^{n} (c_i - c_p)^2}{\sum_{i=1}^{n} (c_i - \bar{c})^2}} \tag{5}$$

The implementation of new methodologies for enhanced detection of hazardous compounds such as explosives is always attractive for many countries, principally for defense and security applications. Terrorists employ different ways to pose threats and commit illegal acts against military personnel and civilians. In view of this situation, our study focused on the detection of explosives present in mixtures prepared intentionally with a pharmaceutical product, employing remote Raman detection and chemometrics tools. Remote Raman spectra of PETN, APAP and their mixtures are illustrated in Figure 8. The results show that mean centering (MC) was the most successful pre-processing method for correcting the background and was selected for the construction of further models because they presented small improvement in RMSEC.

The full spectrum was split into 20 independent intervals, and the RMSECV values for PLS models constructed with different intervals are shown in Figure 9. Models with no intervals were better than PLS models with all variables (dotted line), and intervals 6 (1185.2-1328.9 cm⁻¹), 9 (1619.8-1755.4 cm⁻¹) and 19 (2878-2988.4 cm⁻¹) presented the lowest RMSECV values, where more variability exists. These values are shown in Table 1. The number of

Finally, cluster analysis is the name given to a set of techniques that seek to determine the structural characteristics of multivariate data sets by dividing the data into groups, clusters, or hierarchies. For cluster analysis, each sample is treated as a point in an n-dimensional measurement space. The coordinate axes of this space are defined by the measurements used to characterize the samples. Cluster analysis assesses the similarity between samples by measuring the distances between the points in the measurement space. Samples that are similar lie close to one another, whereas dissimilar samples are distant from each other [28].
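The distance-based grouping described above can be illustrated with a small hierarchical example. The sketch assumes Python with NumPy, Euclidean distances and single-linkage agglomerative merging; these are illustrative choices, since the text does not fix a distance measure or linkage rule. The function names and the six two-dimensional "samples" are hypothetical.

```python
import numpy as np

def euclidean_distances(X):
    """Matrix of pairwise Euclidean distances between the rows (samples) of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def agglomerative(X, n_clusters):
    """Single-linkage agglomerative clustering; returns a list of sets of sample indices."""
    X = np.asarray(X, dtype=float)
    D = euclidean_distances(X)
    clusters = [{i} for i in range(len(X))]       # start with one cluster per sample
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]                # merge the two closest clusters
        del clusters[b]
    return clusters

# Six samples in a 2-dimensional measurement space (illustrative values only).
samples = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
groups = agglomerative(samples, n_clusters=2)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, would give the hierarchy usually drawn as a dendrogram in HCA.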

In this chapter, remote Raman detection experiments were performed to quantify HEM such as PETN present in mixtures with non-HEM. The remote measurements were carried out at 10 m, employing a frequency-doubled 532 nm Nd:YAG pulsed laser as the excitation source. The quantification study was performed using PLS, iPLS and siPLS as chemometrics tools to achieve the best correlation between the remote Raman signal and the concentration (%) of the explosive PETN in a mixture with a pharmaceutical compound. The chemical warfare agent simulant (CWAS) TEP, concealed within commercial beverage bottles, was discriminated using optical fiber coupled Raman spectroscopy together with chemometrics techniques such as PLS and PLS-DA. Finally, chemometrics-based analysis of infrared spectroscopic information was designed and implemented for the detection of the HEM 2,4-DNT, TATP, PETN and RDX, present at trace levels on surfaces and in air, by chemometrics-enhanced vibrational spectroscopy.
