**4.1. Pedagogical example: how to perform PCA step by step (see Matlab code in appendix)**

The data in Table 2 consist of the fluorescence intensities at four different wavelengths for 10 hypothetical samples, numbered 1-10.

The data processing presented here was performed with Matlab v2007b. Like most data processing software, Matlab offers statistical tools that perform PCA in a single mouse click or a one-line command, but we choose to detail the calculation and go through the steps leading to the factorial coordinates (scores) and the factor contributions (loadings). We then interpret the results.

<sup>\*</sup> Demonstrations can be found elsewhere showing that the non-zero eigenvalues of *M*<sup>T</sup>*M* and *MM*<sup>T</sup> are the same. The reason is that the characteristic equations of *M*<sup>T</sup>*M* and *MM*<sup>T</sup> have the same non-zero roots.

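<sup>\*\*</sup> We know that *M* = *USV*<sup>T</sup>. Therefore, to find *U* knowing *S*, post-multiply by *V* and then by *S*<sup>-1</sup> to obtain *MVS*<sup>-1</sup> = *USS*<sup>-1</sup> (using *V*<sup>T</sup>*V* = *I*). Hence *U* = *MVS*<sup>-1</sup>, because *SS*<sup>-1</sup> = *I*.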
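The two footnotes can be checked numerically. The following is a minimal sketch in Python with NumPy (not the Matlab code of the appendix); the matrix `M` is a hypothetical 5 x 3 example chosen only for illustration, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Hypothetical 5 x 3 matrix, NOT the chapter's data; any full-rank matrix would do.
M = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [3.0, 1.0, 1.0],
              [1.0, 2.0, 2.0]])

# Eigenvalues of the two symmetric products, sorted in descending order.
eig_small = np.linalg.eigvalsh(M.T @ M)[::-1]   # 3 eigenvalues
eig_large = np.linalg.eigvalsh(M @ M.T)[::-1]   # 5 eigenvalues

# The non-zero eigenvalues coincide; M M^T only contributes extra zeros.
same_nonzero = np.allclose(eig_small, eig_large[:3])
extra_are_zero = np.allclose(eig_large[3:], 0.0, atol=1e-9)

# V holds the eigenvectors of M^T M; S holds the singular values on its diagonal.
eigenvalues, V = np.linalg.eigh(M.T @ M)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]
S = np.diag(np.sqrt(eigenvalues))

# Second footnote: U = M V S^-1.
U = M @ V @ np.linalg.inv(S)

reconstructs_M = np.allclose(U @ S @ V.T, M)      # M = U S V^T
U_is_orthonormal = np.allclose(U.T @ U, np.eye(3))
print(same_nonzero, extra_are_zero, reconstructs_M, U_is_orthonormal)
```

The inversion of `S` only works when all singular values are non-zero, which is why a full-rank example was chosen.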


**Table 2.** Fluorescence intensities for four wavelengths measured on 10 samples.

#### **Step 1.** Centring the data matrix by the average

This step subtracts from the intensity values of each column the average of that column. In other words, for each wavelength, compute the mean over all samples and subtract this value from the fluorescence intensity of each sample at the same wavelength.
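As an illustration of the centring step, here is a minimal sketch in Python with NumPy (not the Matlab code of the appendix). The intensities below are hypothetical and are not the values of Table 2; the names `X`, `col_means` and `X_centred` are assumptions of this sketch.

```python
import numpy as np

# Hypothetical intensities (rows = samples, columns = wavelengths); NOT Table 2.
X = np.array([[15.0, 11.0,  6.0,  4.0],
              [16.0, 12.0,  7.0,  3.0],
              [60.0, 70.0, 40.0, 45.0],
              [62.0, 69.0, 42.0, 44.0]])

col_means = X.mean(axis=0)   # one mean per wavelength (column)
X_centred = X - col_means    # broadcasting subtracts each column's own mean

# After centring, every column averages to zero (up to rounding error).
print(col_means)
print(X_centred.mean(axis=0))
```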

#### **Step 2.** Calculation of the variance-covariance matrix

The variance-covariance matrix (Table 3) is calculated from the *X*<sup>T</sup>*X* product of the centred data matrix *X*, namely the Gram matrix. The diagonal of this matrix consists of the variances; the trace of the matrix (the sum of its diagonal elements) corresponds to the total variance (3294.5) of the original data matrix. This matrix shows, for example, that the covariance between the fluorescence intensities at 420 and 520 nm is equal to 665.9.


**Table 3.** Variance-covariance matrix

As mentioned above, the matrix also contains the variances of the fluorescence intensities at each wavelength on the diagonal. For example, for the fluorescence intensities at 474 nm, the variance is 1194.2.
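The construction of the variance-covariance matrix can be sketched as follows, in Python with NumPy rather than the chapter's Matlab. The data are hypothetical (not Table 2), and the division by n - 1 (unbiased estimator) is an assumption of this sketch, since the text only mentions the *X*<sup>T</sup>*X* product.

```python
import numpy as np

# Hypothetical intensities (rows = samples, columns = wavelengths); NOT Table 2.
X = np.array([[15.0, 11.0,  6.0,  4.0],
              [16.0, 12.0,  7.0,  3.0],
              [60.0, 70.0, 40.0, 45.0],
              [62.0, 69.0, 42.0, 44.0]])

n_samples = X.shape[0]
X_centred = X - X.mean(axis=0)

# Assumption: unbiased estimator, i.e. X^T X divided by (n - 1).
cov = X_centred.T @ X_centred / (n_samples - 1)

variances = np.diag(cov)         # diagonal: variance of each wavelength
total_variance = np.trace(cov)   # trace: total variance of the data
print(variances, total_variance)
```

With this convention the result agrees with `np.cov(X, rowvar=False)`, which may serve as a cross-check.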

The technique for calculating the eigenvectors of the variance-covariance matrix, and so the principal components, is called eigenvalue analysis (*eigenanalysis*).

#### **Step 3.** Calculate the eigenvalues and eigenvectors

Table 4 shows the eigenvalues and eigenvectors obtained by the diagonalization of the variance-covariance matrix (a process that will not be presented here, but some details on this mathematical process will be found elsewhere [*23*]).
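Although the diagonalization itself is not detailed in the text, it can be sketched numerically. The example below is in Python with NumPy (not the chapter's Matlab) and uses `numpy.linalg.eigh` as a stand-in for whatever routine the authors used. It takes only the rounded eigenvalues and eigenvectors printed in Table 4, rebuilds an approximate variance-covariance matrix from them, and diagonalizes it again; the variable names are assumptions of this sketch.

```python
import numpy as np

# Eigenvalues and eigenvectors as printed in Table 4 (one eigenvector per column).
eigenvalues = np.array([3163.3645, 130.5223, 0.5404, 0.1007])
V = np.array([[0.5662,  0.6661,  0.4719,  0.1143],
              [0.6141, -0.0795, -0.7072,  0.3411],
              [0.3760, -0.0875, -0.1056, -0.9164],
              [0.4011, -0.7365,  0.5157,  0.1755]])

# Rebuild an approximate variance-covariance matrix: C = V diag(lambda) V^T.
# It is only approximate because the printed values are rounded to 4 decimals.
C = V @ np.diag(eigenvalues) @ V.T

# Diagonalize C and sort the eigenpairs by decreasing eigenvalue.
w, vecs = np.linalg.eigh(C)
order = np.argsort(w)[::-1]
w, vecs = w[order], vecs[:, order]

print(w)            # close to the eigenvalues of Table 4
print(np.trace(C))  # close to the total variance quoted in Step 2 (3294.5)
print(C[1, 1])      # close to the variance at 474 nm (1194.2)
print(C[0, 2])      # close to the 420/520 nm covariance (665.9)
```

The sign of each recovered eigenvector is arbitrary, so the columns of `vecs` may differ from Table 4 by a factor of -1.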

Note that the sum of the eigenvalues is equal to the sum of the variances in the variance-covariance matrix, which is not surprising since the eigenvalues are calculated from the variance-covariance matrix.

| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| **Eigenvalues** | 3163.3645 | 0.0000 | 0.0000 | 0.0000 |
| | 0.0000 | 130.5223 | 0.0000 | 0.0000 |
| | 0.0000 | 0.0000 | 0.5404 | 0.0000 |
| | 0.0000 | 0.0000 | 0.0000 | 0.1007 |
| **Eigenvectors** (420 nm) | 0.5662 | 0.6661 | 0.4719 | 0.1143 |
| (474 nm) | 0.6141 | -0.0795 | -0.7072 | 0.3411 |
| (520 nm) | 0.3760 | -0.0875 | -0.1056 | -0.9164 |
| (570 nm) | 0.4011 | -0.7365 | 0.5157 | 0.1755 |

**Table 4.** Eigenvalues and eigenvectors calculated by diagonalization of the variance-covariance matrix.

The next step of the calculation is to determine what percentage of the variance is explained by each principal component. This is done by using the fact that the sum of the eigenvalues corresponds to 100% of the explained variance, as follows:

$$\%\,\mathrm{Var}_{j^{\text{th}}\,\mathrm{PC}} = \frac{\lambda_j}{\sum_{i=1}^{l} \lambda_i} \times 100\%$$

where *λ<sub>j</sub>* is the jth eigenvalue and *l* is the number of eigenvalues. We thus obtain for the first component PC1 = (3163.4 / 3294.5) \* 100 = 96.02%. PC2 is associated with 3.96% and so on, as shown in Table 5 below.

| PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|
| 96.02 | 3.96 | 0.02 | 0.00 |

**Table 5.** Percentage of the explained variance for each principal component.

#### **Step 4.** Calculating factorial coordinates - Scores

The eigenvectors calculated above are the principal components and the values given in Table 4 are the coefficients of each principal component. Thus, component #1 is written: PC1 = 0.566\*X1 + 0.614\*X2 + 0.376\*X3 + 0.401\*X4, where X1, X2, X3 and X4 are the fluorescence intensities at 420, 474, 520 and 570 nm respectively. The factorial coordinates of each sample in the new space formed by the principal components can now be calculated directly from the equations of the PCs. For example, sample No. 1 has a coordinate on PC1 equal to: 0.566\*15.4 + 0.614\*11.4 + 0.376\*6.1 + 0.401\*3.6 = 19.453. Table 6 presents the factorial coordinates of all the samples of the initial matrix.

| Centred data | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 1 | 19.455 | 6.128 | 0.377 | 0.678 |
| 2 | 20.121 | 6.671 | 0.072 | 0.983 |
| 3 | 20.420 | 4.528 | -1.256 | 0.156 |
| 4 | 19.357 | 7.534 | 1.494 | 0.575 |
| 5 | 20.928 | 7.635 | 0.259 | 0.051 |
| 6 | 119.413 | 20.924 | 0.667 | 0.179 |
| 7 | 140.394 | 23.343 | -0.522 | 0.843 |
| 8 | 123.676 | -0.601 | 0.537 | 0.228 |
| 9 | 116.052 | 6.511 | 0.590 | 0.397 |
| 10 | 130.754 | -18.524 | 0.056 | 0.632 |

**Table 6.** Factorial coordinates = *scores* of the initial sample matrix in the principal components space.
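The percentages of explained variance and the PC1 coordinate of sample No. 1 can be reproduced from the numbers quoted in this section. The following is a minimal sketch in Python with NumPy (not the chapter's Matlab); it uses only the eigenvalues of Table 4, the PC1 coefficients and the four intensities of sample No. 1 given in the text, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Eigenvalues as printed in Table 4.
eigenvalues = np.array([3163.3645, 130.5223, 0.5404, 0.1007])

# Each eigenvalue as a percentage of the sum of all eigenvalues.
explained = 100.0 * eigenvalues / eigenvalues.sum()
print(np.round(explained, 2))   # compare with Table 5: 96.02, 3.96, 0.02, 0.00

# PC1 coefficients and the intensities of sample No. 1, as quoted in the text.
pc1 = np.array([0.566, 0.614, 0.376, 0.401])
sample_1 = np.array([15.4, 11.4, 6.1, 3.6])

# The score is the weighted sum of the four intensities.
score_1 = pc1 @ sample_1
print(round(score_1, 3))        # 19.453
```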

With these scores, all the samples can now be represented in the space of the first two components, as shown in Figure 4 below.

**Figure 4. On the left-hand side**: scores plot of the samples in the space of the first two principal components; **on the right-hand side**: sample scores on PC1.

The scores plot shows that the samples of the initial data are divided into two distinct groups. Another way to visualize this division is to plot only the scores on PC1 (or on PC2, or on any PCi) for each sample (see Figure 4). In this case, it is also obvious that samples 1 to 5 have similar PC1 values and thus form a first group (group 1), while samples 6-10 form the second group (group 2).
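The separation on PC1 can also be read directly from the numbers. The sketch below, in plain Python, takes the PC1 scores listed in Table 6, assuming that its rows follow the order of samples 1 to 10. The midpoint cut-off is a crude device introduced here for illustration only; it is not a rule taken from the chapter.

```python
# PC1 scores copied from Table 6 (assumed order: samples 1 to 10).
pc1_scores = [19.455, 20.121, 20.420, 19.357, 20.928,
              119.413, 140.394, 123.676, 116.052, 130.754]

# Assumption: the midpoint between the lowest and highest score is used as cut-off.
cutoff = (min(pc1_scores) + max(pc1_scores)) / 2

group_1 = [i + 1 for i, s in enumerate(pc1_scores) if s < cutoff]
group_2 = [i + 1 for i, s in enumerate(pc1_scores) if s >= cutoff]

print(group_1)  # [1, 2, 3, 4, 5]
print(group_2)  # [6, 7, 8, 9, 10]
```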

#### **Step 5.** Calculating factorial contributions - Loadings

A complete interpretation of the results of PCA involves the graph of the loadings, i.e. the projection of the variables in the sample space. But how does one get this? Consider what has been calculated so far: the eigenvalues and eigenvectors of the sample matrix and the factorial coordinates of the samples in the space of the principal components. We now need to calculate the factorial coordinates of the variables in the sample space. The results are called "*loadings*" or "*factorial contributions*". To obtain them, simply transpose the initial data matrix and repeat the entire calculation. Figure 5 shows the results, representing the position of the variables of the problem in the plane formed by the first two PCs.
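A literal reading of this recipe (transpose, then repeat Steps 1 to 4) can be sketched as follows, in Python with NumPy rather than the chapter's Matlab. The data are hypothetical (not Table 2), the helper `factorial_coordinates` is a name invented for this sketch, and projecting the centred data with an n - 1 covariance divisor is an assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def factorial_coordinates(data):
    """Centre the columns, diagonalize the covariance matrix, project the rows."""
    centred = data - data.mean(axis=0)
    cov = centred.T @ centred / (data.shape[0] - 1)   # assumed (n - 1) divisor
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]             # largest variance first
    return centred @ eigenvectors[:, order], eigenvalues[order]

# Hypothetical intensities (rows = samples, columns = wavelengths); NOT Table 2.
X = np.array([[15.0, 11.0,  6.0,  4.0],
              [16.0, 12.0,  7.0,  3.0],
              [14.0, 13.0,  5.0,  5.0],
              [60.0, 70.0, 40.0, 45.0],
              [62.0, 69.0, 42.0, 44.0],
              [58.0, 72.0, 41.0, 47.0]])

scores, score_eigenvalues = factorial_coordinates(X)   # one row per sample
loadings, _ = factorial_coordinates(X.T)               # one row per wavelength

print(scores[:, :2])     # coordinates of the samples on the first two PCs
print(loadings[:, :2])   # coordinates of the variables on the first two PCs
```

Plotting the first two columns of `scores` and of `loadings` would give the kind of pair of graphs described for Figure 5.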

The graph of the loadings shows which variables characterize each group of samples on the graph of the scores. We see, in particular, that the samples of Group 1 are distinguished from the others by variables 420 and 474, while the Group 2 samples are distinguished from the others by variables 520 and 570. In other words, for these two pairs of variables, the groups of samples have opposite values in quantitative terms: when one group has high values for a given pair of variables, the other group has low values for the same pair, and vice versa.

When the measured variables are of structural (mass spectrometry data) or spectral types (infrared bands or chemical shifts in NMR), the joint interpretation of scores and loadings can be extremely interesting because one can be in a position to know what distinguishes the groups of samples from a molecular point of view.

**Figure 5.** Scores & loadings on the PC1xPC2 plane.
