**2. General considerations and historical introduction**

The origin of PCA is intertwined with that of linear regression. In 1870, Sir Francis Galton worked on the measurement of the physical features of human populations. He assumed that many physical traits are transmitted by heredity. From theoretical assumptions, he supposed that the children of exceptionally tall parents will, eventually, tend to have a height close to the mean of the entire population. This conclusion greatly disturbed Galton, who interpreted it as a "move to mediocrity," in other words as a kind of *regression* of the human race. This led him in 1889 to formulate his law of *universal regression*, which gave birth to the statistical tool of *linear regression*. Nowadays, the word "regression" is still in use in statistical science (but obviously without any connotation of the regression of the human race). Around 30 years later, Karl Pearson,1 who was one of Galton's disciples, exploited the statistical work of his mentor and built up the mathematical framework of *linear regression*. In doing so, he laid down the basis of the correlation calculation, which plays an important role in PCA. Correlation and linear regression were exploited later on in the fields of psychometrics, biometrics and, much later, in chemometrics.

Thirty-two years later, Harold Hotelling2 made use of correlation in the same spirit as Pearson and imagined a graphical presentation of the results that was easier to interpret than tables of numbers. At that time, Hotelling was concerned with the *economic games* of companies. He worked on the concept of economic competition and introduced a notion of spatial competition in *duopoly*. This refers to an economic situation where several "players" offer similar products or services in possibly overlapping trading areas. Their potential customers are thus in a situation of choice between the different available products, which can lead to illegal agreements between companies; Hotelling was already looking into this problem. In 1933, he published in the *Journal of Educational Psychology* a fundamental article entitled "Analysis of a Complex of Statistical Variables with Principal Components," which finally introduced the use of special variables called *principal components*. These new variables allow for easier viewing of the intrinsic characteristics of observations.

PCA (also known as the Karhunen-Loève or Hotelling transform) is a member of those descriptive methods called *multidimensional factorial methods*. It has seen many improvements, including those developed in France by the group headed by Jean-Paul Benzécri3 in the 1960s, which exploited in particular geometry and graphs. Since PCA is a descriptive method, it is not based on a probabilistic model of the data but simply aims to provide a geometric representation.

<sup>1</sup> Famous for his contributions to the foundation of modern statistics (χ2 test, linear regression, principal components analysis), Karl Pearson (1857-1936), English mathematician, taught mathematics at London University from 1884 to 1933 and was the editor of *The Annals of Eugenics* (1925 to 1936). As a good disciple of Francis Galton (1822-1911), Pearson continued the statistical work of his mentor, which led him to lay the foundation for calculating correlations (on the basis of principal component analysis) and marked the beginning of the new science of biometrics.

<sup>2</sup> Harold Hotelling: American economist and statistician (1895-1973). Professor of Economics at Columbia University, he was responsible for significant contributions to statistics in the first half of the 20th century, such as the calculus of variations, production functions based on profit maximization, and the use of the t distribution for the validation of assumptions leading to the calculation of confidence intervals.

<sup>3</sup> French statistician. Alumnus of the École Normale Supérieure, professor at the Institute of Statistics, University of Paris. Founder of the French school of data analysis in the years 1960-1990, Jean-Paul Benzécri developed statistical tools, such as correspondence analysis, that can handle large amounts of data in order to visualize and prioritize the information.


#### **3. PCA: Description, use and interpretation of the outputs**

Among the multivariate analysis techniques, PCA is the most frequently used because it is a starting point in the process of data mining [*1, 2*]. It aims at reducing the dimensionality of the data. Indeed, it is common to deal with large data sets in which a set of *n* objects is described by a number *p* of variables. The data is gathered in a matrix *X*, with *n* rows and *p* columns, the element *xij* referring to the entry of *X* at the *i*th row and the *j*th column. Usually, a row of *X* corresponds to an "observation", which can be a set of physicochemical measurements or a spectrum or, more generally, an analytical curve obtained from the analysis of a real sample performed with an instrument producing analytical curves as output data. A column of *X* is usually called a "variable". With regard to the type of analysis that concerns us, we are typically faced with multidimensional data of size *n* × *p*, where *n* and *p* are of the order of several hundreds or even thousands. In such situations, it is difficult to identify any relevant information in this set without the help of a mathematical technique such as PCA. This technique is commonly used in all areas where data analysis is necessary, particularly in food research laboratories and industries, where it is often used in conjunction with other multivariate techniques such as discriminant analysis (Table 1 indicates a few published works in the area of food from among a huge number of publications involving PCA).

#### **3.1. Some theoretical aspects**

The key idea of PCA is to represent the original data matrix *X* by the product of two smaller matrices *T* and *P* (respectively the scores matrix and the loadings matrix), such that:

$$\underset{(n \times p)}{X} = \underset{(n \times q)}{T} \cdot \underset{(q \times p)}{P^{T}} + \underset{(n \times p)}{E} \tag{1}$$

Or the non-matrix version:

$$x_{ij} = \sum_{k=1}^{K} t_{ik}\, p_{kj} + e_{ij}$$

With the conditions $p_{i}^{T} p_{j} = 0$ and $t_{i}^{T} t_{j} = 0$ for $i \neq j$.

This orthogonality is a means to ensure the non-redundancy (at least at a minimum) of information "carried" by each estimated principal component.

Equation 1 can be expressed in a graphical form, as follows:


**Scheme 1.** A matricized representation of the PCA decomposition principle.

Scheme 2 translates this representation into a vectorized version, which shows how the *X* matrix is decomposed into a sum of column-vectors (components) and row-vectors (eigenvectors). In the case of spectroscopic or chromatographic data, these components and eigenvectors take on a chemical meaning: respectively, the proportion of constituent *i* for the *i*th component, and the "pure spectrum" or "pure chromatogram" for the *i*th eigenvector.

**Scheme 2.** Schematic representation of the PCA decomposition as a sum of "components" and "eigenvectors" with their chemical significance.
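As a minimal numerical sketch of this principle (Python with numpy, and a small random matrix standing in for real spectra — an illustration, not data from this chapter), the sum of the rank-one products of score and loading vectors rebuilds the centred data matrix exactly when all components are kept:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))            # n = 12 observations, p = 4 variables
Xc = X - X.mean(axis=0)                 # mean-centre each column

# Scores T and loadings P such that Xc = T P^T (full decomposition, E = 0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T, P = U * s, Vt.T

# Scheme 2: Xc rebuilt as a sum of outer products of score and loading vectors
X_sum = sum(np.outer(T[:, k], P[:, k]) for k in range(P.shape[1]))
print(np.allclose(Xc, X_sum))           # True
```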

The mathematical question behind this re-expression of *X* is: is there another basis, a linear combination of the original basis, that re-expresses the original data? The term "basis" means here a mathematical basis of unit vectors that support all other vectors of data. Regarding linearity, which is one of the basic assumptions of PCA, the general response to this question can be written, in the case where *X* is perfectly re-expressed by the matrix product *T*·*P*T, as follows:

$$PX = T \tag{2}$$


Equation 2 represents a *change of basis* and can be interpreted in several ways: *P* transforms *X* into *T*; geometrically, *P* is a rotation and a stretch that maps *X* onto *T*; or the rows of *P*, {p1, …, pm}, are a set of new basis vectors for expressing the columns of *X*.

Therefore, the solution offered by PCA consists of finding the matrix *P*, for which at least three ways are possible:

1. by calculating the eigenvectors of the square, symmetric covariance matrix *X*T*X* (eigenvector analysis, implying a diagonalization of *X*T*X*);
2. by calculating the eigenvectors of *X* through the direct decomposition of *X* using an iterative procedure (NIPALS);
3. by the singular value decomposition of *X*; this method is a more general algebraic solution of PCA.


The dual nature of expressions *T* = *PX* and *X* = *TP*T leads to a comparable result when PCA is applied on *X* or on its transposed *X*T. The score vectors of one are the eigenvectors of the other. This property is very important and is utilized when we compute the principal components of a matrix using the covariance matrix method (see the pedagogical example below).
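The following toy sketch (numpy; random data, not from this chapter) follows route 1 above — diagonalization of *X*T*X* — and checks that it agrees with the SVD route up to the usual sign indeterminacy of the axes:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Xc = X - X.mean(axis=0)

# Route 1: eigenvectors of the (p x p) covariance-type matrix X^T X
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
P_eig = eigvecs[:, order]               # loadings
T_eig = Xc @ P_eig                      # scores by projection

# Route 3: singular value decomposition
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T_svd = U * s

# Same axes: corresponding columns may differ only by a sign
for k in range(5):
    sign = np.sign(P_eig[:, k] @ Vt[k])
    assert np.allclose(P_eig[:, k] * sign, Vt[k])
    assert np.allclose(T_eig[:, k] * sign, T_svd[:, k])
print("covariance-eigenvector PCA and SVD agree up to sign")
```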

#### **3.2. Geometrical point of view**


#### *3.2.1. Change of basis vectors and reduction of dimensionality*

Consider a *p*-dimensional space where each dimension is associated with a variable. In this space, each observation is characterized by its coordinates corresponding to the value of variables that describe it. Since the raw data is generally too complex to lead to an interpretable representation in the space of the initial variables, it is necessary to "compress" or "reduce" the *p*-dimensional space into a space smaller than *p*, while maintaining the maximum information. The amount of information is statistically represented by the variances. PCA builds new variables by the linear combination of original variables. Geometrically, this change of variables results in a change of axes, called *principal components,* chosen to be orthogonal4. Each newly created axis defines a direction that describes a part of the global information.

The first component (i.e. the first axis) is calculated in order to represent the main pieces of information, and then comes the second component, which represents a smaller amount of information, and so on. In other words, the *p* original variables are replaced by a set of new variables, the *components*, which are linear combinations of these original variables. The variances of the components are sorted in decreasing order. By construction of PCA, the whole set of components keeps all of the original variance. The dimensions of the space are not then reduced, but the change of axes allows a better representation of the data. Moreover, by retaining the *q* first principal components (with *q* < *p*), one is assured to retain the maximum of the variance contained in the original data for a *q*-dimensional space. This reduction from *p* to *q* dimensions is the result of the projection of points in a *p*-dimensional space into a subspace of dimension *q*. A highlight of the technique is the ability to represent simultaneously or separately the samples and variables in the space of initial components.

<sup>4</sup> The orthogonality ensures the non-correlation of these axes and, therefore, the information carried by an axis is not even partially redundant with that carried by another.
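To make this variance bookkeeping concrete, here is a small sketch (numpy; the simulated correlated data is an assumption for illustration) of the share of the total variance retained by the first *q* components:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated variables
Xc = X - X.mean(axis=0)

_, s, _ = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / (X.shape[0] - 1)           # variance carried by each component
ratio = np.cumsum(var) / var.sum()

# By construction, the q first components retain the largest possible
# share of the total variance for a q-dimensional projection.
for q, r in enumerate(ratio, start=1):
    print(f"q = {q:2d}: {100 * r:5.1f} % of total variance retained")
```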



| Food(s) | Analysed compounds | Analytical technique(s) | Chemometrics | Aim of study | Year [*Ref.*] |
|---|---|---|---|---|---|
| Cheeses | Water-soluble compounds | HPLC | PCA, LDA | Classification | 1990 [*3*] |
| Edible oils | All chemicals between 4800 and 800 cm⁻¹ | FTIR (Mid-IR) | PCA, LDA | Authentication | 1994 [*4*] |
| Fruit puree | All chemicals between 4000 and 800 cm⁻¹ | FTIR (Mid-IR) | PCA, LDA | Authentication | 1995 [*5*] |
| Orange juice | All chemicals between 9000 and 4000 cm⁻¹ | FTIR (NIR) | PCA, LDA | Authentication | 1995 [*6*] |
| Green coffee | All chemicals between 4800 and 350 cm⁻¹ | FTIR (Mid-IR) | PCA, LDA | Origin | 1996 [*7*] |
| Virgin olive oil | All chemicals between 3250 and 100 cm⁻¹ | FT-Raman (Far- and Mid-Raman) | PCR, LDA, HCA | Authentication | 1996 [*8*] |
| Coffees | Physicochemical parameters | — | PCA | Origin | 1996 [*10*] |
| — | Synthetic substances | MS + IR | PCA, HCA | Authentication | 1998 [*9*] |
| Garlic products | Volatile sulphur compounds | GC-MS | PCA | Classification | 1998 [*11*] |
| — | — | — | PCA, LDA | — | 1998 [*11*] |
| — | — | — | LDA | Classification | 1998 [*12, 13*] |
| Almonds | Fatty acids | GC | — | — | — |
| Apple juice | Aromas | Capillary GC-MS + chiral GC-MS | PCA | Authentication | 1999 [*14*] |
| Honeys | All chemicals between 25000 and 4000 cm⁻¹ | IR (NIR + Visible) | PCA + others | Authentication | 2000 [*15*] |
| Coffees | Physicochemical parameters + chlorogenic acid | Physicochemical techniques + HPLC for chlorogenic acid determination | — | Classification according to botanical origin and other criteria | 2001 [*16*] |
| Cider apple fruits | Physicochemical parameters | Physicochemical techniques + HPLC for sugars | — | Comparison of classification methods | 2001 [*17*] |
| Honeys | Sugars | GC-MS | PCA, LDA | Classification according to floral origin | 2001 [*18*] |
| Red wine | Phenolic compounds | HPLC | PCA (+PLS) | Relationship between phenolic compounds and antioxidant power | 2001 [*19*] |
| Meat | All chemicals between 9000 and 4000 cm⁻¹ | IR (NIR) | PCA, LDA | Classification | 2002 [*19*] |
| General5 | General | FTIR (Mid-IR) | PCA, LDA | — | — |

**Table 1.** Examples of analytical work on various food products, involving the PCA (and/or LDA) as tools for data processing (from 1990 to 2002). Not exhaustive.

<sup>5</sup> Comprehensive overview on the use of chemometric tools for processing data derived from infrared spectroscopy applied to food analysis.

#### *3.2.2. Correlation circle for discontinuous variables*


In the case of physicochemical or sensorial data, and more generally in cases where variables are not continuous as they are in spectroscopy or chromatography, a powerful tool is available for interpreting the meaning of the axes: the correlation circle. On this graph, each variable is associated with a point whose coordinate on a factorial axis is a measure of the correlation between the variable and that factor. In the space of dimension *p*, the maximum distance of the variables from the origin is equal to 1. So, by projection on a factorial plane, the variables fall within a circle of radius 1 (the correlation circle), and the closer they are to the edge of the circle, the better they are represented by the plane of factors, i.e. the better the variables correlate with the two factors constituting the plane.

The angle between two variables, measured by its cosine, is equal to the linear correlation coefficient between the two variables: cos(angle) = *r*(V1, V2). Hence:

- if the points are very close (angle close to 0°): cos(angle) = *r*(V1, V2) = 1, then V1 and V2 are very highly positively correlated;
- if the angle is equal to 90°: cos(angle) = *r*(V1, V2) = 0, then there is no linear correlation between V1 and V2;
- if the points are opposite (angle of 180°): cos(angle) = *r*(V1, V2) = -1, then V1 and V2 are very strongly negatively correlated.


An illustration of this is given by Figure 1, which presents a correlation circle obtained from a principal component analysis on physicochemical data measured on palm oil samples. In this example, different chemical parameters (such as *Lauric acid* content, concentration of *saponifiable* compounds, iodine index, *oleic acid* content, etc.) have been measured. One can note, for example, that on the PC1 axis "*iodine index*" and "*Palmitic*" are close together and so have a high correlation; in the same way, "*iodine index*" and "*Oleic*" are positively correlated because they are close together and close to the circle. On the other hand, a similar interpretation can be made with the "*Lauric*" and "*Miristic*" variables, indicating a high correlation between these two variables, which are together negatively correlated on PC1. On PC2, the "*Capric*" and "*Caprilic*" variables are highly correlated. Obviously, the correlation circle should be interpreted jointly with another graph (named the score-plot), resulting from the calculation of the samples' coordinates in the new principal components space, which we discuss below in this chapter.

**Figure 1.** Example of score-plot and correlation circle obtained with PCA.
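For readers who want to reproduce such a graph, here is a hedged numpy sketch of how the correlation-circle coordinates can be computed (simulated data; the two retained axes are the usual PC1-PC2 plane):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))  # 6 correlated variables
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                    # scores

# Coordinate of variable j on PC axis k = correlation(variable j, score k)
coords = np.array([[np.corrcoef(Xc[:, j], T[:, k])[0, 1]
                    for k in range(2)] for j in range(Xc.shape[1])])

# Every variable lies inside the circle of radius 1 on the PC1-PC2 plane
radii = np.hypot(coords[:, 0], coords[:, 1])
print(coords.round(2))
print("max radius:", radii.max().round(3))   # <= 1 by construction
```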

#### *3.2.3. Scores and loadings*

We speak of **scores** to denote the **coordinates** of the observations on the PC components; the corresponding graphs (objects projected in successive planes defined by two principal components) are called **score-plots**. **Loadings** denotes the **contributions** of the original variables to the various components; the corresponding graphs, called **loadings-plots**, can be seen as the projection of the unit vectors representing the variables in the successive planes of the main components. As scores are a representation of the observations in the space formed by the new axes (principal components), symmetrically, loadings represent the variables in the space of principal components.


Observations close to each other in the space of principal components necessarily have similar characteristics. This proximity in the initial space leads to a close neighbouring in the score-plots. Similarly, the variables whose unit vectors are close to each other are said to be positively correlated, meaning that their influence on the positioning of objects is similar (again, these proximities are reflected in the projections of variables on loadings-plot). However, variables far away from each other will be defined as being negatively correlated.

When we speak of *loadings*, it is necessary to distinguish two different cases depending on the nature of the data. When the data contains discontinuous variables, as in the case of physicochemical data, the loadings are represented in a factorial plane, i.e. PC1 vs. PC2, showing each variable in the space of the PCs. However, when the data is continuous (in the case of spectroscopic or chromatographic data), loadings are not represented in the same way. In this case, users usually represent the loadings of each principal component in a graph with the values of the loadings of component PCi on the Y-axis and the scale corresponding to the experimental unit on the X-axis. Thus, the loadings look like a spectrum or a chromatogram (see § B. Research example: Application of PCA on 1H-NMR spectra to study the thermal stability of edible oils). Figure 2 and Figure 3 provide an example of scores and loadings plots extracted from a physicochemical and sensorial characterization study of Italian beef [*20*] through the application of principal component analysis. The goal of this work was to discriminate between the ethnic groups of animals (hypertrophied Piemontese, HP; normal Piemontese, NP; Friesian, F; crossbred hypertrophied Piemontese x Friesian, HPxF; Belgian Blue and White, BBW). These graphs are useful for determining the likely reasons for the formation of the groups of objects that are visualized, i.e. the weight (or importance) of certain variables in the positioning of the objects on the plane formed by two principal components. Indeed, objects placed, for example, on the right of the score-plot will have high values for the variables placed on the right of the loadings plot, while variables near the origin of the axes make a small contribution to the discrimination of the objects. As presented by the authors of this work [*20*], one can see on the loadings plot that PC1 is characterized mainly by eating quality, one chemical parameter and two physical parameters (ease of sinking, Te; overall acceptability, Oa; initial juiciness, Ji; sustained juiciness, Js; friability, Tf; residue, Tr). These variables are located far from the origin of the first PC, to the right in the loadings plot, and close together, which means, therefore, that they are positively correlated. On the other hand, PC2 is mainly characterized by two chemical (hydroxyproline, Hy; and ether extract, E) and two physical (hue, H; and lightness, L) parameters. These variables, located on the left of the loadings plot, are positively correlated. The interpretation of the score-plot indicates an arrangement of the samples into two groups: the first one includes the meats of hypertrophied animals (HP), while the second one includes the meats of the normal Piemontese (NP) and the Friesian (F). Without repeating all of the interpretation reported by the authors, the combined reading of scores and loadings shows, for example, that the meat samples HP and BBW have, in general, a higher protein content as well as good eating qualities and lightness. At the opposite end, the meat samples F and NP are characterized more by their hydroxyproline content, their ether extract or else their Warner-Bratzler shear value. This interpretation may be made with the rest of the parameters studied and contributes to a better understanding of the influence of these parameters on the features of the product studied. This is a qualitative approach to comparing samples on the basis of a set of experimental measurements.


**Figure 2.** Score-plot obtained by PCA applied on meat samples. Extracted from [*20*].


Plot of the first two PC loading vectors. Water (W); protein (P); ether extract (E); hydroxyproline (Hy); collagen solubility (Cs); lightness (L); hue (H); drip losses (Dl); cooking losses (Cl); Warner-Bratzler shear (WB); appearance (A); ease of sinking (Te); friability (Tf); residue (Tr); initial juiciness (Ji); sustained juiciness (Js); overall acceptability (Oa).

**Figure 3.** Loadings-plot obtained by PCA applied on meat samples. Extracted from [*20*].

#### **3.3. Algorithms**

There are different ways to carry out PCA, depending on whether one uses an iterative algorithm such as NIPALS (Non-linear Iterative Partial Least Squares) or a matrix factorization algorithm like SVD (Singular Value Decomposition). There are many variants of the SVD algorithm; the most well-known is probably the Golub-Reinsch algorithm [GR-SVD] [*21, 22*]. Both types of algorithms are well-suited for computing the eigenvalues and eigenvectors of a matrix *M* and determining, ultimately, the new axes of representation.

#### *3.3.1. NIPALS*

Let us start with the description of NIPALS. The most common version is given below, with the following notation and initialization:

- *M*: the data matrix, preferably mean-centred;
- *E*(0) = *M*: the *E*-matrix equals the mean-centred *M* at the beginning;
- *t*: scores for PCi; initialization step: this vector is set to a column in *M*;
- *p*: loadings for PCi;
- *threshold* = 10⁻⁴: convergence criterion, a constant in the procedure.


#### **Beginning of the main loop** (i=1 to Nb of PCs):

1. Project *E*(i-1) onto *t* to calculate the corresponding loading *p*:

$$p = \frac{E_{(i-1)}^{T}\, t}{t^{T} t}$$

2. Normalise loading vector p to length 1

$$p = \frac{p}{\sqrt{p^{T} p}}$$

3. Project *E*(i-1) onto *p* to calculate the corresponding score vector *t*:

$$t = \frac{E_{(i-1)}\, p}{p^{T} p}$$

4. Check for convergence.

IF $\left| \tau_{\text{new}} - \tau_{\text{old}} \right| > threshold \times \tau_{\text{new}}$ THEN return to step 1,

with $\tau_{\text{new}} = t^{T} t$ from the current iteration and $\tau_{\text{old}} = t^{T} t$ from the previous iteration.

5. Deflating process: remove the estimated PC component from E(i-1):

$$E_{(i)} = E_{(i-1)} - t\, p^{T}$$

#### **End of the main loop**
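Gathering the five steps, a compact runnable version of this loop might look as follows (numpy; the choice of the starting column and the iteration cap are implementation details not fixed by the pseudo-code above):

```python
import numpy as np

def nipals(M, n_components, threshold=1e-4, max_iter=500):
    """NIPALS PCA following the steps above; M should be mean-centred."""
    E = M.copy()
    n, p = E.shape
    T = np.zeros((n, n_components))   # scores
    P = np.zeros((p, n_components))   # loadings
    for i in range(n_components):
        # start from a column of E (here the one with largest variance, a common choice)
        t = E[:, np.argmax(E.var(axis=0))].copy()
        tau_old = t @ t
        for _ in range(max_iter):
            pv = E.T @ t / (t @ t)        # 1. project E onto t -> loading
            pv /= np.sqrt(pv @ pv)        # 2. normalise loading to length 1
            t = E @ pv / (pv @ pv)        # 3. project E onto p -> score
            tau_new = t @ t
            if abs(tau_new - tau_old) <= threshold * tau_new:
                break                     # 4. converged
            tau_old = tau_new
        E = E - np.outer(t, pv)           # 5. deflate: remove this component
        T[:, i], P[:, i] = t, pv
    return T, P

# Usage: scores and loadings of the two first PCs of a mean-centred matrix
X = np.random.default_rng(4).normal(size=(20, 6))
T, P = nipals(X - X.mean(axis=0), n_components=2)
print(T.shape, P.shape)
```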

#### *3.3.2. SVD*


SVD is based on a theorem from linear algebra which says that a rectangular matrix *M* can be decomposed into the product of three new matrices:


Usually the theorem is written as follows:

$$\underset{(m \times n)}{M} = \underset{(m \times m)}{U}\; \underset{(m \times n)}{S}\; \underset{(n \times n)}{V^{T}}$$

where *U*T*U* = *I*; *V*T*V* = *I*; the columns of *U* are orthonormal eigenvectors of *MM*T; the columns of *V* are orthonormal eigenvectors of *M*T*M*; and *S* is a diagonal matrix containing the square roots of the eigenvalues from *U* or *V*, in decreasing order, called singular values. The singular values are always real numbers. If the matrix *M* is a real matrix, then *U* and *V* are also real. The SVD represents an expression of the original data in a coordinate system where the covariance matrix is diagonal. Calculating the SVD consists of finding the eigenvalues and eigenvectors of *MM*T and *M*T*M*: the eigenvectors of *M*T*M* produce the columns of *V*, and the eigenvectors of *MM*T give the columns of *U*.


Presented below is the pseudo-code of the SVD algorithm:

1. Compute *M*T and *M*T*M* (or *MM*T).
2. Compute the eigenvalues of *M*T*M* and sort them in descending order (their square roots form the diagonal of *S*), by solving:

$$\left| M^{T} M - \lambda I \right| = 0$$

3. Compute the eigenvectors of *M*T*M*, which form the columns of *V*.
4. Compute *U* as *U* = *MVS*-1 (\*\*) and compute the true scores *T* as *T* = *US*.
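Translated into numpy, the pseudo-code might read as follows (a sketch assuming a mean-centred, full-column-rank *M*; in practice one would call a library SVD routine directly):

```python
import numpy as np

def svd_pca(M):
    """Eigenanalysis route to the SVD, following the pseudo-code above."""
    G = M.T @ M                             # Gram matrix (p x p)
    eigvals, V = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in descending order
    eigvals, V = eigvals[order], V[:, order]
    S = np.sqrt(np.clip(eigvals, 0, None))  # singular values
    U = M @ V / S                           # U = M V S^-1 (S diagonal)
    T = U * S                               # true scores T = U S
    return U, S, V, T

M = np.random.default_rng(5).normal(size=(8, 4))
Mc = M - M.mean(axis=0)
U, S, V, T = svd_pca(Mc)
print(np.allclose(Mc, U * S @ V.T))         # M = U S V^T
```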

#### *3.3.3. NIPALS versus SVD*

The relationship between the matrices obtained by NIPALS and SVD is given by the following:

$$T\, P^{T} = U\, S\, V^{T}$$

With the product *US* corresponding to the scores matrix *T*, and *V* to the loadings matrix *P*. Note that, *S* being a diagonal matrix, its dimensions are the same as those of *M*.
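A quick numerical check of this correspondence (numpy, random data):

```python
import numpy as np

X = np.random.default_rng(6).normal(size=(15, 5))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T, P = U * s, Vt.T                  # T = US (scores), P = V (loadings)
print(np.allclose(Xc, T @ P.T))     # T P^T = U S V^T reconstructs X
```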

#### **4. Practical examples**

As the pedagogical example announced above, let us apply the covariance matrix method step by step to a small fluorescence data set, in which each sample is measured at four wavelengths (Table 2).

| Sample # | 420 nm | 474 nm | 520 nm | 570 nm |
|---|---|---|---|---|
| 1 | 15.4 | 11.4 | 6.1 | 3.6 |
| 2 | 16.0 | 12.1 | 6.1 | 3.4 |
| 3 | 14.0 | 13.1 | 7.3 | 4.2 |
| 4 | 16.7 | 10.4 | 5.9 | 3.1 |
| 5 | 17.1 | 12.1 | 7.1 | 2.9 |
| 6 | 81.9 | 71.3 | 42.8 | 32.9 |
| 7 | 94.9 | 85.0 | 50.0 | 39.0 |
| 8 | 69.9 | 75.7 | 46.3 | 50.4 |
| 9 | 70.4 | 70.5 | 42.6 | 42.1 |
| 10 | 61.8 | 81.9 | 50.2 | 66.2 |
| Mean | 45.81 | 44.35 | 26.44 | 24.78 |

**Table 2.** Fluorescence intensities for four wavelengths measured on ten samples.

**Step 1.** Centring the data matrix by the average

This step consists of subtracting, from the intensity values of each column, the average of the said column. In other words, for each wavelength one computes the mean over all samples for this wavelength and subtracts this value from the fluorescence intensity of each sample at the same wavelength.

**Step 2.** Calculation of the variance-covariance matrix

The variance-covariance matrix (Table 3) is calculated according to the *X*T*X* product, namely the Gram matrix, of the centred data. The diagonal of this matrix consists of the variances; the trace of the matrix (sum of diagonal elements) corresponds to the total variance (3294.5) of the original data matrix. This matrix shows, for example, that the covariance for the fluorescence intensities at 420 and 520 nm is equal to 665.9.

| nm | 420 | 474 | 520 | 570 |
|---|---|---|---|---|
| **420** | **1072.2** | 1092.9 | 665.9 | 654.5 |
| **474** | 1092.9 | **1194.2** | 731.5 | 786.6 |
| **520** | 665.9 | 731.5 | **448.4** | 485.5 |
| **570** | 654.5 | 786.6 | 485.5 | **579.8** |

**Table 3.** Variance-covariance matrix

As mentioned above, the matrix also contains, on its diagonal, the variances of the fluorescence intensities at each wavelength. For example, for the fluorescence intensities at 474 nm, the variance is 1194.2.

The technique for calculating the eigenvectors of the variance-covariance matrix, and so the principal components, is called eigenvalue analysis (*eigenanalysis*).
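These two steps, and the eigenanalysis just announced, can be verified with a few lines of numpy; the values are those of Table 2, and the printed matrix reproduces Table 3 up to rounding of the input values:

```python
import numpy as np

# Fluorescence intensities of Table 2 (rows: samples; columns: 420, 474, 520, 570 nm)
X = np.array([[15.4, 11.4,  6.1,  3.6],
              [16.0, 12.1,  6.1,  3.4],
              [14.0, 13.1,  7.3,  4.2],
              [16.7, 10.4,  5.9,  3.1],
              [17.1, 12.1,  7.1,  2.9],
              [81.9, 71.3, 42.8, 32.9],
              [94.9, 85.0, 50.0, 39.0],
              [69.9, 75.7, 46.3, 50.4],
              [70.4, 70.5, 42.6, 42.1],
              [61.8, 81.9, 50.2, 66.2]])

# Step 1: centre each column on its mean
Xc = X - X.mean(axis=0)

# Step 2: variance-covariance matrix (X^T X of the centred data, over n - 1)
C = Xc.T @ Xc / (X.shape[0] - 1)
print(C.round(1))                  # reproduces Table 3 (e.g. 1072.2 at 420 nm)
print(np.trace(C).round(1))        # total variance, about 3294.5

# Eigenanalysis: the eigenvectors of C are the principal components
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
print(eigvals[order].round(1))     # variances carried by the PCs, in decreasing order
print(eigvecs[:, order].round(3))  # loadings of the principal components
```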
