**4. The mathematics of PCA: An eigenvalue problem**

Now that we have built some intuition for PCA, we present the mathematics behind the method by considering the general case. More details on technical aspects can be found in Cooley & Lohnes (1971), Stevens (1986), Lebart, Morineau & Piron (1995), Cadima & Jolliffe (1995), Hyvarinen, Karhunen & Oja (2001), and Jolliffe (2002).

Consider a dataset consisting of *p* variables observed on *n* subjects. The variables are denoted by $x_1, x_2, \ldots, x_p$. In general, the data are arranged in a table with the rows representing the subjects (individuals) and the columns the variables, so the dataset can also be viewed as an $n \times p$ rectangular matrix $\mathbf{X}$. Note that the variables are such that their means make sense.

Throughout, we assume that the data have been centered and standardized. Graphically, this implies that the centroid, or center of gravity, of the whole dataset is at the origin. In this case the PCA is called normalized principal component analysis, and it is based on the correlation matrix (not on the variance-covariance matrix). The variables then lie on the unit sphere; their projection onto the subspace spanned by the principal components is the "correlation circle". Standardization allows the use of variables which are not measured in the same units (e.g. temperature, weight, distance, size, etc.). Also, as we will see later, working with standardized data makes interpretation easier.
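As a concrete illustration of this standardization step, here is a minimal NumPy sketch (the data matrix `X_raw`, its dimensions, and the units are invented for the example): after centering each variable and dividing by its standard deviation, the matrix $\frac{1}{n}\mathbf{X}'\mathbf{X}$ used throughout this section is exactly the correlation matrix of the original variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data: n = 50 subjects (rows), p = 4 variables (columns),
# measured in different units; the numbers are invented for the example.
n, p = 50, 4
X_raw = rng.normal(loc=[10.0, 200.0, 0.5, 75.0],
                   scale=[2.0, 30.0, 0.1, 12.0], size=(n, p))

# Center each variable and divide by its standard deviation.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# With standardized columns, (1/n) X'X is the correlation matrix of the raw variables.
C = X.T @ X / n
print(np.allclose(C, np.corrcoef(X_raw, rowvar=False)))  # True
```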

We can represent these data in two graphs: on the one hand, a subject graph in which we look for similarities or differences between subjects; on the other, a variable graph in which we look for correlations between variables. The subject graph lives in a *p*-dimensional space, i.e. in $\mathbb{R}^p$, while the variable graph lives in an *n*-dimensional space, i.e. in $\mathbb{R}^n$. We therefore have two clouds of points in high-dimensional spaces, too large for us to plot and inspect directly: we cannot see beyond a three-dimensional space! PCA gives us a subspace of reasonable dimension such that the projection onto this subspace retains "as much as possible" of the information present in the dataset, i.e., such that the projected clouds of points are as "dispersed" as possible. In other words, the goal of PCA is to compute another basis that best re-expresses the dataset. The hope is that this new basis will filter out the noise and reveal hidden structure.

$$\mathbf{x}_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{pi} \end{pmatrix} \quad\rightarrow\quad \text{reduce dimensionality} \quad\rightarrow\quad \mathbf{z}_i = \begin{pmatrix} z_{1i} \\ z_{2i} \\ \vdots \\ z_{qi} \end{pmatrix} \quad \text{with } q < p \tag{13}$$

Dimensionality reduction implies information loss. How can we represent the data in a lower-dimensional form without losing too much information? Preserving as much information as possible is the objective of the mathematics behind the PCA procedure.
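Eq. (13) is simply a linear map from $\mathbb{R}^p$ to $\mathbb{R}^q$. The following minimal sketch (with an arbitrary, made-up orthonormal basis standing in for the principal directions that PCA will choose below) shows that "reducing dimensionality" operationally means multiplying each observation by a $p \times q$ matrix with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, q = 100, 6, 2          # illustrative sizes: from 6 variables down to 2
X = rng.normal(size=(n, p))  # stand-in for a standardized data matrix

# Any p x q matrix U with orthonormal columns maps each x_i in R^p to z_i = U'x_i in R^q.
# Here U comes from the QR factorization of a random matrix; the job of PCA,
# developed below, is to choose the *best* such U.
U, _ = np.linalg.qr(rng.normal(size=(p, q)))

Z = X @ U                      # rows of Z are the reduced coordinates z_i
print(X.shape, "->", Z.shape)  # (100, 6) -> (100, 2)
```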

We first of all assume that we want to project the data points onto a one-dimensional subspace. The principal component corresponding to this axis is a linear combination of the original variables and can be expressed as follows:


$$z_1 = u_{11}x_1 + u_{12}x_2 + \ldots + u_{1p}x_p = \mathbf{X} u_1 \tag{14}$$

where $u_1 = (u_{11}, u_{12}, \ldots, u_{1p})'$ is a column vector of weights. The principal component $z_1$ is determined such that the overall variance of the resulting points is as large as possible. Of course, one could make the variance of $z_1$ arbitrarily large by choosing large values for the weights $u_{11}, u_{12}, \ldots, u_{1p}$. To prevent this, the weights are calculated under the constraint that their sum of squares is one, that is, $u_1$ is a unit vector:

$$u_{11}^2 + u_{12}^2 + \ldots + u_{1p}^2 = \left\| u_1 \right\|^2 = 1 \tag{15}$$

Eq. (14) also gives the projections of the $n$ subjects onto the first component. PCA finds $u_1$ so that

$$\operatorname{Var}(z_1) = \frac{1}{n} \sum_{i=1}^{n} z_{1i}^2 = \frac{1}{n} \left\| z_1 \right\|^2 = \frac{1}{n}\, u_1' \mathbf{X}'\mathbf{X}\, u_1 \text{ is maximal} \tag{16}$$

The matrix $C = \frac{1}{n}\mathbf{X}'\mathbf{X}$ is the correlation matrix of the variables. The optimization problem is:

$$\max_{u_1,\ \|u_1\|^2 = 1}\; u_1' C u_1 \tag{17}$$

This program means that we search for a unit vector $u_1$ so as to maximize the variance of the projection on the first component. The technique for solving such constrained optimization problems involves the construction of a Lagrangian function:

$$\mathcal{L}_1 = u_1' C u_1 - \lambda_1\left(u_1' u_1 - 1\right) \tag{18}$$

Taking the partial derivative $\partial \mathcal{L}_1 / \partial u_1 = 2Cu_1 - 2\lambda_1 u_1$ and solving the equation $\partial \mathcal{L}_1 / \partial u_1 = 0$ yields:

$$C u_1 = \lambda_1 u_1 \tag{19}$$

By premultiplying each side of this condition by $u_1'$ and using the condition $u_1' u_1 = 1$ we get:

$$u_1' C u_1 = \lambda_1 u_1' u_1 = \lambda_1 \tag{20}$$

It is known from matrix algebra that the parameters $\lambda_1$ and $u_1$ that satisfy conditions (19) and (20) are the maximum eigenvalue of the correlation matrix $C$ and the corresponding eigenvector. Thus the optimal coefficients of the original variables generating the first principal component $z_1$ are the elements of the eigenvector corresponding to the largest eigenvalue of the correlation matrix. These elements are also known as loadings.
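As a quick numerical check of this result, here is a minimal NumPy sketch (the data are simulated for illustration, not taken from the chapter's example): the loadings $u_1$ are read off the eigendecomposition of the correlation matrix, and the variance of the resulting component scores equals the largest eigenvalue $\lambda_1$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: n subjects, p correlated variables (centered and standardized).
n, p = 200, 5
latent = rng.normal(size=(n, 1))
X = latent + 0.6 * rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)

C = X.T @ X / n                             # correlation matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
lam1 = eigvals[-1]                          # largest eigenvalue
u1 = eigvecs[:, -1]                         # corresponding unit eigenvector: the loadings

z1 = X @ u1                                 # scores on the first principal component
print(np.isclose(np.linalg.norm(u1), 1.0))  # the constraint of Eq. (15) -> True
print(np.isclose(z1.var(), lam1))           # Var(z1) = lambda_1, as Eqs. (16) and (20) imply -> True
```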

The second principal component is calculated in the same way, with the condition that it is uncorrelated with (orthogonal to) the first principal component and that it accounts for the largest part of the remaining variance:



$$z_2 = u_{21}x_1 + u_{22}x_2 + \ldots + u_{2p}x_p = \mathbf{X} u_2 \tag{21}$$

where $u_2 = (u_{21}, u_{22}, \ldots, u_{2p})'$ is the direction of the component. This axis is constrained to be orthogonal to the first one. Thus, the second component is subject to the constraints:

$$u_{21}^2 + u_{22}^2 + \ldots + u_{2p}^2 = \left\| u_2 \right\|^2 = 1, \qquad u_1' u_2 = 0 \tag{22}$$

The optimization problem is therefore:

$$\max_{u_2,\ \|u_2\|^2 = 1,\ u_1' u_2 = 0}\; u_2' C u_2 \tag{23}$$

Using the technique of the Lagrangian function, the following conditions:

$$C u_2 = \lambda_2 u_2 \tag{24}$$

$$u_2' C u_2 = \lambda_2 \tag{25}$$

are obtained again. So once more the second direction turns out to be the eigenvector corresponding to the second largest eigenvalue of the correlation matrix.

Using induction, it can be proven that PCA is a procedure of eigenvalue decomposition of the correlation matrix. The coefficients generating the linear combinations that transform the original variables into uncorrelated variables are the eigenvectors of the correlation matrix. This is good news, because finding eigenvectors can be done rapidly with many statistical packages (SAS, Stata, R, SPSS, SPAD…), and because eigenvectors have many nice mathematical properties. Note that rather than maximizing variance, it might sound more plausible to look for the projection with the smallest average (mean-squared) distance between the original points and their projections on the principal components. This turns out to be equivalent to maximizing the variance (Pythagorean theorem).
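The equivalence mentioned at the end of this paragraph is easy to verify numerically. In the sketch below (again with simulated data), for any unit vector $u$ the total variance splits, by the Pythagorean theorem, into the variance of the projections plus the mean squared distance of the points to their projections; the direction that maximizes the former therefore minimizes the latter.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 300, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # simulated correlated cloud
X = X - X.mean(axis=0)                                  # centered data

total_var = np.mean(np.sum(X**2, axis=1))               # (1/n) * sum_i ||x_i||^2

def variance_and_error(u):
    """Variance of the projections on u and mean squared residual distance."""
    u = u / np.linalg.norm(u)
    z = X @ u                                           # projections on the direction u
    residuals = X - np.outer(z, u)                      # points minus their projections
    return np.mean(z**2), np.mean(np.sum(residuals**2, axis=1))

# Pythagoras: projected variance + residual error = total variance, for any direction.
for u in rng.normal(size=(5, p)):
    v, e = variance_and_error(u)
    assert np.isclose(v + e, total_var)

# The top eigenvector maximizes the projected variance, hence minimizes the error.
C = X.T @ X / n
u1 = np.linalg.eigh(C)[1][:, -1]
print(variance_and_error(u1))
```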

An interesting property of the principal components is that they are all uncorrelated (orthogonal) to one another. This is because the matrix $C$ is a real symmetric matrix, so linear algebra tells us that it is diagonalizable and that its eigenvectors are orthogonal to one another. Moreover, because $C$ is a covariance matrix, it is positive semi-definite in the sense that $u' C u \geq 0$ for any vector $u$; this tells us that the eigenvalues of $C$ are all non-negative. Collecting the components into $z = (z_1, z_2, \ldots, z_p)$, their covariance matrix is therefore diagonal:

$$\operatorname{var}(z) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & \lambda_p \end{bmatrix} \tag{26}$$
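A short numerical illustration of Eq. (26), assuming a simulated standardized data matrix: projecting the data on all the eigenvectors at once gives component scores whose covariance matrix is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # simulated data
X = (X - X.mean(axis=0)) / X.std(axis=0)               # standardized

C = X.T @ X / n
eigvals, U = np.linalg.eigh(C)        # columns of U are the eigenvectors u_1, ..., u_p

Z = X @ U                             # all principal component scores at once
var_z = Z.T @ Z / n                   # covariance matrix of the scores

# var(z) is diagonal and its diagonal holds the eigenvalues, as in Eq. (26).
print(np.allclose(var_z, np.diag(eigvals)))  # True
```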

The eigenvectors are the "preferential directions" of the dataset. The principal components are derived in decreasing order of importance and have a variance equal to their corresponding eigenvalue. The first principal component is the direction along which the data have the most variance. The second principal component is the direction, orthogonal to the first, with the most variance, and so on. It is clear that all components together explain 100% of the variability in the data. This is why we say that PCA works like a change of basis: analyzing the original data in the canonical space yields the same results as examining them in the component space. However, PCA allows us to obtain a linear projection of our data, originally in $\mathbb{R}^p$, onto $\mathbb{R}^q$, where $q < p$. The variance of the projections onto the first $q$ principal components is the sum of the eigenvalues corresponding to these components. If the data fall near a $q$-dimensional subspace, then $p - q$ of the eigenvalues will be nearly zero.
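The last two statements can be made concrete with a short sketch. The data below are simulated to lie near a two-dimensional subspace of $\mathbb{R}^5$, and the 90% threshold used to pick $q$ is only an illustrative choice, not a rule prescribed by this chapter (criteria for choosing the number of components are discussed in the next section).

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data lying close to a 2-dimensional subspace of R^5.
n, p, true_q = 400, 5, 2
X = rng.normal(size=(n, true_q)) @ rng.normal(size=(true_q, p))
X = X + 0.05 * rng.normal(size=(n, p))                 # small noise
X = (X - X.mean(axis=0)) / X.std(axis=0)

C = X.T @ X / n
eigvals = np.linalg.eigvalsh(C)[::-1]                  # sorted from largest to smallest

explained = eigvals / eigvals.sum()                    # proportion of variance per component
cumulative = np.cumsum(explained)
print(np.round(eigvals, 3))                            # the last p - true_q values are nearly zero
print(np.round(cumulative, 3))

# Illustrative choice: keep enough components to retain 90% of the variance.
q = int(np.searchsorted(cumulative, 0.90) + 1)
print("components retained:", q)                       # close to true_q for data like these
```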

#### **Summarizing the computational steps of PCA**

Suppose $x_1, x_2, \ldots, x_n$ are $p \times 1$ vectors collected from $n$ subjects. The computational steps that need to be accomplished in order to obtain the results of PCA are the following:

**Step 1.** Compute the mean: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

**Step 2.** Standardize the data: $\Phi_i = x_i - \bar{x}$, each component being divided by the standard deviation of the corresponding variable.

**Step 3.** Form the matrix $A = [\Phi_1, \Phi_2, \ldots, \Phi_n]$ (a $p \times n$ matrix), then compute:

$$C = \frac{1}{n} \sum_{i=1}^{n} \Phi_i\, \Phi_i'$$

**Step 4.** Compute the eigenvalues of $C$: $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$

**Step 5.** Compute the eigenvectors of $C$: $u_1, u_2, \ldots, u_p$

**Step 6.** Proceed to the linear transformation $\mathbb{R}^p \rightarrow \mathbb{R}^q$ that performs the dimensionality reduction.
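These steps translate almost line by line into NumPy. The sketch below is a generic illustration, not code from any particular statistical package: the data matrix is simulated, and names such as `Phi` and `A` simply mirror the notation of the steps.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, q = 150, 5, 2
data = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # simulated: n subjects, p variables

# Step 1. Compute the mean vector.
x_bar = data.mean(axis=0)

# Step 2. Standardize the data (center, then divide by the standard deviations).
Phi = (data - x_bar) / data.std(axis=0)

# Step 3. Form A = [Phi_1, ..., Phi_n] (p x n) and compute C = (1/n) A A'.
A = Phi.T
C = A @ A.T / n                         # correlation matrix of the variables

# Steps 4 and 5. Eigenvalues and eigenvectors of C, sorted in decreasing order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 6. Linear transformation R^p -> R^q: project on the first q eigenvectors.
Z = Phi @ eigvecs[:, :q]                # n x q matrix of principal component scores

print(Z.shape)                          # (150, 2)
print(np.round(eigvals, 3))             # lambda_1 >= lambda_2 >= ... >= lambda_p
```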


Notice that, in this analysis, we gave the same weight to each subject. We could have given more weight to some subjects, to reflect their representativity in the population.

**5. Criteria for determining the number of meaningful components to retain**

In principal component analysis the number of components extracted is equal to the number of variables being analyzed (under the general condition $n \geq p$). This means that an analysis of our 5 variables would actually result in 5 components, not two. However, since PCA aims at reducing dimensionality, only the first few components will be important enough to be retained for interpretation and used to present the data. It is therefore reasonable to wonder how many independent components are necessary to best describe the data.

Eigenvalues can be thought of as a quantitative assessment of how much a component represents the data: the higher the eigenvalue of a component, the more representative it is of the data. Eigenvalues are therefore used to determine the meaningfulness of components. Table 3 provides the eigenvalues from the PCA applied to our dataset. In the column headed "Eigenvalue", the eigenvalue for each component is presented. Each row in the table presents information about one of the 5 components: row "1" provides information about the first component (PCA1) extracted, row "2" provides information about the second component (PCA2) extracted, and so forth. Eigenvalues are ranked from the highest to the lowest.
