We think that a good understanding of this chapter will facilitate that of the following chapters and of the novel extensions of PCA proposed in this book (sparse PCA, Kernel PCA, Multilinear PCA, …).

**2. The basic prerequisite – Variance and correlation**

PCA is useful when you have data on a large number of quantitative variables and wish to collapse them into a smaller number of artificial variables that will account for most of the variance in the data. The method is mainly concerned with identifying variances and correlations in the data. Let us focus our attention on the meaning of these concepts. Consider the dataset given in Table 1. This dataset will serve to illustrate how PCA works in practice.


| ID | X1 | X2 | X3 | X4 | X5 |
|----|------|------|-----|------|------|
| 1 | 24 | 21.5 | 5 | 2 | 14 |
| 2 | 16.7 | 21.4 | 6 | 2.5 | 17 |
| 3 | 16.78 | 23 | 7 | 2.2 | 15 |
| 4 | 17.6 | 22 | 8.7 | 3 | 20 |
| 5 | 22 | 25.7 | 6.4 | 2 | 14.2 |
| 6 | 15.3 | 16 | 8.7 | 2.21 | 15.3 |
| 7 | 10.2 | 19 | 4.3 | 2.2 | 15.3 |
| 8 | 11.9 | 17.1 | 4.5 | 2 | 14 |
| 9 | 14.3 | 19.1 | 6 | 2.2 | 15 |
| 10 | 8.7 | 14.3 | 4.1 | 2.24 | 15.5 |
| 11 | 6.7 | 10 | 3.8 | 2.23 | 16 |
| 12 | 7.1 | 13 | 2.8 | 2.01 | 12 |
| 13 | 10.3 | 16 | 4 | 2 | 14.5 |
| 14 | 7.1 | 13 | 3.9 | 2.4 | 16.4 |
| 15 | 7.9 | 13.6 | 4 | 3.1 | 20.2 |
| 16 | 3 | 8 | 3.4 | 2.1 | 14.7 |
| 17 | 3 | 9 | 3.3 | 3 | 20.2 |
| 18 | 1 | 7.5 | 3 | 2 | 14 |
| 19 | 0.8 | 7 | 2.8 | 2 | 15.8 |
| 20 | 1 | 4 | 3.1 | 2.2 | 15.3 |

Table 1. **Example dataset, 5 variables obtained for 20 observations.**

The variance of a given variable *x* is defined as the average of the squared differences from the mean:

$$
\sigma_x^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( x_{i} - \overline{x} \right)^{2} \tag{1}
$$

The square root of the variance is the standard deviation and is symbolized by the small Greek sigma, $\sigma_x$. It is a measure of how spread out the values are.
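For illustration, Eq. (1) can be checked numerically; the following is a minimal sketch, assuming Python with NumPy (not part of the original analysis), applied to the X1 column of Table 1. Note the population form of the variance, dividing by *n* rather than *n* − 1:

```python
import numpy as np

# X1 values for the 20 observations in Table 1
x1 = np.array([24, 16.7, 16.78, 17.6, 22, 15.3, 10.2, 11.9, 14.3, 8.7,
               6.7, 7.1, 10.3, 7.1, 7.9, 3, 3, 1, 0.8, 1])

variance = np.mean((x1 - x1.mean()) ** 2)  # Eq. (1); same as x1.var(ddof=0)
sigma = np.sqrt(variance)                  # standard deviation, sigma_x
print(variance, sigma)
```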

The variance and the standard deviation are important in data analysis because of their relationships to correlation and the normal curve. Correlation between a pair of variables measures the extent to which their values co-vary; the term immediately calls to mind the related notion of covariance. There are numerous models for describing how two variables change together, such as linear, exponential, and others. PCA uses the linear correlation. The linear correlation coefficient for two variables *x* and *y* is given by:

$$\rho(x, y) = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sigma_x \sigma_y} \tag{2}$$

where $\sigma_x$ and $\sigma_y$ denote the standard deviations of *x* and *y*, respectively. This is the most widely used type of correlation coefficient in statistics and is also called the Pearson correlation or product-moment correlation. Correlation coefficients lie between -1.00 and +1.00. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation. Correlation coefficients are used to assess the degree of collinearity or redundancy among variables. Notice that the value of the correlation coefficient does not depend on the specific measurement units used.

When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix. For the five variables in Table 1, we obtain the results reported in Table 2.


Table 2. **Correlations among variables**

In this table, the value at the intersection of a given row and column shows the correlation between the two corresponding variables. For example, the correlation between variables X1 and X2 is 0.94.
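A correlation matrix of this kind can be computed directly; the sketch below, again assuming Python with NumPy, enters the Table 1 data and should reproduce the entries of Table 2 up to rounding (e.g. the 0.94 between X1 and X2):

```python
import numpy as np

# The 20 x 5 data matrix from Table 1 (columns X1..X5)
data = np.array([
    [24, 21.5, 5, 2, 14],       [16.7, 21.4, 6, 2.5, 17],
    [16.78, 23, 7, 2.2, 15],    [17.6, 22, 8.7, 3, 20],
    [22, 25.7, 6.4, 2, 14.2],   [15.3, 16, 8.7, 2.21, 15.3],
    [10.2, 19, 4.3, 2.2, 15.3], [11.9, 17.1, 4.5, 2, 14],
    [14.3, 19.1, 6, 2.2, 15],   [8.7, 14.3, 4.1, 2.24, 15.5],
    [6.7, 10, 3.8, 2.23, 16],   [7.1, 13, 2.8, 2.01, 12],
    [10.3, 16, 4, 2, 14.5],     [7.1, 13, 3.9, 2.4, 16.4],
    [7.9, 13.6, 4, 3.1, 20.2],  [3, 8, 3.4, 2.1, 14.7],
    [3, 9, 3.3, 3, 20.2],       [1, 7.5, 3, 2, 14],
    [0.8, 7, 2.8, 2, 15.8],     [1, 4, 3.1, 2.2, 15.3],
])

# Pairwise Pearson correlations, Eq. (2); rowvar=False treats columns as variables
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 2))
```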

As can be seen from the correlations, the five variables seem to hang together in two distinct groups. First, notice that variables X1, X2 and X3 show relatively strong correlations with one another. This could be because they are measuring the same "thing". In the same way, variables X4 and X5 correlate strongly with each other, a possible indication that they measure the same "thing" as well. Notice that these two variables show very weak correlations with the remaining variables.

Given that the 5 variables contain some "redundant" information, it is likely that they are not really measuring five different independent constructs, but rather two underlying constructs or factors. What are these factors? To what extent does each variable measure each of them? The purpose of PCA is to provide answers to these questions. Before presenting the mathematics of the method, let us see how PCA works with the data in Table 1.


In linear PCA, each of the two artificial variables is computed as a linear combination of the original variables.

$$Z = \alpha\_1 X\_1 + \alpha\_2 X\_2 + \dots + \alpha\_5 X\_5 \tag{3}$$

where $\alpha_j$ is the weight for variable *j* in creating the component Z. The value of Z for a subject represents the subject's score on the principal component.

Using our dataset, we have:

$$Z\_1 = 0.579X\_1 + 0.577X\_2 + 0.554X\_3 + 0.126X\_4 + 0.098X\_5 \tag{4}$$

$$Z\_2 = -0.172X\_1 - 0.14X\_2 + 0.046X\_3 + 0.685X\_4 + 0.693X\_5 \tag{5}$$

Notice that different coefficients were assigned to the original variables in computing subject scores on the two components. X1, X2 and X3 are assigned relatively large weights that range from 0.554 to 0.579, while variables X4 and X5 are assigned very small weights ranging from 0.098 to 0.126. As a result, component Z1 should account for much of the variability in the first three variables. In creating subject scores on the second component, much weight is given to X4 and X5, while little weight is given to X1, X2 and X3. Subject scores on each component are computed by adding together weighted scores on the observed variables. For example, the value of a subject along the first component Z1 is 0.579 times the standardized value of X1 plus 0.577 times the standardized value of X2 plus 0.554 times the standardized value of X3 plus 0.126 times the standardized value of X4 plus 0.098 times the standardized value of X5.
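To make the score computation concrete, here is a short sketch, assuming Python with NumPy, that standardizes the Table 1 columns and applies the weights of Eqs. (4) and (5); the resulting score values are not reported in the text:

```python
import numpy as np

# Table 1 data (20 subjects x 5 variables), repeated for self-containment
data = np.array([
    24,21.5,5,2,14,        16.7,21.4,6,2.5,17,    16.78,23,7,2.2,15,
    17.6,22,8.7,3,20,      22,25.7,6.4,2,14.2,    15.3,16,8.7,2.21,15.3,
    10.2,19,4.3,2.2,15.3,  11.9,17.1,4.5,2,14,    14.3,19.1,6,2.2,15,
    8.7,14.3,4.1,2.24,15.5, 6.7,10,3.8,2.23,16,   7.1,13,2.8,2.01,12,
    10.3,16,4,2,14.5,      7.1,13,3.9,2.4,16.4,   7.9,13.6,4,3.1,20.2,
    3,8,3.4,2.1,14.7,      3,9,3.3,3,20.2,        1,7.5,3,2,14,
    0.8,7,2.8,2,15.8,      1,4,3.1,2.2,15.3,
]).reshape(20, 5)

# Weights from Eqs. (4) and (5)
u1 = np.array([0.579, 0.577, 0.554, 0.126, 0.098])
u2 = np.array([-0.172, -0.14, 0.046, 0.685, 0.693])

z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardized values
Z1 = z @ u1   # subject scores on the first component
Z2 = z @ u2   # subject scores on the second component
print(Z1[0], Z2[0])  # scores of subject 1
```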

At this stage of our analysis, it is reasonable to wonder how the weights in the preceding equations are determined. Are they optimal, in the sense that no other set of weights could produce components that better account for the variance in the dataset? How are principal components computed?

### **3. Heterogeneity and standardization of data**

#### **3.1 Graphs and distances among points**

Our dataset in **Table 1** can be represented in two graphs: one representing the subjects, and the other the variables. In the first, we consider each subject (individual) as a vector whose coordinates are its 5 observed values of the variables; the cloud of points thus belongs to an R<sup>5</sup> space. In the second, each variable is regarded as a vector belonging to an R<sup>20</sup> space.

We can calculate the centroid of the cloud of points, whose coordinates are the 5 means of the variables, that is, $g = (\overline{X}_1, \dots, \overline{X}_5)$. We can also compute the overall variance of the points by summing the variances of the variables:

$$I = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( X_{ij} - \overline{X}_{j} \right)^{2} = \frac{1}{n} \sum_{i=1}^{n} d^{2} \left( s_{i}, g \right) = \sum_{j=1}^{p} \sigma_{j}^{2} \tag{6}$$


This quantity measures how spread out the points are around the centroid. We will need it when determining the principal components.
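A quick numerical check, sketched below in Python with NumPy, confirms that the three expressions in Eq. (6) coincide on any data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # any n x p data matrix works here

g = X.mean(axis=0)                # centroid g: the vector of column means

I1 = ((X - g) ** 2).sum() / len(X)         # double sum over subjects and variables
I2 = np.mean(((X - g) ** 2).sum(axis=1))   # mean squared distance to the centroid
I3 = X.var(axis=0, ddof=0).sum()           # sum of the column variances
print(np.allclose(I1, I2), np.allclose(I2, I3))   # True True
```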

We define the distance between subjects $s_i$ and $s_{i'}$ using the Euclidean distance as follows:

$$d^2(\mathbf{s}_{i}, \mathbf{s}_{i'}) = \left\| \mathbf{s}_i - \mathbf{s}_{i'} \right\|^2 = \sum_{j=1}^{p=5} (X_{ij} - X_{i'j})^2 \tag{7}$$

Two subjects are close to one another when they take similar values for all variables. We can use this distance to measure the overall dispersion of the data around the centroid, or to cluster the points as in classification methods.

#### **3.2 How does PCA work when data are in different units?**

Several problems arise when variables are measured in different units. The first concerns the meaning of the overall variance: how can we sum quantities expressed in different measurement units? The second is that the distances between points can be greatly influenced by the choice of units. To illustrate this point, let us consider the distances between subjects 7, 8 and 9. Applying Eq. (7), we obtain the following results:

$$d^2(\mathbf{s}\_7, \mathbf{s}\_8) = (10.2 - 11.9)^2 + (19 - 17.1)^2 + \dots + (15.3 - 14)^2 = 8.27\tag{8}$$

$$d^2(\mathbf{s}\_7, \mathbf{s}\_9) = (10.2 - 14.3)^2 + (19 - 19.1)^2 + \dots + (15.3 - 15)^2 = 19.8\tag{9}$$

Subject 7 is closer to subject 8 than to subject 9. Multiplying the values of variable X5 by 10 yields:

$$d^2(\mathbf{s}\_7, \mathbf{s}\_8) = (10.2 - 11.9)^2 + (19 - 17.1)^2 + \dots + (153 - 140)^2 = 175.58 \tag{10}$$

$$d^2(\mathbf{s}\_7, \mathbf{s}\_9) = (10.2 - 14.3)^2 + (19 - 19.1)^2 + \dots + (153 - 150)^2 = 28.71\tag{11}$$

Now we observe that subject 7 is closer to subject 9 than to subject 8. It is hard to accept that the measurement units of the variables can so greatly change the results of comparisons among subjects. Indeed, in this way we could make a tall man appear as short as we want!
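The following sketch, in Python with NumPy, reproduces Eqs. (8)-(11) and the rank reversal caused by rescaling X5:

```python
import numpy as np

# Subjects 7, 8 and 9 from Table 1 (variables X1..X5)
s7 = np.array([10.2, 19.0, 4.3, 2.2, 15.3])
s8 = np.array([11.9, 17.1, 4.5, 2.0, 14.0])
s9 = np.array([14.3, 19.1, 6.0, 2.2, 15.0])

def d2(a, b):
    """Squared Euclidean distance, Eq. (7)."""
    return ((a - b) ** 2).sum()

print(d2(s7, s8), d2(s7, s9))   # 8.27 and 19.8: subject 7 is closer to subject 8

scale = np.array([1, 1, 1, 1, 10])   # express X5 in units ten times smaller
print(d2(s7 * scale, s8 * scale), d2(s7 * scale, s9 * scale))
# 175.58 and 28.71: now subject 7 is closer to subject 9
```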

As we have seen, PCA is sensitive to scale: if you multiply one variable by a scalar, you get different results. In particular, the principal components depend on the units used to measure the original variables as well as on the range of values they assume (their variance). This makes comparison very difficult. It is for these reasons that we should *often* standardize the variables prior to using PCA. A common standardization method is to subtract the mean and divide by the standard deviation. This yields the following:

$$X_i^* = \frac{X_i - \overline{X}}{\sigma_X} \tag{12}$$

where $\overline{X}$ and $\sigma_X$ are the mean and standard deviation of X, respectively.

Thus, the new variables all have zero mean and unit standard deviation. Therefore the total variance of the data set equals the number of observed variables being analyzed.
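A minimal sketch of the standardization in Eq. (12), again assuming Python with NumPy, which also checks that the total variance then equals the number of variables:

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation, Eq. (12)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) * [1, 10, 100, 0.1, 5]   # columns in different "units"
Z = standardize(X)
print(Z.mean(axis=0).round(12))       # ~0 for every column
print(Z.var(axis=0, ddof=0).sum())    # -> 5.0, the number of variables p
```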


Throughout the following, we assume that the data have been centered and standardized. Graphically, this implies that the centroid or center of gravity of the whole dataset is at the origin. In this case, the PCA is called normalized principal component analysis, and is based on the correlation matrix (and not on the variance-covariance matrix). The variables lie on the unit sphere; their projection on the subspace spanned by the principal components is the "correlation circle". Standardization allows the use of variables which are not measured in the same units (e.g. temperature, weight, distance, size, etc.). Also, as we will see later, working with standardized data makes interpretation easier.

The first principal component is a linear combination of the (standardized) original variables:

$$z_1 = u_{11} x_1 + u_{12} x_2 + \dots + u_{1p} x_p = X u_1 \tag{14}$$

where $u_1 = (u_{11}, u_{12}, \dots, u_{1p})'$ is a column vector of weights. The principal component $z_1$ is determined such that the overall variance of the resulting points is as large as possible. Of course, one could make the variance of $z_1$ arbitrarily large by choosing large values for the weights $u_{11}, u_{12}, \dots, u_{1p}$. To prevent this, the weights are calculated with the constraint that their sum of squares is one, that is, $u_1$ is a unit vector:

$$u_{11}^2 + u_{12}^2 + \dots + u_{1p}^2 = 1 \tag{15}$$

Eq. (14) also gives the projections of the *n* subjects on the first component. PCA finds $u_1$ so that

$$Var(z_1) = \frac{1}{n} z_1' z_1 = \frac{1}{n} u_1' X' X u_1 \tag{16}$$

is maximal. The matrix $C = \frac{1}{n} X' X$ is the correlation matrix of the variables. The optimization problem is:

$$\max_{u_1} \; u_1' C u_1 \quad \text{subject to} \quad u_1' u_1 = 1 \tag{17}$$

This program means that we search for a unit vector $u_1$ that maximizes the variance of the projection on the first component. The technique for solving such constrained optimization problems involves the construction of a Lagrangian function:

$$L(u_1, \lambda_1) = u_1' C u_1 - \lambda_1 \left( u_1' u_1 - 1 \right) \tag{18}$$

Taking the partial derivative $\partial L / \partial u_1 = 2 C u_1 - 2 \lambda_1 u_1$ and solving the equation $\partial L / \partial u_1 = 0$ yields:

$$C u_1 = \lambda_1 u_1 \tag{19}$$

By premultiplying each side of this condition by $u_1'$ and using the condition $u_1' u_1 = 1$, we get:

$$u_1' C u_1 = \lambda_1 u_1' u_1 = \lambda_1 \tag{20}$$

It is known from matrix algebra that the parameters $\lambda_1$ and $u_1$ satisfying condition (19) are an eigenvalue and a corresponding eigenvector of the correlation matrix C. Since, by (20), the variance to be maximized equals $\lambda_1$, the optimum coefficients of the original variables generating the first principal component $z_1$ are the elements of the eigenvector corresponding to the largest eigenvalue of the correlation matrix. These elements are also known as loadings.

The second principal component is calculated in the same way, with the condition that it is uncorrelated with (orthogonal to) the first principal component and that it accounts for the largest part of the remaining variance.
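To make the eigenvalue route concrete, here is a minimal sketch in Python with NumPy that computes the loadings for the Table 1 data; up to sign and rounding, the leading eigenvector should recover weights close to those reported in Eqs. (4) and (5):

```python
import numpy as np

# Table 1 data (20 subjects x 5 variables), repeated for self-containment
data = np.array([
    24,21.5,5,2,14,        16.7,21.4,6,2.5,17,    16.78,23,7,2.2,15,
    17.6,22,8.7,3,20,      22,25.7,6.4,2,14.2,    15.3,16,8.7,2.21,15.3,
    10.2,19,4.3,2.2,15.3,  11.9,17.1,4.5,2,14,    14.3,19.1,6,2.2,15,
    8.7,14.3,4.1,2.24,15.5, 6.7,10,3.8,2.23,16,   7.1,13,2.8,2.01,12,
    10.3,16,4,2,14.5,      7.1,13,3.9,2.4,16.4,   7.9,13.6,4,3.1,20.2,
    3,8,3.4,2.1,14.7,      3,9,3.3,3,20.2,        1,7.5,3,2,14,
    0.8,7,2.8,2,15.8,      1,4,3.1,2.2,15.3,
]).reshape(20, 5)

C = np.corrcoef(data, rowvar=False)    # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(C)   # eigh is suited to symmetric matrices

order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]     # loadings of the first principal component, Eq. (19)
lam1 = eigvals[0]      # its variance, Eq. (20)
# Eigenvector signs are arbitrary: flip so the largest loading is positive
u1 = u1 * np.sign(u1[np.argmax(np.abs(u1))])
print(np.round(u1, 3), round(lam1, 3))
```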
