**6.2 Factor scores and their use in multivariate models**

A useful by-product of PCA is the factor scores. Factor scores are the coordinates of the subjects (individuals) on each component: they indicate where a subject stands on the retained components. They are computed as weighted sums of the subject's values on the observed variables. Results for our dataset are reported in Table 6.

Factor scores can be used to plot a reduced representation of the subjects, as displayed in Figure 3.
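As a concrete sketch (using hypothetical data in place of Table 1, and assuming the NumPy and scikit-learn packages), factor scores can be obtained by standardizing the variables and projecting the subjects onto the retained components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for Table 1: 20 subjects on 5 variables (X1..X5).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# Standardize, then project: the factor scores are the coordinates of each
# subject on the components (one row per subject, one column per component).
Z = StandardScaler().fit_transform(X)
scores = PCA(n_components=2).fit_transform(Z)
print(scores.shape)  # (20, 2)
```

A plot in the spirit of Figure 3 is then just a scatter of the two columns, e.g. `plt.scatter(scores[:, 0], scores[:, 1])` with matplotlib.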

How do we interpret the position of points on this diagram? Recall that this graph is a projection; as such, some apparent distances could be spurious. To distinguish faithful projections from distorted ones and better interpret the plot, we use what is called the "quality of representation" of subjects. This is computed as the squared cosine of the angle between a subject $s\_i$ and a component $z\_j$, following the formula:

$$\cos^2(\mathbf{s}\_i, \mathbf{z}\_j) = \frac{z\_{ij}^2}{\left\|\mathbf{s}\_i\right\|^2} = \frac{z\_{ij}^2}{\sum\_{k=1}^p x\_{ik}^2} \tag{30}$$


Table 6. **Factor Scores of Subjects, Contributions and Quality of Representation**

| ID | PCA1 | PCA2 | Cos21 | Cos22 | QL12 | CTR1 | CTR2 |
|----|--------|--------|-------|-------|-------|--------|--------|
| 1  | 1.701  | -1.610 | 0.436 | 0.390 | 0.826 | 5.458  | 6.547  |
| 2  | 1.701  | 0.575  | 0.869 | 0.099 | 0.969 | 5.455  | 0.837  |
| 3  | 1.972  | -0.686 | 0.862 | 0.104 | 0.966 | 7.333  | 1.191  |
| 4  | 3.000  | 2.581  | 0.563 | 0.417 | 0.981 | 16.974 | 16.832 |
| 5  | 2.382  | -1.556 | 0.687 | 0.293 | 0.980 | 10.700 | 6.116  |
| 6  | 1.717  | -0.323 | 0.522 | 0.018 | 0.541 | 5.558  | 0.264  |
| 7  | 0.193  | -0.397 | 0.062 | 0.263 | 0.325 | 0.070  | 0.400  |
| 8  | 0.084  | -1.213 | 0.004 | 0.972 | 0.977 | 0.013  | 3.718  |
| 9  | 1.071  | -0.558 | 0.765 | 0.208 | 0.974 | 2.162  | 0.787  |
| 10 | -0.427 | -0.110 | 0.822 | 0.054 | 0.877 | 0.344  | 0.030  |
| 11 | -1.088 | 0.176  | 0.093 | 0.024 | 0.933 | 2.232  | 0.078  |
| 12 | -1.341 | -1.673 | 0.344 | 0.536 | 0.881 | 3.393  | 7.075  |
| 13 | -0.291 | -0.996 | 0.071 | 0.835 | 0.906 | 0.160  | 2.507  |
| 14 | -0.652 | 0.567  | 0.543 | 0.411 | 0.955 | 0.801  | 0.812  |
| 15 | -0.062 | 3.166  | 0.000 | 0.957 | 0.957 | 0.007  | 25.325 |
| 16 | -1.830 | -0.375 | 0.929 | 0.039 | 0.968 | 6.318  | 0.356  |
| 17 | -1.181 | 3.182  | 0.119 | 0.868 | 0.988 | 2.630  | 25.572 |
| 18 | -2.244 | -0.751 | 0.877 | 0.098 | 0.976 | 9.493  | 1.424  |
| 19 | -2.288 | -0.150 | 0.933 | 0.004 | 0.937 | 9.874  | 0.057  |
| 20 | -2.417 | 0.155  | 0.938 | 0.003 | 0.942 | 11.019 | 0.060  |

Notes: Columns PCA1 and PCA2 display the factor scores on the first and second components, respectively. Cos21 and Cos22 indicate the quality of representation of subjects on the first and second components, respectively. QL12 = Cos21 + Cos22 measures the quality of representation of subjects on the plane formed by the first two components. CTR1 and CTR2 are the contributions of subjects to component 1 and component 2, respectively.

Fig. 3. **Scatterplot of subjects in the first two factors**

Cos2 is interpreted as a measure of goodness-of-fit of the projection of a subject on a given component. Notice that in Eq. (30), $\left\|\mathbf{s}\_i\right\|^2$ is the squared distance of subject $s\_i$ from the origin: it measures how far the subject lies from the center. So if cos2 = 1, the extracted component reproduces the subject's original behavior exactly, and values close to 1 indicate a faithful representation. Since the components are orthogonal, the quality of representation of a subject in a given subspace of components is the sum of the associated cos2 values. This notion is similar to the concept of communality previously defined for variables.
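A minimal numerical sketch of Eq. (30), assuming hypothetical data and NumPy (the eigendecomposition of the correlation matrix stands in for the PCA itself):

```python
import numpy as np

# Hypothetical stand-in for the data of Table 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))

# Standardize with population moments so that Z'Z/n is the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

# Eigendecomposition; columns of V are the principal axes and
# scores[i, j] is subject i's coordinate on component j.
eigval, V = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]          # sort components by variance
eigval, V = eigval[order], V[:, order]
scores = Z @ V

# Eq. (30): squared cosine of subject i on component j.
cos2 = scores**2 / (Z**2).sum(axis=1, keepdims=True)

# The components span the whole variable space, so each row of cos2 sums
# to 1; QL12 (as in Table 6) is the quality on the first two components.
QL12 = cos2[:, :2].sum(axis=1)
```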

In Table 6 we also report these statistics. As can be seen, the two components retained explain more than 80% of the behavior of the subjects, except for subjects 6 and 7. Now that we are confident that almost all the subjects are well represented, we can interpret the graph. Subjects located on the right side, with larger coordinates on the first component (i.e. 1, 9, 6, 3 and 5), have values of X1, X2 and X3 greater than the average. Those located on the left side, with smaller coordinates on the first axis (i.e. 20, 19, 18, 16, 12, 11 and 10), record lower values for these variables. On the other hand, subjects 15 and 17 are characterized by the highest values for variables X4 and X5, while subjects 8 and 13 record the lowest values for these variables.

Very often a small number of subjects can determine the direction of the principal components. This is because PCA relies on means, variances and correlations, and it is well known that these statistics are influenced by outliers or atypical observations in the data. To detect such atypical subjects we define the "contribution", which measures how much a subject contributes to the variance of a component. Contributions (CTR) are computed as follows:

$$\text{CTR}(s\_i, z\_j) = \frac{z\_{ij}^2}{n\lambda\_j} \times 100 \tag{31}$$
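Eq. (31) can be sketched the same way (hypothetical data again, with NumPy); a useful sanity check is that, for each component, the contributions of the n subjects sum to 100%:

```python
import numpy as np

# Hypothetical stand-in for the data of Table 1.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

# PCA via the correlation matrix: eigenvalues and subject scores.
eigval, V = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]
eigval, V = eigval[order], V[:, order]
scores = Z @ V

# Eq. (31): contribution of subject i to component j, in percent.
CTR = scores**2 / (n * eigval) * 100

# Since sum_i scores[i, j]**2 = n * eigval[j], each column sums to 100:
print(np.allclose(CTR.sum(axis=0), 100.0))  # True
```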

Contributions are reported in the last two columns of Table 6. Subject 4 contributes greatly to the first component, with a contribution of 16.97%: this subject alone explains 16.97% of the variance of the first component. Accordingly, it takes high values for X1, X2 and X3, which can easily be verified in the original Table 1. Regarding the second component, subjects 15 and 17 each explain over 25% of the variance accounted for by this component. These subjects exhibit high values for variables X4 and X5.

The principal components obtained from PCA can be used in subsequent analyses (regressions, poverty analysis, classification, etc.). For example, in linear regression models, the presence of correlated variables poses the well-known econometric problem of multicollinearity, which makes the regression coefficients unstable. This problem is avoided when regressing on the principal components, which are orthogonal to one another. At the end of the analysis you can re-express the model in terms of the original variables using the equations defining the principal components. If there are variables that are not correlated with the other variables, you can delete them prior to the PCA and reintroduce them in your model once the model is estimated.
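As an illustration of this use, here is a principal component regression sketch on hypothetical, deliberately collinear data, using scikit-learn (the variable names below are illustrative, not from the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical predictors: x1 and x2 are almost perfectly collinear.
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 0.01 * rng.normal(size=100),
                     rng.normal(size=100)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(size=100)

# Regress on a few orthogonal component scores instead of the raw,
# correlated regressors.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
T = pca.fit_transform(Z)                 # orthogonal scores
gamma = LinearRegression().fit(T, y)

# Re-express the model on the standardized original variables using the
# loadings that define the components: beta = W' gamma.
beta = pca.components_.T @ gamma.coef_
```

Regressing on `T` sidesteps the near-singular design matrix that the raw collinear columns would produce.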

### **7. A Case study with illustration using SPSS**

We collected data on 10 socio-demographic variables for a sample of 132 countries. We use these data to illustrate how to perform PCA using the SPSS software package. By following the indications provided here, users can reproduce the results obtained.

To perform a principal components analysis with SPSS, follow these steps:

In what follows, we review and comment on the main outputs.

**Kaiser-Meyer-Olkin Measure of Sampling Adequacy** (Kaiser, 1974): This measure varies between 0 and 1, and values closer to 1 are better. A value of 0.6 is a suggested minimum for a good PCA.

**Bartlett's Test of Sphericity** (Bartlett, 1950): This tests the null hypothesis that the correlation matrix is an identity matrix, in which all of the diagonal elements are 1 and all off-diagonal elements are 0. We reject the null hypothesis when the significance level is below 0.05.

The results reported in Table 8 suggest that the data may be grouped into a smaller set of underlying factors.

| Test | Statistic | Value |
|---|---|---|
| Kaiser-Meyer-Olkin Measure of Sampling Adequacy | | .913 |
| Bartlett's Test of Sphericity | Approx. Chi-Square | 1407.151 |
| | df | 45 |
| | Sig. | .000 |

Table 8. **Results of KMO and Bartlett's Test**

**Eigenvalues and number of meaningful components**

Table 9 displays the eigenvalues, the percent of variance and the cumulative percent of variance from the observed data. Earlier it was stated that the number of components computed is equal to the number of variables being analyzed, necessitating that we decide how many components are truly meaningful and worthy of being retained for interpretation.

| Component | Total | % of Variance | Cumulative % | Total | % of Variance | Cumulative % |
|---|---|---|---|---|---|---|
| 1 | 7.194 | 71.940 | 71.940 | 7.194 | 71.940 | 71.940 |
| 2 | .780 | 7.801 | 79.741 | .780 | 7.801 | 79.741 |
| 3 | .667 | 6.675 | 86.416 | | | |
| 4 | .365 | 3.654 | 90.070 | | | |
| 5 | .302 | 3.022 | 93.092 | | | |
| 6 | .236 | 2.361 | 95.453 | | | |
| 7 | .216 | 2.162 | 97.615 | | | |
| 8 | .106 | 1.065 | 98.680 | | | |
| 9 | .095 | .946 | 99.626 | | | |
| 10 | .037 | .374 | 100.000 | | | |

Notes: The first three columns of figures are the Initial Eigenvalues; the last three are the Extraction Sums of Squared Loadings.

Table 9. **Eigenvalues**

Here only component 1 demonstrates an eigenvalue greater than 1.00, so the Kaiser eigenvalue-one criterion would lead us to retain and interpret only this component. The first component provides a reasonable summary of the data, accounting for about 72% of the total variance of the 10 variables. Subsequent components each contribute less than 8%.
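These two diagnostics and the eigenvalue-one criterion can be sketched directly from their published formulas with NumPy and SciPy (hypothetical data; `bartlett_sphericity` and `kmo` are illustrative helper names, not SPSS or library functions):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's (1950) chi-square test that the correlation matrix is identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, df, chi2.sf(stat, df)

def kmo(X):
    """Kaiser-Meyer-Olkin measure: large when partial correlations are small."""
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    P = -Rinv / np.outer(d, d)            # anti-image (partial) correlations
    off = ~np.eye(R.shape[0], dtype=bool)
    r2, p2 = (R[off] ** 2).sum(), (P[off] ** 2).sum()
    return r2 / (r2 + p2)

# Hypothetical data driven by one strong common factor, so KMO should be
# high and Bartlett's test should reject sphericity.
rng = np.random.default_rng(4)
f = rng.normal(size=(200, 1))
X = f + 0.3 * rng.normal(size=(200, 6))

stat, df, pval = bartlett_sphericity(X)

# Kaiser eigenvalue-one criterion on the correlation matrix.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
n_keep = int((eigvals > 1).sum())
```

With these data one component dominates, mirroring the Table 9 situation in which only the first eigenvalue exceeds 1.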
