#### **4.3 Principal component analysis in the example**

Note that an eigenvector can be multiplied by −1, changing the signs of all its elements. In the following, this is done with PC1 so that SYS and DIAS have positive loadings. Our interpretations, which draw on the scientific and medical context of the study, are BPtotal, SIZE, AGE, OVERWT, and BPdiff; they are written below the eigenvectors. The interpretations are based on the relative sizes of the loadings, that is, on which loadings are large and which are small. Taking 0.6 as a cutoff point, in PC1 the loadings of SYS and DIAS are above the cutoff, while those of the other variables are below it (in fact, below 0.4), so PC1 can be interpreted as an index of total BP. In PC2, the variables WT and HT have large loadings with the same sign, so PC2 can be interpreted as SIZE (**Tables 2** and **3**).
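The sign change is harmless because an eigenvector is determined only up to sign: if $\mathbf{S}\mathbf{a} = \lambda\mathbf{a}$, then $\mathbf{S}(-\mathbf{a}) = \lambda(-\mathbf{a})$. The following minimal Python sketch checks this on a hypothetical 2 × 2 correlation matrix, for which the eigenpairs are known in closed form. The value `r = 0.5` and the helper names `matvec` and `satisfies_eigen_equation` are illustrative assumptions and are not taken from the LA heart analysis.

```python
import math

# Hypothetical 2x2 correlation matrix; r is an illustrative value,
# not a correlation from the LA heart data.
r = 0.5
S = [[1.0, r],
     [r, 1.0]]


def matvec(M, x):
    """Multiply a matrix M (given as a list of rows) by a vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def satisfies_eigen_equation(M, lam, a, tol=1e-12):
    """Check M a = lam * a element by element, up to a small tolerance."""
    return all(abs(lhs - lam * a_i) < tol for lhs, a_i in zip(matvec(M, a), a))


# Closed-form eigenpairs of a 2x2 correlation matrix (unit-length eigenvectors).
c = 1.0 / math.sqrt(2.0)
eigenpairs = [
    (1.0 + r, [c, c]),    # both loadings have the same sign
    (1.0 - r, [c, -c]),   # the loadings have opposite signs
]

for lam, a in eigenpairs:
    flipped = [-a_i for a_i in a]  # the eigenvector multiplied by -1
    print(lam,
          satisfies_eigen_equation(S, lam, a),
          satisfies_eigen_equation(S, lam, flipped))
```

In this toy case, both the original and the sign-flipped vectors satisfy the eigen equation, so the choice of sign is purely a matter of convenient interpretation.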


*Cell Contents: Pearson correlation.*

#### **Table 1.**

*Correlation matrix of five variables—LA heart data.*



#### **Table 3.**

*PC1 is multiplied by −1.*

As above, denote the eigensystem in terms of the eigenpairs

$$(\lambda_v, \ \mathbf{a}_v), \quad v = 1, 2, \dots, p. \tag{41}$$

Then, the eigensystem equations are

$$\mathbf{S}\,\mathbf{a}_v = \lambda_v\,\mathbf{a}_v, \quad v = 1, 2, \dots, p. \tag{42}$$

Here, **S** is taken to be the correlation matrix. Let $\mathbf{1}_v' = (0 \;\, 0 \,\cdots\, 1 \,\cdots\, 0)$, the vector with 1 in the $v$th position and zeroes elsewhere. The covariance between a variable $X_v$ and a PC $C_u$ is

$$\mathcal{C}[X_v, C_u] = \mathcal{C}[\mathbf{1}_v' X, \ \mathbf{a}_u' X] = \mathbf{1}_v' \boldsymbol{\Sigma}\, \mathbf{a}_u = \mathbf{1}_v' \lambda_u \mathbf{a}_u = \lambda_u a_{uv},$$

where $a_{uv}$ is the $v$th element of the vector $\mathbf{a}_u$. The coefficient of correlation is

$$\mathrm{Corr}[X_v, C_u] = \frac{\mathcal{C}[X_v, C_u]}{\mathrm{SD}[X_v]\,\mathrm{SD}[C_u]} = \frac{\lambda_u a_{uv}}{\sigma_v \sqrt{\lambda_u}} = \frac{\sqrt{\lambda_u}\, a_{uv}}{\sigma_v}.$$

When the covariance matrix used is the correlation matrix, each standard deviation $\sigma_v = 1$, and therefore, this correlation is $\sqrt{\lambda_u}\, a_{uv}$. A correlation of size greater than 0.6 corresponds to more than $0.6^2 \times 100\% = 36\%$ of variance explained. The variable $X_v$ has a correlation higher than 0.6 with the component $C_u$ if its loading in $C_u$, the value $a_{uv}$, is greater than $0.6/\sqrt{\lambda_u}$. These values are appended to **Table 4**. Loadings larger than this cutoff value are in boldface. (The cutoff point of 0.6 is somewhat arbitrary; one might use, for example, a cutoff of 0.5.)

#### **Table 4.**

*Loadings corresponding to correlations* $> 0.6$ *are boldface.*

#### **Table 5.**

*Estimating the number of PCs by various methods.*
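The relation $\mathrm{Corr}[X_v, C_u] = \sqrt{\lambda_u}\, a_{uv}$ and the resulting loading cutoff $0.6/\sqrt{\lambda_u}$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the eigenvalues below are made-up values for five standardized variables (they sum to 5), not the eigenvalues reported in the tables.

```python
import math


def variable_pc_correlation(eigenvalue, loading):
    """Corr[X_v, C_u] = sqrt(lambda_u) * a_uv, valid when S is a correlation matrix."""
    return math.sqrt(eigenvalue) * loading


def loading_cutoff(eigenvalue, corr_cutoff=0.6):
    """Loading value at which the implied correlation equals corr_cutoff."""
    return corr_cutoff / math.sqrt(eigenvalue)


# Hypothetical eigenvalues (NOT the values from the LA heart example).
eigenvalues = [2.0, 1.5, 0.8, 0.4, 0.3]

for u, lam in enumerate(eigenvalues, start=1):
    print(f"PC{u}: lambda = {lam:.2f}, loading cutoff = {loading_cutoff(lam):.3f}")
```

Because the eigenvectors have unit length, no loading can exceed 1 in absolute value. A computed cutoff above 1, which occurs whenever $\lambda_u < 0.36$, therefore means that no variable can reach a correlation of 0.6 with that PC.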

One can also focus on the pattern of loadings within the different PCs for the interpretation of the PCs. To reiterate this process and the interpretations, we have the following:

PC1: SYS and DIAS have large loadings with the same sign; we interpret PC1 as BPindex, or BPtotal.

PC2: WT and HT have large loadings with the same sign; we interpret PC2 as the man's SIZE.

PC3: Only AGE has a large loading, so we interpret PC3 simply as AGE.

PC4: WT and HT have large loadings with opposite signs; we interpret PC4 as OVERWT (overweight).

PC5: SYS and DIAS have large loadings with opposite signs; we interpret PC5 as BPdiff, the drop from SYS to DIAS.

We continue to marvel at how readily interpretable the PCs are. This simplicity is attained without fitting a factor analysis model and without using rotation to simplify the pattern of the loadings.

#### **4.4 Employing the criteria in the example**

To compare and contrast the methods, **Table 5** shows the eigenvalues and the results of the various criteria for deciding on an adequate number of PCs. According to the rule based on the average eigenvalue, a PC is retained if its eigenvalue is greater than 1, which is the average eigenvalue when working with the correlation matrix. For BIC, the *k*th PC is retained if

$$n \ln \lambda_k > -a_n, \tag{43}$$

where $a_n = \ln n$. Here, $n = 100$ and $\ln n = \ln 100 \approx 4.61$. For AIC, the *k*th PC is retained if $n \ln \lambda_k > -2$. In this example, the methods agree on retaining $k = 2$ PCs.
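The three retention rules can be compared side by side with a short Python sketch. It applies the rules exactly as stated above for a correlation matrix; the function name `retained_by_criteria` and the eigenvalues are hypothetical and are not the values in **Table 5**.

```python
import math


def retained_by_criteria(eigenvalues, n):
    """Return, for each criterion, the 1-based indices of the PCs it retains.

    Rules as stated in the text, for a correlation matrix:
      average eigenvalue: lambda_k > 1
      AIC:                n * ln(lambda_k) > -2
      BIC:                n * ln(lambda_k) > -ln(n)
    """
    rules = {
        "average eigenvalue": lambda lam: lam > 1.0,
        "AIC": lambda lam: n * math.log(lam) > -2.0,
        "BIC": lambda lam: n * math.log(lam) > -math.log(n),
    }
    return {
        name: [k for k, lam in enumerate(eigenvalues, start=1) if keep(lam)]
        for name, keep in rules.items()
    }


# Hypothetical eigenvalues for five standardized variables (NOT from Table 5).
eigenvalues = [2.0, 1.5, 0.8, 0.4, 0.3]
n = 100

for name, kept in retained_by_criteria(eigenvalues, n).items():
    print(f"{name}: retain PCs {kept}")
```

Equivalently, AIC and BIC compare $\lambda_k$ with $\exp(-a_n/n)$, which for $n = 100$ is about 0.980 for AIC and about 0.955 for BIC. The three rules can therefore disagree only for eigenvalues slightly below 1.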

Although the criteria suggest only two PCs, the fourth and fifth PCs do have simple and interesting interpretations; they simply do not improve the fit very much. The third PC is essentially a single variable, age.
