**3. Model selection criteria AIC and BIC for the number of PCs**

Let us now delve a bit further into mathematical statistics and consider some more objective, numerical criteria, in particular, the information criteria AIC and BIC. Let us see what a Gaussian model would imply about AIC and BIC. The maximum likelihood for the model (\*) approximating the *p* variables in terms of *k* PCs is

$$\max L\_k = \left(2\pi\right)^{-np/2} \left|\Sigma\_k\right|^{-n/2} C(n, p, k),$$

where *C*(*n*, *p*, *k*) is a constant depending upon the sample size, *n*, the number of variables, *p*, and *k*, the Model *k* being considered, *k* = 1, 2, …, *K*, and ∣Σ*k*∣ denotes the determinant of the residual covariance matrix Σ*k*.

The determinant of the covariance matrix is the product of the eigenvalues,

$$|\Sigma| = \prod\_{j=1}^{p} \lambda\_j. \tag{29}$$

For a model based on the first *k* PCs, the determinant of the residual covariance matrix is the product of the remaining, smaller eigenvalues, $\prod\_{j=k+1}^{p} \lambda\_j$.

The model selection criterion AIC—Akaike's information criterion [5–7]—is based on an estimate of the logarithm of the cross-entropy of the *K* proposed models with a null model. That is, for alternative models indexed by *k* = 1, 2, …, *K*, AIC*k* is an estimate of the log cross-entropy of the proposed Model *k* with the null model. The cross-entropy of the distribution with the probability density function *q*(**x**) relative to a distribution with the probability density function *p*(**x**) is defined as

$$H(p, q) = -\mathrm{E}\_p\left[\ln q(\mathbf{X})\right] = -\int \ln q(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$$
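As a concrete illustration of this definition, the following sketch evaluates the cross-entropy for two small, hypothetical discrete distributions (the probabilities here are invented for illustration; the definition above is stated for densities, with the integral replaced by a sum in the discrete case):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) ln q(x), the discrete analogue of
    -E_p[ln q(X)]; it equals the entropy of p when q == p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

h_pq = cross_entropy(p, q)  # cross-entropy of q relative to p
h_pp = cross_entropy(p, p)  # entropy of p itself
```

By Gibbs' inequality, *H*(*p*, *q*) ≥ *H*(*p*, *p*), with equality only when *q* = *p*; a model's cross-entropy with the truth is smallest when the model matches the truth.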

The Bayesian information criterion (BIC) [8] is based on a large-sample estimate of the posterior probability *ppk* of Model *k*, *k* = 1, 2, …, *K*. More precisely, BIC*k* is an approximation to −2 ln *ppk*.

Formulated in this way, these model selection criteria (MSCs) are, thus, smaller-is-better criteria and take the form

$$\text{MSC}\_{k} = -2 \text{ ln } \max L\_{k} + a\_{n} m\_{k}, \quad k = 1, 2, \dots, K,\tag{30}$$

where *Lk* is the likelihood for Model *k*, *an* = ln *n* for BIC*k*, *an* = 2 (not depending upon *n*) for AIC*k*, and *mk* is the number of independent parameters in Model *k*. The first term is a lack-of-fit (LOF) term, and the second term is a penalty term based on the number of parameters used. With AIC, the penalty is two units per parameter; with BIC, the penalty is ln *n* units per parameter. For *n* ≥ 8, ln *n* exceeds 2: for sample sizes greater than 7, the penalty per parameter with BIC exceeds that for AIC. Therefore, relative to AIC, BIC tends to favor more parsimonious models—models with a smaller number of parameters.
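The crossover at *n* = 8 can be checked directly, since e² ≈ 7.39. A minimal sketch (the function name is ours, not from the source):

```python
import math

def penalty_per_parameter(n, criterion="AIC"):
    """Per-parameter penalty a_n in Eq. (30): 2 for AIC, ln(n) for BIC."""
    return 2.0 if criterion == "AIC" else math.log(n)

# ln(n) first exceeds 2 at n = 8 (since e^2 ~ 7.39), so for sample
# sizes above 7 BIC penalizes each parameter more heavily than AIC.
for n in (7, 8, 100):
    print(n, penalty_per_parameter(n, "AIC"),
          round(penalty_per_parameter(n, "BIC"), 3))
```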

Note that

$$pp\_k \approx C \exp\left(-\text{BIC}\_k/2\right),\tag{31}$$

where *C* is a constant. Thus, BIC values can be converted to values on a scale of 0–1. This is done by exponentiating −BIC*k/*2 for each model, summing these values, and dividing each by the sum. That is,

$$pp\_k \approx \exp\left(-\text{BIC}\_k/2\right) / \sum\_{j=1}^{K} \exp\left(-\text{BIC}\_j/2\right). \tag{32}$$
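Eq. (32) can be computed directly. A sketch, with hypothetical BIC values; subtracting the minimum BIC before exponentiating is a standard numerical-stability step (the shift cancels in the normalization), not part of the formula itself:

```python
import math

def bic_posteriors(bics):
    """Approximate posterior model probabilities from BIC values,
    pp_k ~ exp(-BIC_k / 2) normalized to sum to 1 (Eq. 32)."""
    b0 = min(bics)  # shift for numerical stability; cancels on division
    weights = [math.exp(-(b - b0) / 2.0) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC values for three candidate models; the smallest
# BIC (Model 2) receives the largest posterior probability.
probs = bic_posteriors([210.4, 205.1, 209.0])
```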

To relate the maximum likelihood to the eigenvalues, note that for the PC model,

$$-2\ln\max L\_k = n\ln \prod\_{j=k+1}^{p} \lambda\_j = n\sum\_{j=k+1}^{p} \ln\lambda\_j.\tag{33}$$
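Combining Eqs. (30) and (33), the criterion value for each candidate *k* can be computed from the eigenvalues alone. A sketch with hypothetical eigenvalues of a correlation matrix (the function name and data are ours; the per-PC parameter count follows the penalty terms in Eq. (35)):

```python
import math

def msc(eigenvalues, k, n, a_n):
    """MSC_k = n * sum_{j=k+1}^p ln(lambda_j) + a_n * k (Eqs. 30, 33, 34).
    eigenvalues: all p eigenvalues, sorted in decreasing order.
    n: sample size; a_n: 2 for AIC, ln(n) for BIC."""
    deviance = n * sum(math.log(lam) for lam in eigenvalues[k:])
    return deviance + a_n * k

# Hypothetical eigenvalues of a 4-variable correlation matrix (sum = p = 4).
eigs = [2.2, 1.1, 0.5, 0.2]
n = 50
aic = [msc(eigs, k, n, 2.0) for k in range(len(eigs) + 1)]
best_k = min(range(len(aic)), key=aic.__getitem__)  # smaller is better
```

For these values AIC selects *k* = 2, matching the retention rule derived below: only the first two eigenvalues exceed exp(2/*n*) = exp(0.04) ≈ 1.04.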

The model selection criteria can be written as

$$\text{MSC}\_{k} = \text{Deviance}\_{k} + \text{Penalty}\_{k},\tag{34}$$

where Deviance*k* = −2 ln max *Lk* is a measure of lack of fit and Penalty*k* = *anmk*. Inclusion of an additional PC is justified if the criterion value decreases, that is, if MSC*k*+1 < MSC*k*. For PCs, this is

$$n\sum\_{j=k+2}^{p} \ln \lambda\_j + (k+1)a\_n < n\sum\_{j=k+1}^{p} \ln \lambda\_j + k\, a\_n. \tag{35}$$

Canceling the terms common to both sides, this reduces to

$$a\_n < n \ln \lambda\_{k+1} = \ln \left(\lambda\_{k+1}^n\right),\tag{36}$$

or

$$\exp\left[a\_n\right] < \lambda\_{k+1}^n,\tag{37}$$

or


$$
\lambda\_{k+1} > \exp\left[a\_n/n\right] \tag{38}
$$

For AIC, *an* = 2, so the inclusion of the additional PC*k*+1 is justified if

$$
\lambda\_{k+1} > \exp\left(2/n\right).\tag{39}
$$

For BIC, the inclusion of an additional PC*k*+1 is justified if

$$
\lambda\_{k+1} > \exp\left(\ln n/n\right) = \left[\exp\left(\ln n\right)\right]^{1/n} = n^{1/n}.\tag{40}
$$

The quantity *n*<sup>1/*n*</sup> tends to 1 for large *n*. Therefore, this procedure is in approximate agreement with the average eigenvalue rule for correlation matrices, stating that one should retain dimensions with eigenvalues larger than 1.
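The convergence of both thresholds toward 1 is easy to tabulate; the sketch below prints the AIC threshold exp(2/*n*) and the BIC threshold *n*<sup>1/*n*</sup> for a few sample sizes:

```python
import math

# Eigenvalue retention thresholds implied by AIC and BIC, compared
# with the average-eigenvalue (Kaiser) rule's threshold of 1.
for n in (25, 100, 1000, 100000):
    aic_thr = math.exp(2.0 / n)  # AIC: retain lambda > exp(2/n)
    bic_thr = n ** (1.0 / n)     # BIC: retain lambda > n^(1/n)
    print(f"n={n:6d}  AIC: {aic_thr:.4f}  BIC: {bic_thr:.4f}")
```

Both thresholds exceed 1 for every finite *n* (so these criteria are slightly stricter than the average eigenvalue rule) and approach 1 as *n* grows.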
