Once the discriminant functions have been computed and each observation has been assigned to a group, we can count the number of observations correctly classified and the number of observations misclassified. Then, we can estimate the classification rate as the number of correctly classified observations over the total number of observations. In general, in evaluating the accuracy of a model, we then have to distinguish between two types of accuracy: the fitting accuracy and the prediction accuracy [43, 54].

The fitting accuracy is the ability to reproduce the data, namely how well the model reproduces the data that were used to build it (the training set). This corresponds to the apparent classification rate and is obtained using the re-substitution procedure.

The prediction accuracy is the ability to predict the value or the class of an observation that was not included in the construction of the model. This kind of accuracy is often referred to as the ability of the model to generalize, and the data used to measure it are called the "test set". The prediction accuracy can be called the "actual classification rate". It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. To obtain an estimate of the actual classification rate, two main procedures can be applied: the hold-out and cross-validation [43].

In the hold-out, the dataset is divided into two partitions: one partition is used to develop the model (e.g. the discriminant functions) and the second partition is given as input to the model. The first partition is usually called the "training set" or "calibration set", while the second partition is the validation set [54].

When the number of observations is small, cross-validation is usually preferred over the hold-out. The basic idea of the cross-validation procedure is to divide the entire dataset into *L* disjoint sets. *L*-1 sets are used to develop the model (i.e. the calibration set on which the discriminant functions are computed) and the omitted portion is used to test the model (i.e. the validation set given as input to the model). This is repeated for all the *L* sets and an average result is obtained.

Apparent or actual classification accuracies can be summarized in a confusion matrix. As an example, out of a total of *N* observations, *n*<sub>1</sub> belong to group 1 and *n*<sub>2</sub> belong to group 2. *C*<sub>11</sub> is the total number of observations correctly classified in group 1 and *C*<sub>12</sub> is the total number of observations misclassified into group 2. Similarly, *C*<sub>22</sub> is the total number of observations correctly classified in group 2 and *C*<sub>21</sub> is the number misclassified into group 1. The confusion matrix then becomes:

| Actual group | Predicted group 1 | Predicted group 2 |
|---|---|---|
| 1 | *C*<sub>11</sub> | *C*<sub>12</sub> |
| 2 | *C*<sub>21</sub> | *C*<sub>22</sub> |

and the accuracy (the actual or apparent classification rate, *acr*) is computed as:

$$acr = \frac{C_{11} + C_{22}}{n_{1} + n_{2}}$$
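As a minimal numerical illustration of these definitions (with hypothetical class labels), the confusion matrix and *acr* can be computed as follows:

```python
# Confusion matrix and classification rate for hypothetical labels
# (n1 = n2 = 5 observations per group).
import numpy as np

actual    = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
predicted = np.array([1, 1, 2, 1, 1, 2, 2, 1, 2, 2])

# C[i, j] counts observations of actual group i+1 assigned to group j+1.
C = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    C[a - 1, p - 1] += 1

acr = (C[0, 0] + C[1, 1]) / len(actual)   # (C11 + C22) / (n1 + n2)
print(C)      # [[4 1] [1 4]]
print(acr)    # 0.8
```

If `predicted` comes from re-substitution of the training set, this is the apparent classification rate; if it comes from a hold-out or cross-validation procedure, it estimates the actual classification rate.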

## *4.3.2. PCA-LDA*

A powerful analysis tool is the combination of the principal component analysis with the linear discriminant analysis [52]. This is particularly helpful when the number of variables is large. In particular, if the number of observations (*N*) is less than the number of variables (*m*), specifically *N*-1 < *m*, the covariance matrix is singular and cannot be inverted (see section 4.2.3.). We then need to find a way to reduce the number of variables, for example using the PCA [49, 55]. This procedure has been widely used for several problems in different fields [35, 52, 56-60]. The condition *N*-1 < *m* almost always occurs in spectroscopy, where the number of observations (*N*) is usually of the order of 10<sup>2</sup>, while the number of variables (*m*) typically ranges from 10<sup>2</sup> to 10<sup>3</sup>.

Let us consider the same situation described for the many-group linear discriminant analysis: the original dataset is an ensemble of multivariate observations partitioned into *k* distinct groups. Again, we want to find the discriminant functions that optimally separate our multivariate observations into the *k* groups; the discriminant functions can then be used to identify the most important variables in terms of their ability to distinguish among the groups. Thus, the original dataset is first submitted to the PCA to reduce the number of variables; subsequently, the reduced dataset is analyzed using the LDA.
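As an illustration, the following sketch applies this two-step procedure to synthetic data standing in for spectra (scikit-learn is assumed to be available; the ten retained components are a hypothetical choice):

```python
# A sketch of the PCA-LDA procedure on synthetic "spectra".
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
N, m = 60, 500                      # N - 1 < m: covariance matrix is singular
X = rng.normal(size=(N, m))
y = np.repeat([0, 1], N // 2)       # two known groups (supervised setting)
X[y == 1, :50] += 0.8               # small group-dependent shift

# Step 1: PCA reduces the m variables to a few scores.
# Step 2: LDA computes the discriminant functions on the reduced data.
pca_lda = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())

# 5-fold cross-validation gives an estimate of the actual classification rate.
print(cross_val_score(pca_lda, X, y, cv=5).mean())
```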

An alternative to the PCA in this scheme is the PLS, as described in the next section.

## *4.3.3. PLS-LDA*

In a way analogous to the PCA-LDA procedure, here we first apply the PLS algorithm to the original data and then the LDA on the selected principal components [61].

Given that the PLS searches for a set of components that performs a simultaneous decomposition of the dependent and independent datasets, the main difference from PCA-LDA is that the components produced by PLS better describe the relationship between independent and dependent variables. This does not necessarily mean that the method is better in general. Indeed, applying PCA or PLS to the same dataset often leads to similar results [62, 63], and the classification accuracy or descriptive ability is mostly determined by the underlying structure of the data, which can make one of the two methods more suitable than the other.
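A minimal sketch of this variant, under the same assumptions as the PCA-LDA example (scikit-learn, synthetic data), replaces the PCA step with a PLS decomposition of *X* against the numerically coded class labels:

```python
# A sketch of the PLS-LDA variant.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
N, m = 60, 500
X = rng.normal(size=(N, m))
y = np.repeat([0, 1], N // 2)
X[y == 1, :50] += 0.8

# PLS components are computed against the class labels, not on X alone.
pls = PLSRegression(n_components=10).fit(X, y.astype(float))
scores = pls.transform(X)            # reduced dataset (PLS scores)

lda = LinearDiscriminantAnalysis().fit(scores, y)
print(lda.score(scores, y))          # apparent (fitting) accuracy
```

The printed value is the apparent (fitting) accuracy obtained by re-substitution; the actual accuracy should be estimated by hold-out or cross-validation, as discussed above.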

#### *4.3.4. Cluster Analysis (CA)*

The goal of cluster analysis is to find the best grouping of the multivariate observations such that the clusters are dissimilar to each other but the observations within a cluster are similar [44].

CA is an unsupervised technique, that is, the group membership of the observations (and often the number of groups) is not known in advance.

First, we have to define a measure of similarity or dissimilarity, also called a distance function. The most common distance functions are: i) the Euclidean distance; ii) the Manhattan distance; iii) the Mahalanobis distance; iv) the maximum norm.
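For two observations *x* and *y*, these four distances can be sketched as follows (the identity inverse covariance used for the Mahalanobis distance is a placeholder; in practice it is estimated from the data):

```python
# The four distance functions for two observations x and y.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))       # straight-line distance
manhattan = np.sum(np.abs(x - y))               # sum of absolute differences
max_norm  = np.max(np.abs(x - y))               # largest single difference

VI = np.eye(len(x))                             # inverse covariance (placeholder)
mahalanobis = np.sqrt((x - y) @ VI @ (x - y))   # covariance-weighted distance

print(euclidean, manhattan, max_norm, mahalanobis)
```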

Based on the procedure they use, clustering algorithms can be divided into three main groups: hierarchical, partitional and density-based clustering. None of these algorithms is inherently better than the others: the choice of the clustering method strongly depends on the structure of the data and on the kind of results one expects.

Hierarchical clustering algorithms can be further subdivided into agglomerative and divisive. Agglomerative clustering starts with each observation in its own cluster, and at each step an observation or a cluster of observations is merged into another cluster. The most commonly employed agglomerative strategies are complete-linkage, average-linkage, single-linkage and centroid-linkage. The drawback of agglomerative algorithms is that observations cannot be moved among the clusters once a cluster has been formed.
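A minimal agglomerative clustering sketch, assuming SciPy's hierarchy module and synthetic two-group data:

```python
# Agglomerative clustering on synthetic two-group data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 3)),
               rng.normal(3.0, 0.5, size=(10, 3))])

# 'average' selects average-linkage; 'complete', 'single' and 'centroid'
# select the other strategies mentioned above.
Z = linkage(X, method='average', metric='euclidean')

# Cut the resulting hierarchy into two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```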

The divisive method starts with one single cluster containing all observations and then divides a cluster into two sub-clusters at each step. Divisive methods share the drawback of agglomerative clustering: once a cluster has been formed, an observation cannot be moved to another cluster. Divisive methods are best suited when large clusters are sought.

Partitional algorithms assign the observations to a set of clusters without using a hierarchical approach. One of the most widely used non-hierarchical approaches is k-means clustering.
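A corresponding k-means sketch, assuming scikit-learn and the same kind of synthetic data:

```python
# k-means clustering on synthetic two-group data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 3)),
               rng.normal(3.0, 0.5, size=(10, 3))])

# k-means alternates between assigning each observation to the nearest
# centroid and recomputing the centroids, for a fixed number of clusters k.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
print(km.cluster_centers_)
```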

Density-based clustering searches for regions of high observation density, without making any assumption about the shape of the clusters.

In the following, we discuss some examples from the literature in which FTIR (micro)spectroscopy is supported by multivariate analysis, with the aim to highlight the importance of linking the two approaches to extract the most significant spectral information from highly informative systems.

In some cases, PCA alone represents a powerful method for the analysis of multidimensional FTIR spectra. Indeed, several interesting works are reported in the literature in which this approach is employed to support the spectroscopic investigation of complex biological systems and processes. For instance, synchrotron-based FTIR microspectroscopy coupled with PCA has been applied to the characterization of human corneal stem cells [27, 66], in cancer research for the screening of cervical cancer [14], as well as to disclose the effects induced by a surface glycoprotein in colon carcinoma cells [67].

In particular, Matthew German and colleagues [68] coupled high-resolution synchrotron radiation-based FTIR (SR-FTIR) microspectroscopy with PCA to investigate the characteristics of putative adult stem cell (SC), transiently amplified (TA) cell, and terminally differentiated (TD) cell populations of the corneal epithelium. Using PCA, each spectrum, composed of many variables (the wavenumbers), is reduced to a point in a low-dimensional space, so that each observation can be visualized in a two- or three-dimensional score plot. By choosing the appropriate principal components, the authors were able to clearly distinguish the three cell populations, confirming the ability of SR-FTIR microspectroscopy to identify SC, TA cell, and TD cell populations.

PCA alone is extremely powerful in reducing the number of variables; however, it is not a clustering algorithm, and the grouping into clusters must be done with other techniques.

For example, Tanthanuch and colleagues applied FTIR microspectroscopy supported by PCA and unsupervised hierarchical cluster analysis (UHCA) to identify specific spectral markers of the differentiation of murine embryonic stem cells (mESCs) and to distinguish them into different neural cell types [25]. In particular, focal plane array (FPA)-FTIR and SR-FTIR microspectroscopy measurements, performed on cell clumps and single cells respectively, made it possible to obtain a biochemical fingerprint of the different mESC developmental stages, namely embryoid bodies (EBs), neural progenitor cells (NPCs) and embryonic stem-derived neural cells (ESNCs). Notably, the results obtained on cell clumps and on single cells were found to be comparable, corroborating the FPA-FTIR results on cell clumps. The analysis of second-derivative spectra highlighted important spectral changes occurring during ES cell differentiation, mainly in the lipid CH<sub>2</sub> and CH<sub>3</sub> stretching region and in the protein amide I band. Overall, these results indicated that during neural differentiation the cell lipid content increased significantly, likely reflecting modifications in cell membranes, whose lipid content is known to play a key role in neural cell differentiation and signal transduction. Moreover, changes in the profile of the amide I band, mainly involving the alpha-helix component around 1650-1652 cm<sup>-1</sup>, indicated an increased expression of alpha-helix-rich proteins in ESNCs compared with their progenitor cells, a result that could reflect the expression of cytoskeletal proteins, crucial for the establishment of neural structure and function. These results were strongly supported by PCA, which made it possible to disclose the regions of the IR spectrum that contributed most to the spectral variance, namely the amide I band and the C-H stretching region.
