In the spectral data matrix, the columns represent variables (Raman shift or wavenumber) and the rows represent observations (Raman spectra). To analyze data with more than one variable, multivariate data analysis is used. Many multivariate data analysis techniques are available, and their correct use depends on the objective of the analysis, which can be data description or exploratory analysis, discrimination, classification, clustering, regression, or prediction. The methods can also be divided into unsupervised and supervised ones. Unsupervised methods are used when no a priori knowledge is available; they are very useful for finding hidden structure in unlabeled data and are sometimes applied as a first step before supervised methods. Hierarchical cluster analysis (HCA) and principal component analysis (PCA) are examples of unsupervised methods. Supervised methods, on the other hand, require a priori information such as class labels: the analysis involves using a training data set to find the patterns in the data and later validating the model on a test set. One example of a supervised method is partial least squares (PLS).

4.1. Principal component analysis (PCA)

Principal component analysis (PCA) is an unsupervised method often used to reduce the number of variables [24] and for exploratory analysis of data. PCA is based on the decomposition of the covariance matrix of the spectral matrix into eigenvectors and eigenvalues. The eigenvectors (or principal components) define orthogonal axes in the N-dimensional variable space and are ordered by decreasing value of the associated eigenvalues. This means that the principal components are independent of each other and uncorrelated, as opposed to the original variables, which may be correlated. Their decreasing order means that the first principal component explains the maximum amount of variance in the original data, the second explains more variance than the third, and so on. The original data can be considered an M×N matrix of M spectra sampled at N wavenumbers. Applied to this matrix, PCA yields three results: N principal components, an N×N matrix containing the coefficients for the transformation between the original data and the principal components, and N eigenvalues describing the importance of the corresponding principal components. Each of the M experimental spectra can thus be expressed as a linear combination of N 'synthetic' spectra, the principal components. In summary, one advantage of PCA is that, by evaluating the relative importance of the consecutive principal components, the dimension of the original dataset can be reduced to a smaller collection of variables that explains the highest amount of variance. Additionally, because changes in the Raman signal are uncorrelated with the noise in the spectra, random noise and significant spectral changes are separated into different principal components. Therefore, many principal components can be discarded, removing noise without losing useful information from the Raman signal.

4.2. Partial least squares (PLS)

PLS is one of the most widely used multivariate data analysis techniques along with vibrational spectroscopy to estimate and quantify components in a sample [25]. Because it is a supervised method, the concentrations of all constituents in the calibration samples must be known. As with PCA, the noise observed in the spectra is isolated into separate latent variables (LVs), which are left out of the calibration, improving prediction precision; nonlinear relationships between the spectra and the concentrations can also be accommodated to some extent by including additional latent variables.

4.3. Classification and clustering models

Several data analysis methods focus on looking for differences between spectra so that groups of spectra can be identified and classified. The most common methods used in biomedical Raman spectroscopy are k-nearest neighbors (KNN), hierarchical cluster analysis (HCA), artificial neural networks (ANN), discriminant analysis (DA), and support vector machines (SVM). The KNN method compares all spectra in the dataset using similarity metrics between spectra, such as the Euclidean distance. This method has been used in combination with PCA and Raman spectroscopy for the diagnosis of colon cancer [26]. HCA uses a variety of multivariate distance calculations, such as the Euclidean and Mahalanobis metrics, to identify similar spectra and is one of the methods commonly used in Raman and IR imaging [27]. Similarly, artificial neural networks can be used to identify clusters or to find patterns in complex data. ANNs are computational models inspired by the functionality and structure of the central nervous system; the networks consist of interconnected groups of nodes or neurons, which have different functions such as data input, output, storage, or forwarding. The layout of an ANN is defined by the number of layers and the number of neurons per layer. The use of ANNs in the analysis of blood serum Raman spectra allows for the differentiation between patients with Alzheimer's disease, patients with other types of dementia, and healthy individuals [28]. DA is a supervised data analysis technique that requires a priori knowledge of each sample's group membership. DA computes a set of discriminant functions based on linear combinations of variables that maximize the variance between groups and minimize the variance within groups according to Fisher's criterion. It is often very useful to combine the PCA and LDA approaches (the so-called PC-LDA model), which improves the efficiency of classification as it automatically finds the most diagnostically significant features [29–31].
SVMs are kernel-based algorithms that transform the data into a high-dimensional space and construct a hyperplane that maximizes the distance to the nearest data points of the input classes. Raman spectroscopy combined with SVM has been used for cancer screening [32].
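To make the PCA-based noise reduction of Section 4.1 concrete, the following sketch applies the eigendecomposition of the covariance matrix to synthetic spectra. The band position, noise level, number of spectra, and all variable names are illustrative assumptions, not data from this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set (assumed): M = 50 spectra sampled at N = 200 wavenumbers.
# Each spectrum is one Gaussian "Raman band" of varying height plus noise.
M, N = 50, 200
wavenumbers = np.linspace(400, 1800, N)
band = np.exp(-0.5 * ((wavenumbers - 1000.0) / 30.0) ** 2)
heights = rng.uniform(0.5, 1.5, size=(M, 1))
spectra = heights * band + 0.02 * rng.standard_normal((M, N))

# Eigendecomposition of the N x N covariance matrix of the spectra:
mean = spectra.mean(axis=0)
centered = spectra - mean
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]               # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep only the first principal component; the remaining components
# carry mostly random noise and are discarded.
k = 1
scores = centered @ eigvecs[:, :k]              # M x k score matrix
denoised = scores @ eigvecs[:, :k].T + mean     # reconstructed spectra

explained = eigvals[:k].sum() / eigvals.sum()
print(f"variance explained by the first component: {explained:.2%}")
```

With this one-band model, a single principal component captures nearly all systematic variation, and the reconstruction discards most of the random noise, as described above.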
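The PLS calibration of Section 4.2 can likewise be sketched with a minimal PLS1 (NIPALS) implementation on synthetic spectra. This is not the chapter's own procedure but a common formulation; the two-band mixture model, the noise level, and all names below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed calibration set: the analyte (known concentration y) contributes
# one band, an interfering constituent contributes another, plus noise.
M, N = 60, 150
x = np.linspace(400, 1800, N)
band_a = np.exp(-0.5 * ((x - 1000.0) / 25.0) ** 2)   # analyte band
band_b = np.exp(-0.5 * ((x - 1300.0) / 40.0) ** 2)   # interferent band
y = rng.uniform(0.0, 1.0, M)                         # known concentrations
d = rng.uniform(0.0, 1.0, M)                         # interferent levels
X = np.outer(y, band_a) + np.outer(d, band_b) + 0.01 * rng.standard_normal((M, N))

def pls1_fit(X, y, n_lv):
    """PLS1 via the NIPALS algorithm; returns the regression vector."""
    Xr, yr = X.copy(), y.copy()
    W, P, Q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # scores of this latent variable
        tt = t @ t
        p = Xr.T @ t / tt               # loading vector
        q = (yr @ t) / tt
        Xr -= np.outer(t, p)            # deflate X and y
        yr -= q * t
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)

# Mean-center, fit with 2 latent variables, predict the calibration set.
x_mean, y_mean = X.mean(axis=0), y.mean()
b = pls1_fit(X - x_mean, y - y_mean, n_lv=2)
y_pred = (X - x_mean) @ b + y_mean

r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"calibration R^2 with 2 LVs: {r2:.3f}")
```

In practice, the number of latent variables is chosen by cross-validation on an independent test set, in line with the training/validation scheme described in the introduction to this section.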
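Finally, the KNN classification by Euclidean distance mentioned in Section 4.3 can be illustrated on two synthetic classes of spectra that differ only in band position; the classes, band centers, and parameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two assumed classes of spectra, distinguished by their band position.
N = 120
x = np.linspace(400, 1800, N)

def make_spectra(center, m):
    band = np.exp(-0.5 * ((x - center) / 30.0) ** 2)
    return band + 0.1 * rng.standard_normal((m, N))

train_X = np.vstack([make_spectra(1000.0, 40), make_spectra(1200.0, 40)])
train_y = np.array([0] * 40 + [1] * 40)
test_X = np.vstack([make_spectra(1000.0, 10), make_spectra(1200.0, 10)])
test_y = np.array([0] * 10 + [1] * 10)

def knn_predict(train_X, train_y, test_X, k=3):
    """Classify each test spectrum by a majority vote of its k nearest
    training spectra under the Euclidean distance."""
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(train_y[row]).argmax() for row in nearest])

pred = knn_predict(train_X, train_y, test_X, k=3)
accuracy = np.mean(pred == test_y)
print(f"KNN accuracy: {accuracy:.2f}")
```

In real applications, KNN is often preceded by PCA (as in the colon cancer study cited above) so that the distances are computed in a low-dimensional, denoised score space rather than on the raw spectra.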
