4. Chemometrics

3.2. Background removal


As mentioned in the last section, one noise source in biological Raman spectra is the fluorescence background. This intrinsic fluorescence emission is several orders of magnitude greater than the Raman scattering intensity of biological tissues; therefore, fluorescence appears as a strong band that obscures the Raman signals and must be removed before the Raman spectra can be analyzed. Background elimination has been performed using two approaches: experimental and computational. The experimental methods rely on changes in the instrumentation and include shifted excitation [7], photobleaching [8], and time gating [9]. Drawbacks of these methods are the relatively complex instrumentation, the long acquisition times, and alterations in the sample, all of which can make the analysis of biological samples difficult.

Computational background removal, on the other hand, is easy to implement, inexpensive, and fast. Such methods include polynomial fitting [10–12], Fourier transform [13], wavelet transform [13], first- and second-order differentiation [14], multiplicative signal correction [15], linear programming [16], geometric approach [17], asymmetric least squares [18], methods based on iterative reweighted quantile regression [19], iterative exponential smoothing [20], and morphology operators [21, 22].

The most used method, however, is polynomial fitting because of its simplicity. In this method, a polynomial is fitted to the spectrum and then subtracted from it to eliminate background effects. The selection of the polynomial order is extremely important, because a higher order polynomial may treat Raman bands as background and may be affected by high frequency noise. To solve this issue, several modified polynomial fitting methods have been proposed. For example, the algorithm proposed by Zhao et al. [11], also known as the Vancouver Raman algorithm (VRA), is widely used for baseline correction in biomedical applications because of its effectiveness and simplicity. The main advantage of this method is that it accounts for noise effects and Raman signal contribution. Figure 2 shows the Raman spectra of in vivo mouse skin tissue with and without fluorescence removal using the polynomial fitting method.

3.3. Normalization

Raman spectra from the same sample can have different intensity levels if they are acquired at different times or under different experimental parameters, such as different laser power levels. Normalization compensates for these differences so that the intensity of a given Raman band of the same material is the same, or as similar as possible, in all spectra recorded under the same experimental parameters. One approach is normalization to area. In this method, the intensity at each frequency in the spectrum is divided by the square root of the sum of the squares of all intensities. Strictly, this scaling gives the spectrum unit Euclidean norm and is often called vector normalization; dividing by the sum of all intensities instead makes the total area under the spectrum equal to 1.0. This kind of normalization is useful when the spectra do not share a common band. It has the advantage that it does not depend on any single band, but one disadvantage is that the background can contribute to the normalization [1]. Another approach is peak normalization, which uses the intensity at the central frequency of a particular Raman band as the reference (internal or external). The 1660 cm<sup>-1</sup> (amide I) and the 1450 cm<sup>-1</sup> (C-H vibrations) bands are commonly
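The two preprocessing steps described so far, polynomial background subtraction and normalization, can be sketched in Python with NumPy. The background estimate below is a plain iterative polynomial fit in which points lying above the current fit are clipped back to it. This is the basic idea behind the modified polynomial methods, but it is not the Vancouver Raman algorithm itself. The function names, the polynomial order, the stopping tolerance, and the synthetic spectrum are all illustrative assumptions.

```python
import numpy as np

def polynomial_baseline(shift, intensity, order=5, n_iter=100, tol=1e-3):
    """Estimate a smooth background by iterative polynomial fitting."""
    # Map the Raman shift axis to [-1, 1] to keep the fit well conditioned.
    x = 2.0 * (shift - shift.min()) / (shift.max() - shift.min()) - 1.0
    work = np.asarray(intensity, dtype=float).copy()
    baseline = work
    for _ in range(n_iter):
        baseline = np.polyval(np.polyfit(x, work, order), x)
        # Points above the fit are treated as Raman bands and clipped to it.
        clipped = np.minimum(work, baseline)
        change = np.linalg.norm(work - clipped) / np.linalg.norm(work)
        work = clipped
        if change < tol:
            break
    return baseline

def vector_normalize(intensity):
    """Divide by the square root of the sum of squared intensities."""
    return intensity / np.sqrt(np.sum(intensity ** 2))

def area_normalize(shift, intensity):
    """Divide by the trapezoidal area under the spectrum."""
    area = np.sum(0.5 * (intensity[1:] + intensity[:-1]) * np.diff(shift))
    return intensity / area

def peak_normalize(shift, intensity, band):
    """Divide by the intensity at the sampled point closest to `band`."""
    return intensity / intensity[np.argmin(np.abs(shift - band))]

# Synthetic spectrum (assumed): two Gaussian bands on a smooth background.
shift = np.linspace(600.0, 1800.0, 601)
background = 50.0 + 30.0 * ((shift - 600.0) / 1200.0) ** 2
bands = (20.0 * np.exp(-0.5 * ((shift - 1450.0) / 8.0) ** 2)
         + 10.0 * np.exp(-0.5 * ((shift - 1660.0) / 8.0) ** 2))
raw = background + bands

corrected = raw - polynomial_baseline(shift, raw)
unit_norm = vector_normalize(corrected)
unit_area = area_normalize(shift, corrected)
peak_scaled = peak_normalize(shift, corrected, band=1450.0)
```

With real spectra, the polynomial order and the number of iterations would need to be tuned, for the reasons given above: too high an order starts to follow the Raman bands themselves.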

Chemometrics uses mathematical and statistical methods to extract chemical and physical information from chemical data or, for the subject under consideration, from spectroscopic data. To identify the components in a sample, one possibility is to use individual bands, but this approach is not the best option because a single band is not specific to one molecule: many molecules have a band at the same position. A more precise identification uses multiple bands or the complete spectrum. This approach treats each point in a spectrum as a variable, so the spectroscopic data can be arranged as a matrix in which the columns represent the variables (Raman shift or wavenumber) and the rows represent the observations (Raman spectra). To analyze data with more than one variable, multivariate data analysis is used. Many multivariate data analysis techniques are available, and their correct use depends on the objective of the analysis. The objective can be data description or exploratory analysis, discrimination, classification, clustering, regression, or prediction. The data analysis methods can also be divided into unsupervised and supervised methods. Unsupervised methods are used when no a priori knowledge is available; they are very useful for finding hidden structure in unlabeled data and are sometimes used as a first step before supervised methods. Hierarchical cluster analysis (HCA) and principal component analysis (PCA) are examples of unsupervised methods. Supervised methods, on the other hand, need a priori information such as class labels, and the analysis involves the use of a training data set to find the patterns in the data and a test set to validate the model afterwards. One example of a supervised method is partial least squares (PLS).
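As a small illustration of this data layout, the following Python sketch builds a data matrix with one row per spectrum and one column per Raman shift, and then splits it into a training set and a test set as a supervised method would require. The intensities are random placeholders and the class labels are hypothetical; only the shapes and the roles of the arrays matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.linspace(600.0, 1800.0, 300)   # N = 300 variables (Raman shift, cm^-1)
n_spectra = 12                            # M = 12 observations (spectra)

# Data matrix X: rows are spectra, columns are Raman shifts.
X = rng.random((n_spectra, shift.size))   # placeholder intensities
y = np.array([0] * 6 + [1] * 6)           # hypothetical class labels

# Unsupervised methods (PCA, HCA) use X only.
# Supervised methods (PLS, DA) also use y and need a held-out test set.
order = rng.permutation(n_spectra)
train_idx, test_idx = order[:8], order[8:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```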


Raman Spectroscopy for In Vivo Medical Diagnosis http://dx.doi.org/10.5772/intechopen.72933 299

Several data analysis methods focus on finding differences between spectra so that groups of spectra can be identified and classified. The most common methods used in biomedical Raman spectroscopy are k-nearest neighbors (KNN), hierarchical cluster analysis (HCA), artificial neural networks (ANN), discriminant analysis (DA), and support vector machines (SVM). The KNN method compares the spectra in the dataset using a similarity metric such as the Euclidean distance and assigns each spectrum to the class most common among its nearest neighbors. This method has been used in combination with PCA and Raman spectroscopy for the diagnosis of colon cancer [26]. HCA uses a variety of multivariate distance calculations, such as the Euclidean and Mahalanobis metrics, to identify similar spectra and is one of the methods used in Raman and IR imaging [27]. Similarly, artificial neural networks can be used to identify clusters or to find patterns in complex data. ANNs are computational models inspired by the functionality and structure of the central nervous system. The networks consist of an interconnected group of nodes or neurons, which have different functions such as data input, output, storage, or forwarding. The layout of an ANN is defined by the number of layers and the number of neurons per layer. The use of ANN in the data analysis of blood serum Raman spectra allows for the differentiation between patients with Alzheimer's disease, other types of dementia, and healthy individuals [28]. DA is a supervised data analysis technique, which requires a priori knowledge of the group membership of each sample. DA computes a set of discriminant functions based on linear combinations of variables that maximize the variance between groups and minimize the variance within groups according to Fisher's criterion. Sometimes it is very useful to combine PCA with linear discriminant analysis (LDA) in a so-called PC-LDA model, which improves the efficiency of classification as it automatically finds the most diagnostically significant features [29–31].
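The PC-LDA idea can be sketched for two classes in Python with NumPy: PCA compresses the spectra to a few scores, and a Fisher discriminant is then computed in that reduced space. The synthetic spectra, the choice of two principal components, and the midpoint decision threshold are assumptions made for this illustration; the sketch does not reproduce the models used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.linspace(600.0, 1800.0, 200)

def band(center, width=10.0):
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

def make_spectra(n, label):
    # Class 1 carries an extra band at 1660 cm^-1 (purely synthetic choice).
    clean = band(1450.0) + label * band(1660.0)
    return clean + 0.02 * rng.standard_normal((n, shift.size))

X_train = np.vstack([make_spectra(10, 0), make_spectra(10, 1)])
y_train = np.repeat([0, 1], 10)
X_test = np.vstack([make_spectra(5, 0), make_spectra(5, 1)])
y_test = np.repeat([0, 1], 5)

# PCA step: keep the first two principal components of the training set.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:2]
Z_train = (X_train - mean) @ components.T
Z_test = (X_test - mean) @ components.T

# LDA step: Fisher direction w = Sw^-1 (m1 - m0) in the score space.
m0 = Z_train[y_train == 0].mean(axis=0)
m1 = Z_train[y_train == 1].mean(axis=0)
Sw = np.zeros((2, 2))
for label, m in ((0, m0), (1, m1)):
    D = Z_train[y_train == label] - m
    Sw += D.T @ D                      # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)

# Classify by which side of the midpoint between class means a score falls.
threshold = w @ (m0 + m1) / 2.0
y_pred = (Z_test @ w > threshold).astype(int)
accuracy = np.mean(y_pred == y_test)
```

The PCA step is what makes the within-class scatter matrix invertible here: with 200 wavenumbers and only 20 training spectra, the scatter matrix of the raw spectra would be singular.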
SVMs are kernel-based algorithms that transform the data into a high-dimensional space and construct a hyperplane that maximizes the distance to the nearest data point of any of the input classes. Raman spectroscopy and SVM have been used as methods for cancer screening [32].

5. Applications

The importance of in vivo Raman spectroscopy lies in its number of potential biomedical applications. One application is in vivo noninvasive diagnosis, and most research papers focus on cancer and skin diagnosis. In this section, a broad overview of applications in cancer and skin diagnosis is given, with a focus on developments over the past 5 years.

5.1. Cancer diagnosis

One of the most common clinical targets under investigation with Raman spectroscopy is cancer, because biological samples can be measured in vivo, in a minimally invasive way, and without labeling. One important step that enables the introduction of in vivo measurements of cancer in

4.3. Classification and clustering models

4.1. Principal component analysis (PCA)

Principal component analysis (PCA) is an unsupervised method often used to reduce the number of variables [24] and for exploratory data analysis. PCA is based on the decomposition of the covariance matrix of the spectral data matrix into eigenvectors and eigenvalues. The eigenvectors (or principal components) are mutually orthogonal axes in the n-dimensional variable space and are ordered by decreasing value of the associated eigenvalue. This means that the principal components are uncorrelated with each other, as opposed to the original variables, which may be correlated. Their decreasing order also means that the first principal component explains the maximum amount of variance in the original data, the second explains more variance than the third, and so on. The original data can be considered as an M × N matrix of M spectra sampled at N wavenumbers. Applying PCA to this matrix yields three results: N principal components, an N × N matrix containing the coefficients for the transformation between the original data and the principal components, and N eigenvalues describing the importance of the corresponding principal components. The original M experimental spectra are thus expressed in terms of a new set of N 'synthetic' spectra called principal components.

In summary, one advantage of PCA is that, by evaluating the relative importance of the consecutive principal components, it is possible to reduce the dimension of the original dataset by finding a smaller collection of variables that explain the highest amount of variance. Additionally, because changes in the Raman signal are uncorrelated with the noise in the spectra, the random noise and the significant spectral changes tend to be separated into different principal components. Therefore, many of the low-variance principal components can be discarded, removing noise with little loss of useful information from the Raman signal.
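The procedure described above can be sketched in Python with NumPy as an eigendecomposition of the covariance matrix. The function name `pca`, the synthetic two-component spectra, and the choice of keeping k = 2 components are assumptions made for this example.

```python
import numpy as np

def pca(X):
    """PCA of an M x N data matrix (rows are spectra)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)     # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X_centered @ eigvecs              # spectra in the new coordinates
    return eigvals, eigvecs, scores

# Synthetic data (assumed): 30 spectra that are mixtures of two bands plus noise.
rng = np.random.default_rng(0)
shift = np.linspace(600.0, 1800.0, 50)

def band(center, width=30.0):
    return np.exp(-0.5 * ((shift - center) / width) ** 2)

conc = rng.random((30, 2))
clean = conc @ np.vstack([band(1000.0), band(1450.0)])
X = clean + 0.01 * rng.standard_normal(clean.shape)

eigvals, eigvecs, scores = pca(X)
explained = eigvals / eigvals.sum()            # fraction of variance per component

# Keep the first k components and discard the rest, which carry mostly noise.
k = 2
X_denoised = X.mean(axis=0) + scores[:, :k] @ eigvecs[:, :k].T
```

Here the columns of `eigvecs` are the principal components, `eigvals` gives their relative importance, and `scores` holds the coordinates of each spectrum along the components. In practice, the number of components to keep is usually chosen from the explained variance rather than fixed in advance.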

4.2. Partial least squares (PLS)

PLS is one of the multivariate data analysis techniques most widely used with vibrational spectroscopy to estimate and quantify components in a sample [25]. Because it is a supervised method, the concentrations of the constituents of interest in the calibration samples must be known. As with PCA, the noise observed in the spectra is isolated into separate latent variables (LVs), which are left out of the calibration, improving prediction precision. Nonlinear relationships between the properties of interest and intensity can be accommodated in a PLS model by including multiple LVs.
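A PLS calibration for a single property can be sketched in Python with NumPy using the NIPALS form of the PLS1 algorithm. The function names `pls1_fit` and `pls1_predict`, the synthetic spectra with an analyte band overlapped by an interferent band, and the choice of two LVs are assumptions made for this example. In practice, the number of LVs is usually selected by cross-validation.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """PLS1 regression (NIPALS) for one response variable."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr                  # weights: covariance of X with y
        w /= np.linalg.norm(w)
        t = Xr @ w                     # scores of this latent variable
        tt = t @ t
        p = Xr.T @ t / tt              # X loadings
        q_a = (yr @ t) / tt            # y loading
        Xr = Xr - np.outer(t, p)       # deflate X and y
        yr = yr - q_a * t
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    return (X - x_mean) @ coef + y_mean

# Synthetic calibration data (assumed): analyte band overlapped by an interferent.
rng = np.random.default_rng(0)
shift = np.linspace(600.0, 1800.0, 100)
analyte = np.exp(-0.5 * ((shift - 1450.0) / 20.0) ** 2)
interferent = np.exp(-0.5 * ((shift - 1480.0) / 20.0) ** 2)

def simulate(n):
    c, d = rng.random(n), rng.random(n)        # analyte and interferent levels
    noise = 0.01 * rng.standard_normal((n, shift.size))
    return np.outer(c, analyte) + np.outer(d, interferent) + noise, c

X_cal, c_cal = simulate(40)
X_val, c_val = simulate(20)

coef, x_mean, y_mean = pls1_fit(X_cal, c_cal, n_lv=2)
c_pred = pls1_predict(X_val, coef, x_mean, y_mean)
rmse = np.sqrt(np.mean((c_pred - c_val) ** 2))
```

Only the analyte concentrations are passed to the calibration. The interferent levels are never given to the model, and the second LV absorbs their contribution to the spectra.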
