Moreover, in reflectance mode, where the sample is placed onto proper reflective slides, the IR beam passes through the sample, is reflected by the slide, and passes through the sample again. In particular, the sample slides reflect mid-infrared radiation almost completely and are usually also transparent to visible light, allowing sample inspection by a conventional light microscope. This approach is, for instance, useful for tissue characterization.

Finally, in the ATR approach, where the sample is placed in contact with an IR-transparent element of higher refractive index (mainly germanium or diamond), samples of greater thickness than in transmission can be processed. In particular, the IR beam reaches the interface between the ATR element and the sample at an angle larger than the critical angle for total reflection. In this way the beam is totally reflected at the interface and penetrates into the sample as an evanescent wave, where it can be absorbed. The beam penetration depth is of the order of the IR wavelength (a few micrometers) and depends on the wavelength, the incident angle, and the refractive indices of the sample and of the ATR element. Furthermore, it should be noted that this approach also makes it possible to measure samples that are not deposited onto an IR-transparent support, as ATR measurements only require that the sample be in close contact with the ATR element.

For a review of the technical aspects of FTIR microspectroscopy, see [40-42].

**4. Multivariate analyses**

#### **4.1. Introduction to multivariate analysis**

Several phenomena can only be described or explained by taking into account several variables at the same time. These cases represent the realm of multivariate statistical analysis (MVA).

We now define the structure of our data, which will be kept throughout the text for all the described techniques. For a given phenomenon we perform a certain measurement and store the values in a uni- or multivariate variable $\mathbf{y} = (y\_1, y\_2, \dots, y\_m)^T$, where *m* is the number of independent variables. The same measurement can be repeated several times on the same sample or on different samples. We then define a *group* as a collection of two or more replicas of the same experiment, and we use the term *instance* or *observation* to refer to a specific experiment within one group.

Each instance of the variable **y** is stored in a matrix **Y** composed of *n* rows (the observations) and *m* columns (the independent variables).

$$\mathbf{Y} = \begin{pmatrix} y\_{11} & y\_{12} & \dots & y\_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ y\_{n1} & y\_{n2} & \dots & y\_{nm} \end{pmatrix} = \begin{pmatrix} \mathbf{y}\_{1}, \mathbf{y}\_{2}, \dots, \mathbf{y}\_{n} \end{pmatrix}^{T} \tag{1}$$

Each element of the matrix **Y** can be indicated as $y\_{ij}$, where *i* indicates the observation and *j* the independent variable. In some cases we want to find or explain the relationship between the independent variables **Y** and another set of uni- or multivariate variables **Z**. Similarly to the **Y** matrix, the matrix **Z** has *n* rows, one for each observation, and *m* columns, the dependent variables.

$$\mathbf{Z} = \begin{pmatrix} z\_{11} & z\_{12} & \dots & z\_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ z\_{n1} & z\_{n2} & \dots & z\_{nm} \end{pmatrix} \tag{2}$$

The matrix **Y** (composed of the independent variables **y**) represents the only input for several of the multivariate techniques described here; in other cases both matrices **Y** and **Z** (composed of the dependent variables **z**) are required.
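For concreteness, this data layout can be sketched in a few lines of NumPy; the sizes, the random values, and the choice of two dependent variables are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 50, 6                   # n observations (rows), m independent variables (columns)
Y = rng.normal(size=(n, m))    # matrix Y of equation (1): Y[i, j] corresponds to y_ij
Z = rng.normal(size=(n, 2))    # matrix Z of equation (2), here with two dependent variables

print(Y.shape, Z.shape)        # (50, 6) (50, 2)
```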

In the following part we will make a distinction between regression and classification techniques. However, it should be clear that the separation between these two domains is not always sharp, and the same technique can often be used for either regression or classification purposes.

#### **4.2. Multivariate regression techniques**






#### *4.2.1. Linear Multivariate Regression (LMVR)*

LMVR (or MLR) can be used to model linear relationships between one or more dependent variables **z** and one or more independent variables **y**. In the most general case, we have *n* observations of the multivariate independent variable **y**, represented by the matrix **Y**, and the corresponding multivariate response variable **z**, stored in the matrix **Z**.

LMVR is based, as are many other statistical techniques, on the general linear model $\mathbf{Z} = \mathbf{Y}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\beta}$ is a matrix containing the parameters to be estimated and $\boldsymbol{\varepsilon}$ is a matrix which models the errors or noise. The coefficients $\boldsymbol{\beta}$ are usually estimated by ordinary least squares, which consists of minimizing the sum of the squared differences between the *n* observed responses and their modeled values. Mathematically, the optimal values of $\boldsymbol{\beta}$ are given by $\boldsymbol{\beta} = (\mathbf{Y}^{T}\mathbf{Y})^{-1}\mathbf{Y}^{T}\mathbf{Z}$. To apply the least-squares method we must have $n - 1 > m$ (i.e., the number of observations must be larger than the number of variables, which is often not the case), otherwise the matrix $\mathbf{Y}^{T}\mathbf{Y}$ is singular and cannot be inverted. Another common problem is correlation between the variables; more specifically, none of the independent variables may be a linear combination of the others. This phenomenon is called "multicollinearity" [43, 44] and it will be explained in more detail in section 4.2.3.
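A minimal NumPy sketch of this estimate, assuming **Y** and **Z** follow the layout of section 4.1; `lstsq` is used instead of the explicit inverse $(\mathbf{Y}^{T}\mathbf{Y})^{-1}$ for numerical stability:

```python
import numpy as np

def lmvr_fit(Y, Z):
    """Estimate beta in Z = Y @ beta + eps by ordinary least squares."""
    # Mathematically equivalent to inv(Y.T @ Y) @ Y.T @ Z, but lstsq avoids
    # forming Y.T @ Y, which is singular when n - 1 <= m or when the
    # columns of Y are collinear (the multicollinearity problem).
    beta, residuals, rank, _ = np.linalg.lstsq(Y, Z, rcond=None)
    if rank < Y.shape[1]:
        print("warning: Y is rank deficient (multicollinearity)")
    return beta

# usage: beta = lmvr_fit(Y, Z); Z_hat = Y @ beta
```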

#### *4.2.2. Non-Linear Multivariate Regression (NLMVR)*

In some cases linear models cannot be used and one could try to apply non-linear models.

Common models that frequently apply to natural phenomena are exponential models (which are, in fact, transformed linear models: a linear model can be applied to the logarithm of the data), logistic models, or power-law models. The regressed model has the general form $\mathbf{Z} = f(\boldsymbol{\beta}, \mathbf{Y}) + \boldsymbol{\varepsilon}$, where *f* can be any non-linear function.

The optimal values for the coefficients $\boldsymbol{\beta}$ can be obtained using deterministic optimization algorithms such as conjugate gradients [45] or the Levenberg-Marquardt method [46, 47], or stochastic algorithms such as genetic algorithms [48].
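As an illustrative sketch, a univariate exponential model $z = a e^{b y}$ can be fitted with SciPy's `curve_fit`, which by default uses the Levenberg-Marquardt algorithm mentioned above; the data here are simulated:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(y, a, b):
    return a * np.exp(b * y)

rng = np.random.default_rng(1)
y = np.linspace(0.0, 2.0, 40)
z = model(y, 2.0, 1.3) + rng.normal(scale=0.1, size=y.size)

beta_opt, beta_cov = curve_fit(model, y, z, p0=(1.0, 1.0))
print(beta_opt)  # close to the true parameters (2.0, 1.3)

# Since this model is linear after a log transform (log z = log a + b*y),
# a linear fit of log(z) on y provides a quick starting guess for p0.
```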


#### *4.2.3.1. Principal Component Analysis (PCA)*

In PCA the original, possibly correlated, variables are linearly transformed into a new set of uncorrelated variables, the principal components $\mathbf{T} = \mathbf{A}\mathbf{Y}$. Thus, we want to find the matrix **A** such that the covariance matrix of the transformed data, $\mathbf{S}\_{\mathbf{T}}$, is diagonal, which corresponds to finding the eigenvectors of the covariance matrix and the corresponding eigenvalues.

The eigenvalues, which coincide with the diagonal elements of $\mathbf{S}\_{\mathbf{T}}$, are the sample variances of the principal components **T** and are ranked according to their magnitude. The first principal component is then the linear combination with maximal variance (the largest eigenvalue); the second principal component is the linear combination with maximal variance along a direction orthogonal to the first component, and so on [44].

The number of eigenvalues is equal to the number of original variables; however, since the eigenvalues are equal to the variances of the principal components and are sorted in decreasing order, the first *k* eigenvalues can account for a large portion of the variance of the data.

Hence, to describe our original dataset we can use only the first *k* uncorrelated principal components, instead of the complete set of *m* redundant variables. In matrix notation this can be written as $\mathbf{T}\_k = \mathbf{A}\_k \mathbf{Y}$, where $\mathbf{A}\_k$ is the eigenvector matrix truncated to the *k*-th eigenvector and $\mathbf{T}\_k$ is the matrix of the first *k* principal components, also called the score matrix [50].
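A minimal NumPy sketch of this decomposition, assuming the data matrix **Y** of section 4.1; the centering step, and computing the scores from the centered rows, are conventional choices of this sketch:

```python
import numpy as np

def pca_scores(Y, k):
    """Return the first k principal-component scores of Y (rows = observations)."""
    Yc = Y - Y.mean(axis=0)             # center each variable
    S = np.cov(Yc, rowvar=False)        # m x m sample covariance matrix
    eigval, eigvec = np.linalg.eigh(S)  # eigh, since S is symmetric
    order = np.argsort(eigval)[::-1]    # sort by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    A_k = eigvec[:, :k]                 # first k eigenvectors
    T_k = Yc @ A_k                      # score matrix of the first k components
    return T_k, A_k, eigval
```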

Choosing how many (and which) principal components should be retained in order to summarize the data is a task that can be addressed with several strategies [43, 49]. One commonly used rule is to retain the first *k* principal components that explain a given percentage of the total variance, e.g. 90% [43, 44]. Another rule is to plot the eigenvalues in decreasing order: moving from left to right, the eigenvalues usually show an initial steep drop followed by a slow decrease, and all the components after the elbow between the steep and the flat part of the curve are discarded. This test is called the scree plot.
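For instance, the 90% rule can be applied to the eigenvalues returned by the sketch above (the helper name is ours):

```python
import numpy as np

def choose_k(eigval, threshold=0.90):
    """Smallest k whose first k eigenvalues explain `threshold` of the variance."""
    explained = np.cumsum(eigval) / np.sum(eigval)  # cumulative variance fraction
    return int(np.searchsorted(explained, threshold) + 1)
```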

Alternatively, one can select the principal components that can be associated with a physical meaning related to the studied system. For example, when following the differentiation of a cell line grown under different experimental conditions, one principal component may represent the different conditions, while another may describe the maturation stage of the cells. None of the above methods is better than the others; usually more than one test should be applied and the results compared.

Principal component analysis thus makes it possible to obtain uncorrelated variables and, in turn, to remove the multicollinearity problem.

#### *4.2.3.2. Principal Component Regression (PCR): multivariate regression following PCA*

Once a set of *k* principal components has been obtained using the PCA method, they can be used as input variables for a multivariate regression analysis instead of the original data. The regression equation $\mathbf{Z} = \mathbf{Y}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, shown in section 4.2.1, can then be written as $\mathbf{Z} = \mathbf{T}\_k \boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{T}\_k$ is the matrix of the principal components (score matrix) and the regression coefficients $\boldsymbol{\beta}$ can be estimated by least squares. When the number of retained principal components is smaller than the number of observations, the least-squares problem is well posed; moreover, since the principal components are uncorrelated, the multicollinearity problem described in section 4.2.1 is avoided.
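A minimal PCR sketch under the same assumptions as the previous snippets, reusing the `pca_scores` helper defined above:

```python
import numpy as np

def pcr_fit(Y, Z, k):
    """Regress Z on the first k principal-component scores of Y."""
    T_k, A_k, _ = pca_scores(Y, k)                  # scores are centered and uncorrelated
    beta, *_ = np.linalg.lstsq(T_k, Z, rcond=None)  # least squares on the scores
    return beta, A_k

# Prediction for new data (for simplicity the intercept is ignored; in practice
# Z is also centered, or a column of ones is appended to T_k):
# Z_hat = ((Y_new - Y.mean(axis=0)) @ A_k) @ beta
```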

