*4.2.3. The multicollinearity problem*

When the number of observations is smaller than the number of variables (as often happens for spectral data), the matrix $\mathbf{Y}^T\mathbf{Y}$ is singular and cannot be inverted. This rules out the use of standard linear multivariate techniques (LMVR) based on the least-squares criterion, as the solution is not unique.

Increasing the number of observations (above the number of variables) does not always solve the problem, because of the so-called near-multicollinearity: some variables can be written approximately as linear functions of other variables. This situation is common among spectral measurements. Even if the solution is mathematically unique, it may be unstable and lead to poor prediction performance.
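This failure can be checked numerically. A minimal sketch (synthetic data, illustrative names) shows that with fewer observations than variables the cross-product matrix is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fewer observations (n = 10) than variables (m = 50), as is typical for spectra
Y = rng.normal(size=(10, 50))

# The m x m cross-product matrix has rank at most n, so it is singular
YtY = Y.T @ Y
print(np.linalg.matrix_rank(YtY))   # 10, far below m = 50
print(np.linalg.cond(YtY))          # enormous condition number: not invertible in practice
```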

Linearly correlated or quasi-linearly correlated variables therefore have to be removed before applying a regression method. In the following sections, we describe two methods that are frequently used to remove correlations among variables, namely principal component analysis (PCA) and partial least squares (PLS).

#### *4.2.3.1. Principal Component Analysis (PCA)*

We first recall the structure of the data. Suppose that we have *n* observations, each defined by a vector $\mathbf{y}_i$ composed of *m* variables, where *i = 1, 2, ..., n* indexes the observations. The matrix of the original data $\mathbf{Y}$ is then composed of *n* rows (the observations) and *m* columns (the variables).

By using PCA, our intent is to derive a smaller number of uncorrelated artificial variables, called principal components (PC), that account for most of the variance in the observed variables. The new uncorrelated variables are obtained as linear combinations of the original data, $\mathbf{T} = \mathbf{A}\mathbf{Y}$. Correlation among variables can be measured using the covariance matrix.

Given the sample mean of the m-dimensional vectors $\mathbf{y}_i$, $\left\langle \mathbf{y} \right\rangle = \frac{1}{n}\sum_{i=1}^{n} \mathbf{y}_i$, an unbiased estimator of the sample covariance matrix is $\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n} \left( \mathbf{y}_i - \left\langle \mathbf{y} \right\rangle \right)\left( \mathbf{y}_i - \left\langle \mathbf{y} \right\rangle \right)^T$.

For uncorrelated variables, the off-diagonal values of the sample covariance matrix are zero, that is, $\mathbf{S}$ is diagonal. The covariance of the linearly transformed variables $\mathbf{T} = \mathbf{A}\mathbf{Y}$ is equal to $\mathbf{S}_T = \mathbf{A}\mathbf{S}\mathbf{A}^T$, where $\mathbf{S}$ is the sample covariance of the original data $\mathbf{Y}$ [49].

Thus, we want to find the matrix $\mathbf{A}$ such that the covariance matrix of the transformed data, $\mathbf{S}_T$, is diagonal; this corresponds to finding the eigenvectors of the covariance matrix and the corresponding eigenvalues.


The eigenvalues, which coincide with the diagonal elements of $\mathbf{S}_T$, are the sample variances of the principal components $\mathbf{T}$ and are ranked according to their magnitude. The first principal component is then the linear combination with maximal variance (the largest eigenvalue). The second principal component is the linear combination with maximal variance along a direction orthogonal to the first component, and so on [44].

The number of eigenvalues is equal to the number of original variables; however, since the eigenvalues are equal to the variances of the principal components and are sorted in decreasing order, the first *k* eigenvalues can account for a large portion of the variance of the data.

Hence, to describe our original dataset we can use only the first *k* uncorrelated principal components, instead of the complete set of *m* redundant variables. In matrix notation this can be written as $\mathbf{T}_k = \mathbf{A}_k\mathbf{Y}$, where $\mathbf{A}_k$ is the eigenvector matrix truncated to the k-th eigenvector, and $\mathbf{T}_k$ is the matrix of the first *k* principal components, also called the score matrix [50].

Choosing which principal components, and how many, should be retained to summarize the data is a task that can be approached with several strategies [43, 49]. One commonly used rule is to retain the first *k* principal components that explain a given total percentage of the variance, e.g. 90% [43, 44]. Another rule is to plot the eigenvalues in decreasing order: moving from left to right, the eigenvalues usually show an initial steep drop followed by a slow decrease, and all the components after the elbow between the steep and the flat part of the curve are discarded. This test is called the scree plot.

Alternatively, one can select the principal components that can be associated with a physical meaning related to the studied system. For example, following the differentiation of a cell line grown in different experimental conditions, one principal component may represent the different conditions, while another PC may describe the maturation stage of the cells. None of the above methods is inherently better than the others; usually more than one test should be applied and the results compared.

Principal component analysis thus makes it possible to obtain uncorrelated variables and hence to remove the multicollinearity problem.
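As a minimal sketch of the whole procedure (eigendecomposition of the covariance matrix, ranking of the eigenvalues, truncation to the components explaining 90% of the variance), assuming a data matrix `Y` with the *n* observations in rows; with this layout the projection reads $\mathbf{Y}\mathbf{A}_k$ rather than the $\mathbf{A}_k\mathbf{Y}$ column convention used in the text:

```python
import numpy as np

def pca_scores(Y, explained=0.90):
    """PCA via eigendecomposition of the sample covariance matrix.

    Y: (n, m) data matrix, observations in rows. Returns the scores of the
    first k components, where k is the smallest number of components
    explaining the fraction `explained` of the total variance.
    """
    Yc = Y - Y.mean(axis=0)                  # center each variable
    S = (Yc.T @ Yc) / (Yc.shape[0] - 1)      # unbiased sample covariance
    evals, evecs = np.linalg.eigh(S)         # eigh: S is symmetric
    order = np.argsort(evals)[::-1]          # rank eigenvalues decreasingly
    evals, evecs = evals[order], evecs[:, order]
    frac = np.cumsum(evals) / evals.sum()    # cumulative variance explained
    k = int(np.searchsorted(frac, explained)) + 1
    A_k = evecs[:, :k]                       # truncated eigenvector matrix
    T_k = Yc @ A_k                           # score matrix, (n, k)
    return T_k, A_k, evals

# Example: 30 synthetic "spectra" of 200 strongly correlated variables
rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 200))
T_k, A_k, evals = pca_scores(Y)
print(T_k.shape)   # (30, k) with k << 200
```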

#### *4.2.3.2. Principal Component Regression (PCR): multivariate regression following PCA*

Once a set of *k* principal components has been obtained using the PCA method, they can be used as input variables for a multivariate regression analysis instead of the original data. The regression equation $\mathbf{Z} = \boldsymbol{\beta}\mathbf{Y} + \boldsymbol{\varepsilon}$, shown in section 4.2.1, can be written as $\mathbf{Z} = \boldsymbol{\beta}\mathbf{T}_k + \boldsymbol{\varepsilon}$, where $\mathbf{T}_k$ is the matrix of the principal components (the score matrix) and the regression coefficients $\boldsymbol{\beta}$ can be estimated by least squares. When the number of principal components is equal to the number of variables, this method becomes equivalent to LMVR. By removing correlations in the original data, the PCR method makes it possible to perform linear regression on a multicollinear dataset.
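A corresponding PCR sketch under the same assumptions (illustrative names; the scores are computed as in the previous sketch and the coefficients are then fitted by ordinary least squares, which is well conditioned on the uncorrelated scores):

```python
import numpy as np

def pcr_fit(Y, Z, k):
    """Principal component regression: regress Z on the first k PC scores.

    Y: (n, m) predictors; Z: (n,) or (n, p) dependent variables.
    Returns the coefficients, the loadings and the mean, so that new
    observations can be projected and predicted in the same way.
    """
    y_mean = Y.mean(axis=0)
    Yc = Y - y_mean
    S = (Yc.T @ Yc) / (Yc.shape[0] - 1)
    evals, evecs = np.linalg.eigh(S)
    A_k = evecs[:, np.argsort(evals)[::-1][:k]]   # first k eigenvectors
    T_k = Yc @ A_k                                # score matrix
    beta, *_ = np.linalg.lstsq(T_k, Z, rcond=None)  # least squares on the scores
    return beta, A_k, y_mean

def pcr_predict(Y_new, beta, A_k, y_mean):
    return (Y_new - y_mean) @ A_k @ beta
```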



#### *4.2.3.3. Partial Least Squares (PLS)*

Another way to address the multicollinearity problem is to use PLS. The goal of PLS regression is to predict $\mathbf{Z}$ from $\mathbf{Y}$ and to describe their common structure [50].

In the PCR method described above, the principal components are selected based on their ability to explain the variance of the $\mathbf{Y}$ matrix (the matrix of predictor variables). By contrast, PLS regression finds components of $\mathbf{Y}$ that are also relevant for $\mathbf{Z}$. Specifically, PLS regression searches for a set of components that performs a simultaneous decomposition of $\mathbf{Y}$ and $\mathbf{Z}$, with the constraint that these components explain as much as possible of the covariance between $\mathbf{Y}$ and $\mathbf{Z}$. In this way, compared to PCR, the components contain more information about the relationship between predictors and dependent variables [50]. For categorical dependent variables, the PLS method takes the name of partial least squares discriminant analysis (PLS-DA) [43].
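For comparison, a minimal PLS regression sketch using scikit-learn (array names are illustrative; note that scikit-learn calls the predictors `X` and the response `Y`, the reverse of this chapter's notation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
Y = rng.normal(size=(40, 200))                        # predictors (spectra), n = 40, m = 200
Z = Y[:, :3] @ rng.normal(size=(3,)) \
    + rng.normal(scale=0.1, size=40)                  # dependent variable

# Components are chosen to maximize the covariance between predictors and response
pls = PLSRegression(n_components=5)
pls.fit(Y, Z)                                         # in sklearn terms: fit(X, y)
Z_hat = pls.predict(Y)
print(np.corrcoef(Z, Z_hat.ravel())[0, 1])            # quality of the fit
```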

#### **4.3. Multivariate classification techniques**

Classification methods can be divided into two main categories, supervised and unsupervised. Supervised techniques require the knowledge of the group membership of the observations and can be used to understand the structure of the data, e.g. why certain observations belong to a given group. Moreover, once the classification model is calibrated on a "training" dataset, it can be used in a predictive way to group observations whose group membership is unknown.

On the other hand, unsupervised methods try to group the observations without any knowledge of the group membership.

In the following sections, we describe the main multivariate classification approaches.

#### *4.3.1. Discriminant Analysis (DA)*

Discriminant analysis is mainly a supervised technique which was originally developed by Ronald Fisher as a way to subdivide a set of taxonomic observations into two groups based on some measured features [51]. Later, DA was extended to treat cases where there are more than two groups, the so-called "multiclass discriminant analysis" [49, 52, 53].

DA can have mainly two objectives. First, it can be used in a supervised way to describe and explain the differences among the groups. As we will see later, DA mathematically finds the optimal hyperplane that separates the groups from one another; in other words, it finds the optimal linear combination of the original variables that maximizes the distance among the groups. The transformed observations are called discriminant functions.

The use of a linear combination implies that each original variable is weighted by a coefficient, which can be used to study the relative importance of that variable in the separation among the groups. A second possible role of DA is to classify observations into groups: an observation to be assigned is evaluated by a discriminant function (already calibrated on another dataset) and assigned to the group to which it most likely belongs [43, 44, 49]; in this view, DA is used as a predictive method.

When only linear transformations are applied to the variables used as DA input, the discriminant analysis is called linear discriminant analysis (LDA).

In some cases, LDA alone is not suitable and the original variables can be mapped to a new space via a non-linear function; LDA is then applied in this non-linear space (which is equivalent to non-linear classification in the original space). This procedure is known under several names, such as "non-linear DA" (NLDA), "kernel Fisher discriminant analysis" (KFD) or "generalized discriminant analysis".

In the following sections we will focus on LDA, first describing the descriptive approach and subsequently the classification approach.

#### *4.3.1.1. Linear DA (LDA) as a descriptive method*


The initial dataset is an ensemble of multivariate observations partitioned into *G* distinct groups (e.g. different experimental treatments, times or conditions). Each of the *G* groups contains $n_g$ observations, where *g* runs from *1* to *G* and refers to the g-th group. The multivariate observation vectors can be written as $\mathbf{y}_{gj}$, where *g* denotes the g-th group and *j* the j-th observation. Each vector has size *m*, which corresponds to the number of variables.

Our goal in LDA is to search for the linear combination that optimally separates our multivariate observations into the *G* groups.

The linear transformation of $\mathbf{y}_{gj}$ is written as

$$z_{gj} = \mathbf{w}^T \mathbf{y}_{gj} \tag{3}$$

Since $z_{gj}$ is a linear transformation of $\mathbf{y}_{gj}$, the mean of group *g* of the transformed data can be written as

$$\left\langle z_g \right\rangle = \mathbf{w}^T \left\langle \mathbf{y}_g \right\rangle \tag{4}$$

where $\left\langle \mathbf{y}_g \right\rangle$ is the mean of the observations within group *g*, obtained as

$$\left\langle \mathbf{y}_g \right\rangle = \sum_{j=1}^{n_g} \mathbf{y}_{gj} \Big/ n_g$$

We now introduce the between-groups sum of squares $\mathbf{B}$ in equation 5 (a measure of the dispersion among the groups) and the within-groups sum of squares $\mathbf{E}$ in equation 6 (a measure of the dispersion within each group). First, we define them for the uni-dimensional case, relative to the untransformed data:

$$\mathbf{B}(y) = \sum_{g=1}^{G} n_g \left( \left\langle y_g \right\rangle - \left\langle y \right\rangle \right)^2 \tag{5}$$



and

$$\mathbf{E}(y) = \sum_{g=1}^{G} \sum_{j=1}^{n_g} \left( y_{gj} - \left\langle y_g \right\rangle \right)^2 \tag{6}$$

where $\left\langle y \right\rangle = \frac{1}{N} \sum_{g=1}^{G} \sum_{j=1}^{n_g} y_{gj}$, with $N = \sum_{g=1}^{G} n_g$ the total number of observations, is the total average of the data.

Analogously, in the multivariate case (where each observation consists of *m* variables), we have the two matrices:

$$\mathbf{B}(\mathbf{y}) = \sum_{g=1}^{G} n_g \left( \left\langle \mathbf{y}_g \right\rangle - \left\langle \mathbf{y} \right\rangle \right) \left( \left\langle \mathbf{y}_g \right\rangle - \left\langle \mathbf{y} \right\rangle \right)^T \tag{7}$$

and

$$\mathbf{E}(\mathbf{y}) = \sum_{g=1}^{G} \sum_{j=1}^{n_g} \left( \mathbf{y}_{gj} - \left\langle \mathbf{y}_g \right\rangle \right) \left( \mathbf{y}_{gj} - \left\langle \mathbf{y}_g \right\rangle \right)^T \tag{8}$$

Finding the optimal linear combination that separates our multivariate observations into the *G* groups means finding the vector $\mathbf{w}$ that maximizes the ratio of the between-groups sum of squares to the within-groups sum of squares. Substituting the transformed data (equations 3 and 4) into equations 7 and 8, we can write:

$$\lambda = \frac{\mathbf{w}^T \mathbf{B}(\mathbf{y}) \mathbf{w}}{\mathbf{w}^T \mathbf{E}(\mathbf{y}) \mathbf{w}} = \frac{\mathbf{B}(z)}{\mathbf{E}(z)} \tag{9}$$

We want to find *w* such that λ is maximized.

Equation 9 can be rewritten in the form $\mathbf{w}^T \left( \mathbf{B} - \lambda \mathbf{E} \right) \mathbf{w} = 0$; we then search for all the non-trivial solutions of this equation ($\mathbf{w} = \mathbf{0}$ is excluded) and choose the one which gives the maximum value of λ. This means solving the eigenvalue problem $\left( \mathbf{B} - \lambda \mathbf{E} \right) \mathbf{w} = 0$, which can be written in the usual form:

$$(\mathbf{A} - \lambda \mathbf{I})\mathbf{w} = 0\tag{10}$$

where $\mathbf{A} = \mathbf{E}^{-1}\mathbf{B}$.

The solutions of equation 10 are the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ associated with the eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_m$. The solutions are ranked by their eigenvalues, $\lambda_1 > \lambda_2 > \ldots > \lambda_m$. Hence, the first eigenvalue $\lambda_1$ corresponds to the maximum value of equation 9.

The discriminant functions are then obtained considering only the first *s* positive eigenvalues and multiplying the original data by the eigenvectors: $\mathbf{z}_1 = \mathbf{w}_1^T \mathbf{Y},\ \mathbf{z}_2 = \mathbf{w}_2^T \mathbf{Y},\ \ldots,\ \mathbf{z}_s = \mathbf{w}_s^T \mathbf{Y}$.

Discriminant functions are uncorrelated but not orthogonal, since the matrix $\mathbf{A} = \mathbf{E}^{-1}\mathbf{B}$ is not symmetric.
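A sketch of equations 5-10 in code, assuming observations in the rows of `Y` and an integer label array `groups` (names are illustrative; `scipy.linalg.eig` solves the generalized problem $\mathbf{B}\mathbf{w} = \lambda\mathbf{E}\mathbf{w}$ directly, avoiding the explicit inverse $\mathbf{E}^{-1}\mathbf{B}$):

```python
import numpy as np
from scipy.linalg import eig

def lda_directions(Y, groups):
    """Discriminant directions from the generalized eigenproblem B w = lambda E w."""
    m = Y.shape[1]
    grand_mean = Y.mean(axis=0)
    B = np.zeros((m, m))                       # between-groups matrix, eq. 7
    E = np.zeros((m, m))                       # within-groups matrix, eq. 8
    for g in np.unique(groups):
        Yg = Y[groups == g]
        d = (Yg.mean(axis=0) - grand_mean)[:, None]
        B += len(Yg) * (d @ d.T)
        R = Yg - Yg.mean(axis=0)
        E += R.T @ R
    # E must be invertible (roughly N - G >= m); otherwise reduce the
    # variables first, e.g. by PCA (see the PCA-LDA section below).
    evals, evecs = eig(B, E)                   # generalized eigenproblem
    order = np.argsort(evals.real)[::-1]       # rank lambda_1 > lambda_2 > ...
    return evals.real[order], evecs.real[:, order]

# At most G - 1 eigenvalues are nonzero, so only the first s <= G - 1
# discriminant functions are meaningful.
```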

In many cases the first two or three discriminant functions account for most of the total $\lambda_1 + \lambda_2 + \ldots + \lambda_s$. This allows us to represent the multivariate observations as 2- or 3-dimensional points on a scatter plot. Such plots are particularly helpful to visualize the separation of the observations into the different groups. Moreover, by looking at the scatter plot we can deduce the meaning of a given discriminant function, i.e. we can associate the discriminant function with a given property of the analyzed system.

The weighting vectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_s$ are called unstandardized discriminant function coefficients and give the weight associated with each variable in every discriminant function.

If the variables are on very different scales and have different variances, the standardized discriminant functions can be used to assess the importance of each variable in the group separation. The standardization is done by multiplying the unstandardized coefficients by the square root of the corresponding diagonal element of the within-group covariance matrix.

Another way to assess the variable importance is to look at the correlation between each variable and the discriminant function. These correlations are called structure or loading coefficients. However, it has been shown that these parameters are intrinsically univariate and they only show how a single variable contributes to the separation among groups, without taking into account the presence of the other variables [49].

#### *4.3.1.2. Linear DA (LDA) as a classification method*


Once a set of discriminant functions has been calibrated as described in the previous section, discriminant analysis can be applied to classify new observations into the most probable groups. From this point of view, linear discriminant analysis becomes a predictive tool, since it is able to classify observations whose group membership is unknown [43, 49]. The discrimination ability of our LDA model can be tested by a procedure called "re-substitution" [49]. This method consists of producing an LDA model using our dataset (i.e. finding the optimal $\mathbf{w}$); then, each observation vector is re-submitted to the classification function ($z_{gj} = \mathbf{w}^T \mathbf{y}_{gj}$) and assigned to a group. Since we know the group membership of each submitted vector, we can count the number of observations correctly classified and the number misclassified, and estimate the classification rate as the number of correctly classified observations over the total number of observations. In general, in evaluating the accuracy of a model, we have to distinguish between two types of accuracy: the fitting accuracy and the prediction accuracy [43, 54].


The fitting accuracy is the ability to reproduce the data, namely how well the model reproduces the data that were used to build it (the training set). This corresponds to the apparent classification rate and is obtained using the re-substitution procedure.

The prediction accuracy is the ability to predict the value or the class of an observation that was not included in the construction of the model. This kind of accuracy is often referred to as the ability of the model to generalize, and the data used to measure it are called the "test set". The prediction accuracy can be called the "actual classification rate". It is mainly relevant in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. To estimate the actual classification rate, two main procedures can be applied: the hold-out and cross-validation [43].

In the hold-out, the dataset is divided into two partitions: one partition is used to develop the model (e.g. the discriminant functions) and the second partition is given as input to the model. The first partition is usually called the "training set" or "calibration set", while the second partition is the validation set [54].

When the number of observations is small, cross-validation is usually preferred over the hold-out. The basic idea of the cross-validation procedure is to divide the entire dataset into *L* disjoint sets: *L-1* sets are used to develop the model (i.e. the calibration set on which the discriminant functions are computed) and the omitted portion is used to test the model (i.e. the validation set given as input to the model). This is repeated for all the *L* sets and an average result is obtained.
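A minimal sketch of both procedures, using scikit-learn's LDA implementation (the arrays `Y` and `groups` and all parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(3)
Y = rng.normal(size=(60, 20))                  # observations x variables
groups = np.repeat([0, 1], 30)                 # known group membership
Y[groups == 1] += 0.8                          # shift group 2 so the classes differ

lda = LinearDiscriminantAnalysis()

# Fitting accuracy (re-substitution): the model classifies its own training set
fit_acc = lda.fit(Y, groups).score(Y, groups)

# Prediction accuracy: L-fold cross-validation (here L = 5)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(lda, Y, groups, cv=cv)   # each fold predicted by a model
cv_acc = np.mean(pred == groups)                  # ... trained on the other folds
print(fit_acc, cv_acc)                            # fitting >= prediction, typically
```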

Apparent or actual classification accuracies can be summarized in a confusion matrix. As an example, consider *N* observations in total, of which $n_1$ belong to group 1 and $n_2$ to group 2. $C_{11}$ is the total number of observations correctly classified in group 1 and $C_{12}$ is the number of group-1 observations misclassified in group 2; similarly, $C_{22}$ is the total number of observations correctly classified in group 2 and $C_{21}$ is the number of group-2 observations misclassified in group 1.

The confusion matrix then becomes:

| | Classified in group 1 | Classified in group 2 |
|---|---|---|
| Actual group 1 ($n_1$) | $C_{11}$ | $C_{12}$ |
| Actual group 2 ($n_2$) | $C_{21}$ | $C_{22}$ |

and the accuracy (the actual or apparent classification rate, acr) is computed as:

$$acr = \frac{C_{11} + C_{22}}{n_1 + n_2}$$
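Continuing the sketch above, the confusion matrix and the classification rate can be tabulated from any pair of true and predicted label arrays (names illustrative):

```python
import numpy as np

def confusion_and_acr(true_groups, predicted_groups):
    labels = np.unique(true_groups)
    # C[i, j]: observations of actual group i classified into group j
    C = np.array([[np.sum((true_groups == a) & (predicted_groups == b))
                   for b in labels] for a in labels])
    acr = np.trace(C) / C.sum()        # (C11 + C22 + ...) / N
    return C, acr
```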

#### *4.3.2. PCA-LDA*

A powerful analysis tool is the combination of principal component analysis with linear discriminant analysis [52]. This is particularly helpful when the number of variables is large. In particular, if the number of observations (*N*) is less than the number of variables (*m*), specifically *N-1 < m*, the covariance matrix is singular and cannot be inverted (see section 4.2.3). We then need a way to reduce the number of variables, for example using PCA [49, 55]. This procedure has been widely used for several problems in different fields [35, 52, 56-60]. The condition *N-1 < m* almost always occurs in spectroscopy, where the number of observations (*N*) is usually of the order of 10<sup>2</sup> while the number of variables (*m*) is typically within 10<sup>2</sup> to 10<sup>3</sup>.

Let us consider the same situation described for the many-group linear discriminant analysis. The original dataset is an ensemble of multivariate observations partitioned into *G* distinct groups. Again, we want to find the discriminant functions which optimally separate our multivariate observations into the *G* groups; the discriminant functions can then be used to identify the most important variables in terms of their ability to distinguish among the groups. Thus, the original dataset is first submitted to PCA to reduce the number of variables; subsequently, the reduced dataset is analyzed using LDA.
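A sketch of the PCA-LDA chain with scikit-learn (illustrative names; the number of retained components is an assumption and should in practice be chosen with the criteria of section 4.2.3.1):

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# With, e.g., N = 60 spectra and m = 1000 variables, LDA alone fails (N - 1 < m):
# PCA first replaces the variables by a few uncorrelated scores, then LDA runs on those.
pca_lda = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
# pca_lda.fit(Y_train, groups_train)
# predicted = pca_lda.predict(Y_new)
```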

#### *4.3.3. PLS-LDA*

Another way, which can be used instead of PCA, is to perform PLS. In a way analogous to the PCA-LDA procedure, here we first apply the PLS algorithm to the original data and then LDA on the selected components [61].

Given that PLS searches for a set of components that performs a simultaneous decomposition of the dependent and independent datasets, the main difference with PCA-LDA is that the components resulting from PLS better describe the relationship between independent and dependent variables. This does not necessarily mean that this method is better in general. Indeed, applying PCA or PLS to the same dataset often leads to similar results [62, 63], and the classification accuracy or the descriptive ability is mostly determined by the underlying structure of the data, which can make one of the two methods more suitable than the other.
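One common way to implement this (an assumption, not necessarily the procedure of [61]) is to encode the group membership as a dummy matrix, extract the PLS scores, and feed them to LDA; a sketch with illustrative names:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pls_lda_fit(Y, groups, n_components=10):
    """Fit PLS on group-membership dummies, then LDA on the PLS scores."""
    # One-hot dummy matrix for the groups (assumes integer labels 0..G-1)
    Z = np.eye(int(groups.max()) + 1)[groups]
    pls = PLSRegression(n_components=n_components).fit(Y, Z)
    scores = pls.transform(Y)                  # PLS components of Y
    lda = LinearDiscriminantAnalysis().fit(scores, groups)
    return pls, lda

def pls_lda_predict(pls, lda, Y_new):
    return lda.predict(pls.transform(Y_new))
```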

#### *4.3.4. Cluster Analysis (CA)*

The goal of cluster analysis is to find the best grouping of the multivariate observations, such that the clusters are dissimilar to each other while the observations within a cluster are similar [44].

CA is an unsupervised technique, that is, the group membership of the observations (and often the number of groups) is not known in advance.
Multivariate Analysis for Fourier Transform Infrared Spectra of Complex Biological Systems and Processes 203
