## **Model Population Analysis for Statistical Model Comparison**

Hong-Dong Li¹, Yi-Zeng Liang¹ and Qing-Song Xu²
*¹College of Chemistry and Chemical Engineering, Central South University, Changsha, P. R. China*
*²School of Mathematic Sciences, Central South University, Changsha, P. R. China*

#### **1. Introduction**

Model comparison plays a central role in statistical learning and chemometrics. The performance of a model needs to be assessed using a given criterion, on the basis of which different models can be compared. To our knowledge, a variety of criteria can be applied for model assessment, such as Akaike's information criterion (AIC) [1], the Bayesian information criterion (BIC) [2], the deviance information criterion (DIC), Mallows' Cp statistic, cross validation [3-6] and so on. There is a large body of literature devoted to these criteria. With the aid of a chosen criterion, different models can be compared. For example, a model with a smaller AIC or BIC is preferred if AIC or BIC is chosen for model assessment.

In chemometrics, model comparison is usually conducted by validating different models on an independent test set or by using cross validation [4, 5, 7], resulting in a single value, *i.e.* the root mean squared error of prediction (RMSEP) or the root mean squared error of cross validation (RMSECV). This single metric depends heavily on the selection of the independent test set (RMSEP) or the partition of the training data (RMSECV). Therefore, we have reason to say that this kind of comparison lacks statistical assessment and also runs the risk of drawing wrong conclusions. We recently proposed model population analysis (MPA) as a general framework for designing chemometrics/bioinformatics methods [8]. MPA has been shown to be promising in outlier detection and variable selection. Here we hypothesize that reliable statistical model comparison can be achieved via the use of model population analysis.

### **2. Model population analysis**

#### **2.1 The framework of model population analysis**

Model population analysis was recently proposed in our previous work for developing chemometrics methods [8]. As shown in **Figure 1**, MPA works in three steps, which are summarized as (1) randomly generating N sub-datasets using Monte Carlo sampling, (2) building one sub-model on each sub-dataset and (3) statistically analyzing some interesting output of all the N sub-models.


Fig. 1. The schematic of MPA. MCS is the abbreviation of Monte Carlo Sampling.

#### **2.1.1 Monte Carlo sampling for generating a sub-dataset**

Sampling plays a key role in statistics, as it allows us to generate replicate sub-datasets from which an unknown parameter of interest can be estimated. For a given dataset (**X**, **y**), it is assumed that the design matrix **X** contains *m* samples in rows and *p* variables in columns, and that the response vector **y** is of size *m*×1. The number of Monte Carlo samplings is set to N. In this setting, N sub-datasets can be drawn by Monte Carlo sampling with or without replacement [9, 10], and are denoted as (**X**sub, **y**sub)i, i = 1, 2, 3, …, N.
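For concreteness, the following is a minimal sketch of this sampling step (Python with NumPy is assumed; the helper name `monte_carlo_subsets` is ours and only illustrative, not from the referenced code):

```python
import numpy as np

def monte_carlo_subsets(X, y, N=1000, ratio=0.7, replace=False, seed=0):
    """Yield N Monte Carlo sub-datasets (X_sub, y_sub)_i drawn from (X, y).

    Each sub-dataset holds round(ratio * m) samples, drawn with or
    without replacement as discussed in the text.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_sub = int(round(ratio * m))
    for _ in range(N):
        idx = rng.choice(m, size=n_sub, replace=replace)
        yield X[idx], y[idx]
```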

#### **2.1.2 Establishing a sub-model using each sub-dataset**

For each sub-dataset (**X**sub, **y**sub)i, a sub-model can be constructed using a selected method, *e.g.* partial least squares (PLS) [11] or support vector machines (SVM) [12]. Denote the sub-model established as *f*i(**X**). Then, all these sub-models can be put into a collection:

$$\mathbf{C} = (f_1(\mathbf{X}), f_2(\mathbf{X}), f_3(\mathbf{X}), \dots, f_N(\mathbf{X})) \tag{1}$$
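A corresponding sketch for this step and Eq. (1), reusing the `monte_carlo_subsets` helper above and taking scikit-learn's `PLSRegression` as the (assumed) modelling method:

```python
from sklearn.cross_decomposition import PLSRegression

def build_sub_models(X, y, N=1000, ratio=0.7, n_components=5):
    """Fit one sub-model per sub-dataset and collect them into
    C = (f_1(X), f_2(X), ..., f_N(X)) as in Eq. (1)."""
    C = []
    for X_sub, y_sub in monte_carlo_subsets(X, y, N=N, ratio=ratio):
        C.append(PLSRegression(n_components=n_components).fit(X_sub, y_sub))
    return C
```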


All these N sub-models are mutually different but share the same goal: to predict the response value **y**.

#### **2.1.3 Statistically analyzing an interesting output of all the sub-models**

The core of model population analysis is the statistical analysis of an interesting output, *e.g.* prediction errors or regression coefficients, of all these sub-models. Indeed, it is difficult to give a general answer on what output should be analyzed and how the analysis should be done; different designs for the analysis will lead to different algorithms. As a proof of principle, it was shown in our previous work that the analysis of the distribution of prediction errors is effective in outlier detection [13].
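As a simplified illustration of this third step (a sketch only, not the exact MC method of reference [13]), one can record, for every sample, the prediction errors it receives from all sub-models in which it was left out of training, and then inspect the resulting error distributions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def prediction_error_populations(X, y, N=1000, ratio=0.7, n_components=5, seed=0):
    """For every sample, collect its prediction errors over all sub-models
    in which it was left out of training; samples whose error distribution
    is clearly shifted are candidate outliers."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_sub = int(round(ratio * m))
    errors = [[] for _ in range(m)]
    for _ in range(N):
        idx = rng.choice(m, size=n_sub, replace=False)
        out = np.setdiff1d(np.arange(m), idx)
        model = PLSRegression(n_components=n_components).fit(X[idx], y[idx])
        resid = y[out] - model.predict(X[out]).ravel()
        for j, r in zip(out, resid):
            errors[j].append(r)
    return [np.asarray(e) for e in errors]
```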

#### **2.2 Insights provided by model population analysis**

As described above, Monte Carlo sampling serves as the basis of model population analysis, helping to generate distributions of the parameters one would like to analyze. On the surface, it seems natural and easy to generate distributions using Monte Carlo sampling. However, here we show by examples that the distributions provided by model population analysis can indeed provide very useful information that gives insights into the data under investigation.

#### **2.2.1 Are there any outliers?**

Two datasets are first simulated. The first contains only normal samples, whereas there are 3 outliers in the second dataset; they are shown in Plot A and Plot B of **Figure 2**, respectively. For each dataset, a percentage (70%) of the samples is randomly selected to build a linear regression model, of which the slope and intercept are recorded. Repeating this procedure 1000 times, we obtain 1000 values of both the slope and the intercept. For both datasets, the intercept is plotted against the slope, as displayed in Plot C and Plot D, respectively. It can be observed that the joint distribution of the intercept and slope for the normal dataset appears to be multivariate normally distributed. In contrast, this distribution for the dataset with outliers looks quite different, far from a normal distribution. Specifically, the distributions of the slopes for both datasets are shown in Plot E and Plot F. These results show that the existence of outliers can greatly influence a regression model, which is reflected by the odd distributions of both slopes and intercepts. Conversely, a distribution of a model parameter that is far from normal would, most likely, indicate some abnormality in the data.
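This simulation can be reproduced along the following lines (a NumPy sketch; the simulated numbers are illustrative only and are not the exact data of Figure 2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative univariate data: y = 2x + 1 + noise, plus a version with 3 gross outliers.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)
x_out = np.append(x, [8.0, 8.5, 9.0])
y_out = np.append(y, [0.0, -1.0, 0.5])        # points far below the trend line

def slope_intercept_population(x, y, N=1000, ratio=0.7):
    """Fit a straight line on N random 70% sub-datasets; return the
    populations of slopes and intercepts."""
    m = len(x)
    slopes, intercepts = np.empty(N), np.empty(N)
    for i in range(N):
        idx = rng.choice(m, size=int(ratio * m), replace=False)
        slopes[i], intercepts[i] = np.polyfit(x[idx], y[idx], deg=1)
    return slopes, intercepts

s_clean, b_clean = slope_intercept_population(x, y)          # roughly normal
s_dirty, b_dirty = slope_intercept_population(x_out, y_out)  # distorted by the outliers
```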

#### **2.2.2 Are there any interfering variables?**

In this study, we first simulate a design matrix **X** of size 50 × 10; the response variable **Y** is simulated by multiplying **X** with a 10-dimensional regression vector. Gaussian noise with standard deviation equal to 1 is then added to **Y**. That is to say, all the variables in **X** are "true variables" that collectively predict **Y**. This dataset (**X**, **Y**) is denoted SIMUTRUE. Then another design matrix **F** of size 50 × 10 is simulated. Denote the combination of **X** and **F** as **Z** = [**X F**]. This dataset (**Z**, **Y**) is called SIMUINTF, and it contains variables that are not predictive of **Y**. For both datasets, we randomly choose 70% of samples to build a regression model, which is then used to make predictions on the remaining 30% of samples, resulting in an RMSEP value. Repeating this procedure 1000 times, we obtain, for both datasets, 1000 RMSEP values, of which the distributions are given in **Figure 3**. Clearly, the distribution of RMSEP of SIMUINTF is right-shifted, indicating that the existence of variables that are not predictive of **Y** can degrade the performance of a regression model. We call this kind of variables "interfering variables". Curious readers may ask: for a real-world dataset, can one tell whether it contains interfering variables? Indeed, we can. We can do replicate experiments to estimate the experimental error, which can serve as a reference by which to judge whether interfering variables exist. For example, if a model containing a large number of variables (with true variables included) shows a large prediction error compared to the experimental error, we may predict that interfering variables exist. In this situation, variable selection is encouraged and can greatly improve the performance of a model. Actually, when interfering variables exist, variable selection is a must. Methods that use latent variables, like PCR or PLS, cannot work well on their own because the latent variables have contributions coming from the interfering variables.

Fig. 2. A simulation study illustrating the use of model population analysis to detect whether a dataset contains outliers. Plot A and Plot B show the data simulated without and with outliers, respectively. 1000 linear regression models were computed using 1000 randomly selected sub-datasets, and the slopes and intercepts are presented in Plot C and Plot D. Specifically, the distributions of the slope for these two simulated datasets are displayed in Plot E and Plot F.

Fig. 3. The distribution of RMSEPs using the variable set that contains only "true variables" (upper panel) and the variable set that includes not only "true variables" but also "interfering variables" (lower panel).
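A sketch of this SIMUTRUE/SIMUINTF experiment is given below (NumPy plus scikit-learn's ordinary least squares; the noise level follows the text, while the coefficient values and random seeds are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

X = rng.normal(size=(50, 10))                 # 10 "true" variables
beta = rng.uniform(1, 5, size=10)
Y = X @ beta + rng.normal(0, 1, size=50)      # Gaussian noise, sd = 1  -> SIMUTRUE
F = rng.normal(size=(50, 10))                 # non-predictive variables
Z = np.hstack([X, F])                         # -> SIMUINTF

def rmsep_population(X, Y, N=1000, ratio=0.7):
    """RMSEP over N random 70/30 train/test splits."""
    m = X.shape[0]
    n_train = int(ratio * m)
    out = np.empty(N)
    for i in range(N):
        perm = rng.permutation(m)
        tr, te = perm[:n_train], perm[n_train:]
        pred = LinearRegression().fit(X[tr], Y[tr]).predict(X[te])
        out[i] = np.sqrt(np.mean((Y[te] - pred) ** 2))
    return out

rmsep_true = rmsep_population(X, Y)   # distribution for SIMUTRUE
rmsep_intf = rmsep_population(Z, Y)   # right-shifted distribution for SIMUINTF
```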

#### **2.3 Applications of model population analysis**

Using the idea of model population analysis, we have developed algorithms that address two fundamental issues in chemical modeling: outlier detection and variable selection. For outlier detection, we developed the MC method [13]. For variable selection, we developed subwindow permutation analysis (SPA) [14], noise-incorporated subwindow permutation analysis (NISPA) [15] and margin influence analysis (MIA) [16]. Here, we first give a brief description of these algorithms, aiming at providing examples that could help interested readers understand how to design an algorithm by borrowing the framework of model population analysis.

As can be seen from **Figure 1**, these MPA-based methods share the first two steps, that is, (1) generating N sub-datasets and (2) building N sub-models. The third step, "statistical analysis of an interesting output of all these N sub-models", is the core of model population analysis that underlies the different methods. The key points of these methods, as well as of another method, Monte Carlo uninformative variable elimination (MCUVE), that also implements the idea of MPA, are summarized in Table 1. In a word, the distribution from model population analysis contains abundant information that provides insight into the data analyzed, and by making full use of this information, effective algorithms can be developed for solving a given problem.


| Methods\* | What to statistically analyze |
|---|---|
| MC method | Distribution of prediction errors of each sample |
| SPA | Distribution of prediction errors before and after each variable is permuted |
| NISPA | Distribution of prediction errors before and after each variable is permuted, with one noise variable as reference |
| MIA | Distribution of margins of support vector machine sub-models |
| MCUVE | Distribution of regression coefficients of PLS regression sub-models |

\*: The MC method, SPA, NISPA, MIA and MCUVE are described in references [13], [14], [15], [16] and [27], respectively.

Table 1. Key points of MPA-based methods.

#### **2.4 Model population analysis and Bayesian analysis**

There exist similarities as well as differences between model population analysis and Bayesian analysis. One important similarity is that both methods consider the parameter of interest not as a single number but as a distribution. In model population analysis, we generate distributions by introducing variation in samples and/or variables using Monte Carlo sampling [17]. In contrast, in Bayesian analysis the parameter to infer is first assumed to come from a prior distribution, and the observed data are then used to update this prior distribution into the posterior distribution, from which parameter inference can be conducted and predictions can be made [18-20]. The output of Bayesian analysis is a posterior distribution of some parameter of interest. This posterior distribution provides a natural link between Bayesian analysis and model population analysis. Taking Bayesian linear regression (BLR) [20] as an example, the output can be a large number of regression coefficient vectors sampled from the posterior distribution. These regression coefficient vectors actually represent a population of sub-models that can be used directly for model population analysis. Our future work will be to construct useful algorithms by borrowing merits of both Bayesian analysis and model population analysis.
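To make the link concrete, the following is a minimal conjugate Bayesian linear regression sketch (Gaussian prior, known noise variance; a simplification for illustration, not the BLR implementation of reference [20]) that draws a population of coefficient vectors from the posterior, each of which can be treated as one sub-model:

```python
import numpy as np

def blr_posterior_samples(X, y, n_draws=1000, alpha=1.0, sigma2=1.0, seed=3):
    """Draw regression-coefficient vectors from the Gaussian posterior of a
    conjugate Bayesian linear regression (prior w ~ N(0, I/alpha), known
    noise variance sigma2). Each draw is one 'sub-model' for MPA."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    S_inv = alpha * np.eye(p) + X.T @ X / sigma2      # posterior precision
    S = np.linalg.inv(S_inv)                          # posterior covariance
    mean = S @ X.T @ y / sigma2                       # posterior mean
    return rng.multivariate_normal(mean, S, size=n_draws)
```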


#### **2.5 Model population analysis and ensemble learning**

Ensemble learning methods, such as bagging [21], boosting [22] and random forests [23], have emerged as very promising strategies for building a predictive model, and these methods have found applications in a wide variety of fields. Recently, a new ensemble technique, called feature-subspace aggregating (Feating) [24], was proposed and shown to perform well. The key point of these ensemble methods is aggregating a large number of models built on sub-datasets randomly generated by, for example, bootstrapping. The ensemble model then makes predictions by majority voting for classification or by averaging for regression. In our opinion, the basic idea of ensemble learning methods is the same as that of model population analysis. In this sense, ensemble learning methods can also be formulated within the framework of model population analysis.
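For instance, a bagging-style regressor is essentially the sub-model collection of Eq. (1) with its predictions averaged (a sketch only; scikit-learn's built-in ensemble estimators offer the same idea off the shelf):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def bagged_predict(X_train, y_train, X_new, N=100, n_components=5, seed=4):
    """Average the predictions of N bootstrap sub-models (regression case)."""
    rng = np.random.default_rng(seed)
    m = X_train.shape[0]
    preds = []
    for _ in range(N):
        idx = rng.choice(m, size=m, replace=True)     # bootstrap sample
        model = PLSRegression(n_components=n_components).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_new).ravel())
    return np.mean(preds, axis=0)
```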

#### **3. Model population analysis for statistical model comparison**

Based on model population analysis, here we propose to perform model comparison by deriving an empirical distribution of the difference in RMSEP or RMSECV between two models (variable sets), followed by testing the null hypothesis that the difference in RMSEP or RMSECV between the two models is zero. Without loss of generality, we describe the proposed method by taking the distribution of the difference in RMSEP as an example. We assume that the data **X** consist of *m* samples in rows and *p* variables in columns and that the target value **Y** is an m-dimensional column vector. Two variable sets, say V1 and V2, selected from the *p* variables, can then be compared using the MPA-based method described below.

First, a percentage, say 80%, of the *m* samples is randomly selected, and two regression models are built using the variables in V1 and V2, respectively, with a preselected modeling method such as PLS [11] or support vector machines (SVMs) [12]. Then an RMSEP value can be computed for each model by using the remaining 20% of samples as the test set. Denote the two RMSEP values as RMSEP1 and RMSEP2; their difference can be calculated as

$$D = \mathrm{RMSEP}_1 - \mathrm{RMSEP}_2 \tag{2}$$

By repeating this procedure N times, say N = 1000, N values of D are obtained and collected into a vector **D**. Now, the model comparison can be formulated as a hypothesis test:

Null hypothesis: the mean of **D** is zero.

Alternative hypothesis: the mean of **D** is not zero.

By employing a statistical test, *e.g.* the *t*-test or the Mann-Whitney U test [25], a *P* value can be computed to rigorously assess whether the mean of **D** is significantly different from zero (*P* < 0.05) or not (*P* > 0.05). If *P* < 0.05, the sign of the mean of **D** is then used to determine which model (variable set) has better predictive performance. If *P* > 0.05, we say the two models have the same predictive ability.
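A compact sketch of the whole comparison procedure is given below (scikit-learn PLS as the modelling method, SciPy for the test; the 80/20 split ratio and N follow the text, while the helper names `rmsep` and `mpa_compare` are ours):

```python
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.cross_decomposition import PLSRegression

def rmsep(model, X, y):
    """Root mean squared error of prediction on a test set."""
    return float(np.sqrt(np.mean((y - model.predict(X).ravel()) ** 2)))

def mpa_compare(X, y, V1, V2, N=1000, train_ratio=0.8, n_components=5, seed=5):
    """Distribution of D = RMSEP1 - RMSEP2 over N Monte Carlo splits,
    plus the P value of the one-sample t-test of mean(D) = 0.
    V1 and V2 are arrays of column indices (variable sets)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_train = int(train_ratio * m)
    D = np.empty(N)
    for i in range(N):
        perm = rng.permutation(m)
        tr, te = perm[:n_train], perm[n_train:]
        m1 = PLSRegression(n_components=n_components).fit(X[tr][:, V1], y[tr])
        m2 = PLSRegression(n_components=n_components).fit(X[tr][:, V2], y[tr])
        D[i] = rmsep(m1, X[te][:, V1], y[te]) - rmsep(m2, X[te][:, V2], y[te])
    return D, ttest_1samp(D, 0.0).pvalue
```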

#### **4. Results and discussions**

#### **4.1 Comparison of predictive performances of variable subsets**

The corn NIR data measured on the *mp5* instrument are used to illustrate the proposed method (http://software.eigenvector.com/Data/index.html). This dataset contains NIR spectra measured at 700 wavelengths on 80 corn samples. The original NIR spectra are shown in **Figure 4**. The chemical property modeled here is the content of protein. As has been demonstrated in a large body of literature [26-30], variable selection can improve the predictive performance of a model. Here we would like to investigate whether the gain in predictive accuracy using variable subsets identified by variable selection methods is significant.

Fig. 4. Original near infrared spectra of corn on the mp5 instrument.

Uninformative variable elimination (UVE) is a widely used method for variable selection in chemometrics [26]. Its extended version, Monte Carlo UVE (MCUVE), was recently proposed [27, 31]. Mimicking the principle of "survival of the fittest" in Darwin's theory of evolution, we developed a variable selection method in our previous work, called competitive adaptive reweighted sampling (CARS) [8, 28, 32, 33], which was shown to have the potential to identify an optimal subset of variables with high predictive performance. The source codes of CARS are freely available at [34, 35].

In this study, MCUVE and CARS are chosen to identify two variable sets, named V1 and V2, respectively. The set of the original 700 variables is denoted V0. Before data analysis, each wavelength of the original NIR spectra is standardized to have zero mean and unit variance. Regarding the pretreatment of the spectral data, using the original spectra, mean-centered spectra or standardized spectra would indeed lead to different results, but in our experience the difference is usually not big. The reason why we choose standardization is to remove the influence of each wavelength's variance on PLS modeling, because the decomposition of the spectral data **X** using PLS depends on the magnitude of the covariance between the wavelengths and the target variable **Y**. The number of PLS components is optimized using 5-fold cross validation. For MCUVE, the number of Monte Carlo simulations is set to 1000, and at each simulation 80% of samples are selected randomly to build a calibration model. We use the reliability index (RI) to rank each wavelength, and the number of wavelengths (with a maximum of 200 wavelengths allowed) is identified using 5-fold cross validation. Using MCUVE, 115 wavelengths in 5 bands (1176-1196 nm, 1508-1528 nm, 1686-1696 nm, 1960-2062 nm and 2158-2226 nm) are finally selected and shown in **Figure 5** as red circles. For CARS, the number of iterations is set to 50. Using CARS, altogether 28 variables (1188, 1202, 1204, 1396, 1690, 1692, 1710, 1800, 1870, 2048, 2050, 2052, 2064, 2098, 2102, 2104, 2106, 2108, 2166, 2168, 2238, 2270, 2382, 2402, 2434, 2436, 2468 and 2472 nm) are singled out; these variables are also shown in **Figure 5** as blue dots. Intuitively, MCUVE selects 5 wavelength bands, while the variables selected by CARS are more scattered across different regions. In addition, the Pearson correlations of the variables selected by both methods are shown in **Figure 6**.

Fig. 5. Comparison of selected wavelengths using MC-UVE (red circles) and CARS (blue dots). The green line denotes the mean of the 80 corn NIR spectra.
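As an illustration of how a reliability index of this kind is typically obtained (a simplified sketch in the spirit of MCUVE [27], not the exact published implementation), the RI of each variable can be computed as the mean of its PLS regression coefficients over the Monte Carlo sub-models divided by their standard deviation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mcuve_reliability(X, y, N=1000, ratio=0.8, n_components=5, seed=6):
    """Reliability index RI_j = |mean(b_j)| / std(b_j) of each variable's
    PLS regression coefficient over N Monte Carlo sub-models."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    B = np.empty((N, p))
    for i in range(N):
        idx = rng.choice(m, size=int(ratio * m), replace=False)
        pls = PLSRegression(n_components=n_components).fit(X[idx], y[idx])
        B[i] = pls.coef_.ravel()
    return np.abs(B.mean(axis=0)) / B.std(axis=0)
```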


Fig. 6. The Pearson pair-wise correlations of variables selected using MCUVE (115 variables, left) and CARS (28 variables, right).

We choose PLS for building the regression models. For the MPA-based method for model comparison, the number of Monte Carlo simulations is set to 1000, and at each simulation 60% of samples are randomly selected as training samples while the remaining 40% serve as test samples. The number of PLS components is chosen based on 5-fold cross validation. In this setting, we first calculated 1000 values of RMSEP0, RMSEP1 and RMSEP2 using V0, V1 and V2, respectively. The distributions of RMSEP0, RMSEP1 and RMSEP2 are shown in **Figure 7**. The means and standard deviations of these three distributions are 0.169±0.025 (full spectra), 0.147±0.018 (MCUVE) and 0.108±0.015 (CARS). On the whole, both variable selection methods improve the predictive performance in terms of lower prediction errors and smaller standard deviations. Looking closely, the model selected by CARS has a smaller standard deviation than that of MCUVE. The reason may be that CARS selects individual wavelengths, and these wavelengths display lower correlations (see **Figure 6**) than the wavelength bands selected by MCUVE. The lower correlation results in better model stability, which is reflected by smaller standard deviations of the prediction errors. Therefore, from the perspective of prediction ability, we recommend adopting methods that select individual wavelengths rather than continuous wavelength bands.
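Using the `mpa_compare` sketch from Section 3 with the 60/40 split employed here, the three pairwise comparisons of this section would look as follows (a hypothetical usage example; `X`, `y` and the index sets `V0`, `V1`, `V2` are assumed to be already defined):

```python
# X: 80 x 700 standardized corn spectra, y: protein content.
D_mcuve_vs_full, p1 = mpa_compare(X, y, V1, V0, N=1000, train_ratio=0.6)
D_cars_vs_full,  p2 = mpa_compare(X, y, V2, V0, N=1000, train_ratio=0.6)
D_cars_vs_mcuve, p3 = mpa_compare(X, y, V2, V1, N=1000, train_ratio=0.6)
```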

Fig. 7. Distributions of root mean squared errors of prediction (RMSEP) from 1000 test sets (32 samples) randomly selected from the 80 corn samples using the full spectra and the variables selected by MCUVE and CARS, respectively.

Firstly, we compare the performance of the model selected by MCUVE to the full-spectrum model. The distribution of D values (MCUVE – full spectra) is shown in Plot A of **Figure 8**. The mean of D is -0.023 and is shown to be different from zero (*P* < 0.000001) using a two-sided *t*-test, indicating that MCUVE significantly improves the predictive performance. Of particular note, 83.1% of the D values are negative and the remaining 16.9% are positive, which indicates that model comparison based on a single split of the data into a training set and a corresponding test set carries the risk of drawing a wrong conclusion. In this case, the probability of concluding that MCUVE does not improve predictive performance is about 0.169. However, this problem can be solved by the proposed MPA-based method, because the model performance is tested on a large number of sub-datasets, rendering the current method potentially useful for reliable statistical model comparison. With our method, we have evidence showing that the improvement resulting from MCUVE is significant.

Fig. 8. The distributions of D values. The *P* values of the *t*-test for these three distributions are 8.36×10⁻¹²⁰, 0 and 0, respectively.


Further, the performance of the model selected by CARS is compared to the full-spectrum model. The distribution of D values (CARS – full spectra) is shown in Plot B of **Figure 8**. The mean of D is -0.061, which is much smaller than that from MCUVE (-0.023). Using a two-sided *t*-test, this mean is shown to be significantly different from zero (*P* = 0), indicating that the improvement over the full-spectrum model is significant. Interestingly, it is found that all the D values are negative, which implies that the model selected by CARS is highly predictive and that there is little evidence to recommend the use of a full-spectrum model, at least for this dataset.

Finally, we compare the models selected by MCUVE and CARS. The distribution of D values (CARS – MCUVE) is shown in Plot C of **Figure 8**. The mean of the D values is -0.039. Using a two-sided *t*-test, this mean is shown to be significantly different from zero (*P* = 0), indicating that the improvement of CARS over MCUVE is significant. We find that 98.9% of the D values are negative and only 1.1% are positive, which suggests that there is a small probability of drawing the wrong conclusion that MCUVE performs better than CARS. With the help of MPA, however, such a risky conclusion can indeed be avoided.

Summing up, we have conducted a statistical comparison of the full-spectrum model and the models selected by MCUVE and CARS based on the distribution of D values calculated using RMSEP. Our results show that model comparison based on a single split of the data into a training set and a corresponding test set may result in a wrong conclusion, whereas the proposed MPA approach can avoid drawing such a wrong conclusion, thus providing a solution to this problem.

#### **4.2 Comparison of PCR, PLS and an ECR model**

In chemometrics, PCR and PLS are perhaps the most widely used methods for building a calibration model. Recently, we developed a method called elastic component regression (ECR), which utilizes a tuning parameter α∈[0,1] to supervise the decomposition of the **X** matrix [36] and which falls into the category of continuum regression [37-40]. It has been demonstrated theoretically that the elastic components resulting from ECR coincide with the principal components of PCA when α = 0 and with the PLS components when α = 1. In this context, PCR and PLS occupy the two ends of ECR, and α∈(0,1) leads to an infinite number of transitional models which collectively uncover the model path from PCR to PLS. The source codes implementing ECR in MATLAB are freely available at [41]. In this section, we would like to compare the predictive performance of PCR, PLS and an ECR model with α = 0.5.

We still use the corn protein data described in Section 4.1. Here we do not consider all the variables but only the 28 wavelengths selected by CARS. For the proposed method, the number of Monte Carlo simulations is set to 1000. At each simulation 60% samples selected randomly are used as training samples and the remaining serve as test samples. The number of latent variables (LVs) for PCR, PLS and ECR (α = 0.5) is chosen using 5-fold cross validation.
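A sketch of this kind of comparison for the two end members of the ECR path is given below (PCR as PCA scores followed by ordinary least squares, PLS via scikit-learn; ECR itself is not reimplemented here, and the variable names `X28` and `y` for the 28 CARS wavelengths and the protein content are assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def rmsep_distribution(make_model, X, y, N=1000, train_ratio=0.6, seed=7):
    """RMSEP over N Monte Carlo 60/40 splits for a given model factory."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_train = int(train_ratio * m)
    out = np.empty(N)
    for i in range(N):
        perm = rng.permutation(m)
        tr, te = perm[:n_train], perm[n_train:]
        pred = np.asarray(make_model().fit(X[tr], y[tr]).predict(X[te])).ravel()
        out[i] = np.sqrt(np.mean((y[te] - pred) ** 2))
    return out

# PCR = PCA scores followed by least squares; PLS via PLSRegression.
rmsep_pcr = rmsep_distribution(lambda: make_pipeline(PCA(n_components=5), LinearRegression()), X28, y)
rmsep_pls = rmsep_distribution(lambda: PLSRegression(n_components=5), X28, y)
```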

**Figure 9** shows the three distributions of RMSEP computed using PCR, PLS and ECR (α = 0.5). The means and standard deviations of these distributions are 0.1069±0.0140, 0.1028±0.0111 and 0.0764±0.0108, respectively. Obviously, PLS achieves the lowest prediction errors as well as the smallest standard deviations. In contrast, PCR performs the worst. As a transitional model between PCR and PLS, ECR with α = 0.5 achieves an intermediate level of performance.

Fig. 9. The distributions of RMSEP from PCR, PLS and an ECR model with α = 0.5.

Fig. 10. The distributions of D values. The *P* values of the *t*-test for these three distributions are 0, 0 and 0, respectively.


The distributions of the D values are displayed in **Figure 10**. The means of these three distributions are -0.0041 (Plot A), -0.0305 (Plot B) and -0.0264 (Plot C), respectively. Using a two-sided *t*-test, it is shown that all three distributions of D values have a mean that is significantly different from zero, with *P* values equal to 0, 0 and 0 for Plot A, Plot B and Plot C, respectively. To conclude, this section provides illustrative examples of the comparison of different modeling methods. Our example demonstrates that PLS (an ECR model associated with α = 1) performs better than PCR (an ECR model associated with α = 0), and that a specific transitional ECR model associated with α = 0.5 shows intermediate performance.

#### **4.3 Comparison of PLS-LDA models before and after variable selection**

Partial least squares-linear discriminant analysis (PLS-LDA) is frequently used in chemometrics and metabolomics/metabonomics for building predictive classification models and/or for biomarker discovery [32, 42-45]. With the development of modern high-throughput analytical instruments, the data generated often contain a large number of variables (wavelengths, m/z ratios, etc.). Most of these variables are not relevant to the problem under investigation. Moreover, a model constructed using this kind of data, which contain irrelevant variables, is not likely to have good predictive performance. Variable selection provides a solution to this problem: it can help select a small number of informative variables that may be more predictive than an all-variable model.

In the present work, two methods are chosen to conduct variable selection. The first is the t-test, a simple univariate method that determines whether two samples from normal distributions could have the same mean when the standard deviations are unknown but assumed to be equal. The second is subwindow permutation analysis (SPA), a model population analysis-based approach proposed in our previous work [14]. The main characteristic of SPA is that it outputs a conditional P value by implicitly taking into account synergistic effects among multiple variables. With this conditional P value, important variables or conditionally important variables can be identified. The source codes in Matlab and R are freely available at [46]. We apply these two methods to a type 2 diabetes mellitus dataset that contains 90 samples (45 healthy and 45 cases), each of which is characterized by 21 metabolites measured using a GC/MS instrument. Details of this dataset can be found in reference [32].

Using the t-test, 13 out of the 21 variables are identified as significant (P < 0.01). For SPA, we use the same settings as described in our previous work [14]; three variables are selected with the aid of SPA. Let V0, V1 and V2 denote the sets containing all 21 variables, the 13 variables selected by the t-test and the 3 variables selected by SPA, respectively. To run the proposed method, we set the number of Monte Carlo simulations to 1000. At each simulation, 70% of samples are randomly selected to build a PLS-LDA model, with the number of latent variables optimized by 10-fold cross validation. The remaining 30% of samples serve as test sets, on which the misclassification error is computed.
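A sketch of this error estimation is given below (PLS scores followed by LDA as a stand-in for PLS-LDA, with 70/30 Monte Carlo splits as in the text; the function name and default parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def plslda_error_distribution(X, y, N=1000, train_ratio=0.7, n_components=3, seed=8):
    """Misclassification error over N Monte Carlo 70/30 splits; y in {0, 1}."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    n_train = int(train_ratio * m)
    err = np.empty(N)
    for i in range(N):
        perm = rng.permutation(m)
        tr, te = perm[:n_train], perm[n_train:]
        pls = PLSRegression(n_components=n_components).fit(X[tr], y[tr])
        lda = LinearDiscriminantAnalysis().fit(pls.transform(X[tr]), y[tr])
        pred = lda.predict(pls.transform(X[te]))
        err[i] = np.mean(pred != y[te])
    return err
```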

**Figure 11** shows the distributions of misclassification errors computed using these three variable sets. The means and standard deviations of these distributions are 0.065±0.048 (all variables), 0.042±0.037 (t-test) and 0.034±0.034 (SPA), respectively. It can be seen that the models using selected variables have lower prediction errors as well as higher stability in terms of smaller standard deviations, indicating that variable selection can improve the performance of a classification model. The reason why SPA performs better than the t-test is that synergistic effects among multiple variables are implicitly taken into account in SPA, while the t-test only considers univariate associations.

Fig. 11. The distributions of misclassification error on 1000 test sets using all variables and the variables selected by the t-test and SPA, respectively.

Fig. 12. The distributions of D values. The P values of the t-test for these three distributions are 1.66×10⁻⁴⁶, 1.02×10⁻⁵⁷ and 1.27×10⁻⁸.


We conducted a pair-wise comparison of the performances of the three variable sets described above. The distribution of D values (t-test – all variables) is shown in Plot A of **Figure 12**. The mean of D is -0.023 and is shown to be significantly different from zero (*P* = 0) by a two-sided t-test, confirming the improvement brought by variable selection. In spite of this improvement, we should also notice that a fraction (17.3%) of the D values is positive, which again implies that model comparison based on a single split of the data into a training set and a corresponding test set is risky. With the aid of this MPA-based approach, however, different models can be compared reliably in a statistical manner.
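As a sketch of this pairwise procedure, assuming the two models have been evaluated on the same 1000 Monte Carlo splits (so that their errors are paired), the D values and the two-sided one-sample t-test can be computed as follows; the function name and the sign convention (first model minus second model) are ours.

```python
import numpy as np
from scipy import stats


def compare_models(errs_a, errs_b):
    """Paired comparison of two error distributions from identical MC splits.

    errs_a, errs_b : misclassification errors of the two models, one value
    per Monte Carlo test set (e.g. from mpa_error_distribution with the
    same seed).
    """
    d = np.asarray(errs_a) - np.asarray(errs_b)    # D values, one per split
    t_stat, p_value = stats.ttest_1samp(d, 0.0)    # two-sided test of mean(D) = 0
    frac_positive = np.mean(d > 0)                 # share of splits where model A is worse
    return d.mean(), p_value, frac_positive
```

With `errs_a` computed on the t-test variable set and `errs_b` on the all-variable set, `d` would correspond to the "t-test – all variables" D values of Plot A.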

The distribution of D values (SPA – all variables) is shown in Plot B of **Figure 12**. The mean of D is -0.031 and is shown to be significantly different from zero (*P* = 0). Also, 171 of the 1000 D values are positive, again indicating the necessity of using a population of models for model comparison. Similarly, Plot C in **Figure 12** displays the distribution of D values (SPA – t-test). Applying a two-sided t-test, we find that the improvement of SPA over the t-test is significant (*P* = 0). For this distribution, 22.7% of the D values are positive, indicating that, on a single random split of the data, the t-test has a 22.7% chance of performing better than SPA. Based on this large-scale comparison, however, the overall performance of SPA is statistically better than that of the t-test.

To conclude, in this section we have compared the performances of the original variable set and the variable sets selected using the t-test and SPA. We found evidence supporting the use of the proposed model population analysis approach for the statistical comparison of different classification models.

#### **5. Conclusions**

A model population analysis approach for statistical model comparison is developed in this work. From our case studies, we have found strong evidence supporting the use of model population analysis for the comparison of different variable sets or different modeling methods in both regression and classification. The P values resulting from the proposed method, in combination with the sign of the mean of the D values, clearly show whether two models have the same performance or which model is significantly better.

#### **6. Acknowledgements**

This work was financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 20875104 and No. 21075138) and the Graduate degree thesis Innovation Foundation of Central South University (CX2010B057).

#### **7. References**


[1] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control, 19 (1974) 716.
[2] G.E. Schwarz, Estimating the dimension of a model, Annals of Statistics, 6 (1978) 461.
[3] S. Wold, Cross-validatory estimation of the number of components in factor and principal component analysis, Technometrics, 20 (1978) 397.
[4] Q.-S. Xu, Y.-Z. Liang, Monte Carlo cross validation, Chemometr. Intell. Lab., 56 (2001) 1.
[5] P. Filzmoser, B. Liebmann, K. Varmuza, Repeated double cross validation, J. Chemometr., 23 (2009) 160.
[6] J. Shao, Linear Model Selection by Cross-Validation, J. Am. Stat. Assoc., 88 (1993) 486.
[7] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. R. Stat. Soc. B, 36 (1974) 111.
[8] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model population analysis for variable selection, J. Chemometr., 24 (2009) 418.
[9] B. Efron, G. Gong, A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation, Am. Stat., 37 (1983) 36.
[10] B. Efron, R. Tibshirani, An introduction to the bootstrap, Chapman & Hall, (1993).
[11] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometr. Intell. Lab., 58 (2001) 109.
[12] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, Support vector machines and its applications in chemistry, Chemometr. Intell. Lab., 95 (2009) 188.
[13] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, H.-D. Li, X. Chen, A New Strategy of Outlier Detection for QSAR/QSPR, J. Comput. Chem., 31 (2010) 592.
[14] H.-D. Li, M.-M. Zeng, B.-B. Tan, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Recipe for revealing informative metabolites based on model population analysis, Metabolomics, 6 (2010) 353.
[15] Q. Wang, H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Noise incorporated subwindow permutation analysis for informative gene selection using support vector machines, Analyst, 136 (2011) 1456.
[16] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, B.-B. Tan, B.-C. Deng, C.-C. Lin, Recipe for Uncovering Predictive Genes using Support Vector Machines based on Model Population Analysis, IEEE/ACM T. Comput. Bi., 8 (2011) 1633.
[17] A.I. Bandos, H.E. Rockette, D. Gur, A permutation test sensitive to differences in areas for comparing ROC curves from a paired design, Statistics in Medicine, 24 (2005) 2873.
[18] Y. Ai-Jun, S. Xin-Yuan, Bayesian variable selection for disease classification using gene expression data, Bioinformatics, 26 (2009) 215.
[19] A. Vehtari, J. Lampinen, Bayesian model assessment and comparison using cross-validation predictive densities, Neural Computation, 14 (2002) 2439.
[20] T. Chen, E. Martin, Bayesian linear regression and variable selection for spectroscopic calibration, Anal. Chim. Acta, 631 (2009) 13.
[21] L. Breiman, Bagging Predictors, Mach. Learn., 24 (1996) 123.
[22] Y. Freund, R. Schapire, Experiments with a new boosting algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, (1996) 148.
[23] L. Breiman, Random Forests, Mach. Learn., 45 (2001) 5.
[24] K. Ting, J. Wells, S. Tan, S. Teng, G. Webb, Feature-subspace aggregating: ensembles for stable and unstable learners, Mach. Learn., 82 (2010) 375.
[25] H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, Ann. Math. Statist., 18 (1947) 50.
[26] V. Centner, D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of Uninformative Variables for Multivariate Calibration, Anal. Chem., 68 (1996) 3851.
[27] W. Cai, Y. Li, X. Shao, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemometr. Intell. Lab., 90 (2008) 188.
[28] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Anal. Chim. Acta, 648 (2009) 77.
[29] J.-H. Jiang, R.J. Berry, H.W. Siesler, Y. Ozaki, Wavelength Interval Selection in Multicomponent Spectral Analysis by Moving Window Partial Least-Squares Regression with Applications to Mid-Infrared and Near-Infrared Spectroscopic Data, Anal. Chem., 74 (2002) 3555.
[30] C. Reynes, S. de Souza, R. Sabatier, G. Figueres, B. Vidal, Selection of discriminant wavelength intervals in NIR spectrometry with genetic algorithms, J. Chemometr., 20 (2006) 136.
[31] Q.-J. Han, H.-L. Wu, C.-B. Cai, L. Xu, R.-Q. Yu, An ensemble of Monte Carlo uninformative variable elimination for wavelength selection, Anal. Chim. Acta, 612 (2008) 121.
[32] B.-B. Tan, Y.-Z. Liang, L.-Z. Yi, H.-D. Li, Z.-G. Zhou, X.-Y. Ji, J.-H. Deng, Identification of free fatty acids profiling of type 2 diabetes mellitus and exploring possible biomarkers by GC–MS coupled with chemometrics, Metabolomics, 6 (2009) 219.
[33] W. Fan, H.-D. Li, Y. Shan, H.-Y. Lv, H.-X. Zhang, Y.-Z. Liang, Classification of vinegar samples based on near infrared spectroscopy combined with wavelength selection, Analytical Methods, 3 (2011) 1872.
[34] Source codes of CARS-PLS for variable selection: http://code.google.com/p/carspls/
[35] Source codes of CARS-PLSLDA for variable selection: http://code.google.com/p/cars2009/
[36] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, Uncover the path from PCR to PLS via elastic component regression, Chemometr. Intell. Lab., 104 (2010) 341.
[37] M. Stone, R.J. Brooks, Continuum Regression: Cross-Validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression, J. R. Statist. Soc. B, 52 (1990) 237.
[38] A. Bjorkstrom, R. Sundberg, A generalized view on continuum regression, Scand. J. Statist., 26 (1999) 17.
[39] B.M. Wise, N.L. Ricker, Identification of finite impulse response models with continuum regression, J. Chemometr., 7 (1993) 1.
[40] J.H. Kalivas, Cyclic subspace regression with analysis of the hat matrix, Chemometr. Intell. Lab., 45 (1999) 215.
[41] Source codes of Elastic Component Regression: http://code.google.com/p/ecr/
[42] M. Barker, W. Rayens, Partial least squares for discrimination, J. Chemometr., 17 (2003) 166.
[43] J.A. Westerhuis, H.C.J. Hoefsloot, S. Smit, D.J. Vis, A.K. Smilde, E.J.J. van Velzen, J.P.M. van Duijnhoven, F.A. van Dorsten, Assessment of PLSDA cross validation, Metabolomics, 4 (2008) 81.
[44] L.-Z. Yi, J. He, Y.-Z. Liang, D.-L. Yuan, F.-T. Chau, Plasma fatty acid metabolic profiling and biomarkers of type 2 diabetes mellitus based on GC/MS and PLS-LDA, FEBS Letters, 580 (2006) 6837.
[45] J. Trygg, E. Holmes, T. Lundstedt, Chemometrics in Metabonomics, Journal of Proteome Research, 6 (2006) 469.
[46] Source codes of Subwindow Permutation Analysis: http://code.google.com/p/spa2010/


## **Critical Aspects of Supervised Pattern Recognition Methods for Interpreting Compositional Data**

A. Gustavo González *Department of Analytical Chemistry, University of Seville, Seville, Spain* 

#### **1. Introduction**


Many multivariate data sets of interest to scientists are called compositional or "closed" data sets and consist essentially of relative proportions. A recent web search for "chemical compositional data" led to more than 2,730,000 results across different fields and disciplines, but especially agricultural and food sciences (August 2011, using the Google search engine). The driving causes for the composition of foods lie in four factors (González, 2007): the Genetic factor (genetic control and manipulation of the original specimens), the Environmental factor (soil, climate and symbiotic and parasitic organisms), the Agricultural factor (cultures, crop, irrigation, fertilizers and harvest practices) and the Processing factor (post-harvest manipulation, preservation, additives, conversion to another food preparation and finished product). But the influences of these factors are hidden behind the analytical measurements and can only be inferred and uncovered by using suitable chemometric procedures.

Chemometrics is a term originally coined by Svante Wold and could be defined as "the art of extracting chemically relevant information from data produced in chemical experiments" (Wold, 1995). Chemometrics can also be defined as the application of mathematical, statistical, graphical or symbolic methods to maximize the chemical information which can be extracted from data (Rock, 1985). Within the jargon of chemometrics some other terms are very common; among them, multivariate analysis and pattern recognition are often used. Chemometricians use the term Multivariate Analysis in reference to the different approaches (mathematical, statistical, graphical...) that consider samples featured by multiple descriptors simultaneously. Pattern recognition is a branch of Artificial Intelligence that seeks to identify similarities and regularities present in the data in order to attain natural classification and grouping. When applied to chemical compositional data, pattern recognition methods can be seen as multivariate analysis applied to chemical measurements to find classification rules for discrimination purposes. Depending on our knowledge of the category or class membership of the data set, two approaches can be applied: supervised or unsupervised learning (pattern recognition).

Supervised learning methods develop rules for the classification of unknown samples on the basis of a group of samples with known categories (the known set). Unsupervised learning methods instead do not assume any known set, and the goal is to find clusters of objects which may be assigned to classes. There is hardly any quantitative analytical method that does not make use of chemometrics. Even if one confines the scope to supervised learning pattern recognition, these chemometric techniques are increasingly being applied to compositional data for classification and authentication purposes. The discrimination of geographical origin, the assignation to Denominations of Origin, and the classification of varieties and cultivars are typical issues in agriculture and food science.

There are analytical techniques, such as those based on sensor arrays (electronic nose and electronic tongue), that would not be imaginable without the help of chemometrics. Thus, one could celebrate the triumph of chemometrics... so what is wrong? (Pretsch & Wilkin, 2006). The answer could be supported by a very well-known quotation attributed to the British politician D'Israeli (Defernez & Kemsley, 1997), modified by us as follows: "There are three kinds of lies: lies, damned lies and chemometrics". Here we have changed the original word "statistics" into "chemometrics" in order to point out the suspicion towards some chemometric techniques even within the scientific community. There is no doubt that unwarranted reliance on poorly understood chemometric methods is responsible for such suspicion.

Moreover, chemometric techniques are applied using statistical/chemometric software packages that work as black boxes for the final users. Sometimes the blind use of "intelligent problem solvers" or similar wizards, with many hidden options left at their defaults, may lead to misleading results. Accordingly, it is advisable to use software packages with full control over parameters and options and, obviously, this software should be used by a chemometric *connaisseur*.

There are special statistical methods intended for closed data such as compositional ones (Aitchison, 2003; Egozcue et al., 2003; Varmuza & Filzmoser, 2009), but a detailed description of these methods may be outside the scope of this chapter.

#### **2. About the data set**

Supervised learning techniques are used either for developing classification rules which accurately predict the classification of unknown patterns or samples (Kryger, 1981), or for finding calibration relationships between one set of measurements which are easy or cheap to acquire and other measurements which are expensive or labour intensive, in order to predict the latter (Naes et al., 2004). The simplest calibration problem consists of predicting a single response (y-variable) from a known predictor (x-variable) and can be solved by using ordinary linear regression (OLR). When fitting a single response from several predictive variables, multiple linear regression (MLR) may be used; but, for the sake of avoiding multicollinearity drawbacks, other procedures such as principal component regression (PCR) or partial least squares regression (PLS) are a good choice. When faced with several response variables, multivariate partial least squares (PLS2) techniques have to be used. These procedures are common in multivariate calibration (analyte concentrations from NIR data) or linear free energy relationships (pKa values from molecular descriptors). However, in some instances, like quantitative structure-activity relationships (QSAR), nonlinear strategies are needed, such as quadratic partial least squares regression (QPLS) or regression procedures based on artificial neural networks or support vector machines.


All these calibration procedures have to be suitably validated by using validation procedures based on knowledge of the class memberships of the objects, in a similar way as discussed below in the part of this chapter devoted to supervised learning for classification.
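As an illustration of the calibration workflow just outlined, the following sketch fits a PLS model and picks the number of latent variables by cross-validation. It is a generic example using scikit-learn rather than any specific procedure from the cited works, and the function name and default settings are ours.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score


def pls_calibration(X, y, max_components=10, cv=10):
    """Fit a PLS calibration model, choosing the number of latent variables
    by cross-validated R^2 (one simple validation strategy among those
    mentioned above)."""
    best_k, best_score = 1, -np.inf
    for k in range(1, min(max_components, X.shape[1]) + 1):
        score = cross_val_score(PLSRegression(n_components=k), X, y, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return PLSRegression(n_components=best_k).fit(X, y), best_k
```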

Let us assume that a known set of samples is available, where the category or class membership of every sample is known *a priori*. Then a suitable planning of the data-acquisition process is needed. At this point, chemical experience, *savoir faire* and intuition are invaluable in deciding which measurements should be made on the samples and which variables of these measurements are most likely to contain class information.

In the case of compositional data, many analytical techniques can be chosen. Analytical procedures based on these techniques are then selected to be applied to the samples. The selected analytical methods have to be fully validated, with an estimation of their uncertainty (González & Herrador, 2007), and carried out under Quality Assurance conditions (equipment within specifications, qualified staff, and documentation written as Standard Operational Procedures...). Measurements should be carried out at least in duplicate and according to a given experimental design to ensure randomization and avoid systematic trends.

Difficulties arise when the concentration of an element is below the detection limit (DL). It is often standard practice to report these data simply as '<DL' values. Such 'censoring' of data, however, can complicate all subsequent statistical analyses. The best method to use generally depends on the amount of data below the detection limit, the size of the data set, and the probability distribution of the measurements. When the number of '< DL' observations is small, replacing them with a constant is generally satisfactory (Clarke, 1998). The values that are commonly used to replace the '< DL' values are 0, DL, or DL/2. Distributional methods such as the marginal maximum likelihood estimation (Chung, 1993) or more robust techniques (Helsel, 1990) are often required when a large number of '< DL' observations are present.
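A minimal sketch of the simple substitution strategy described above is given below; it assumes the censored '<DL' entries have been coded as NaN and that the detection limit of each variable is known, and the function name and the DL/2 default are ours.

```python
import numpy as np


def replace_censored(X, detection_limits, factor=0.5):
    """Replace '<DL' entries (coded as NaN) by factor * DL for each variable.

    X : (n_samples, n_variables) array with NaN marking censored values.
    detection_limits : length-p sequence of detection limits per variable.
    """
    X = np.array(X, dtype=float, copy=True)
    for j, dl in enumerate(detection_limits):
        mask = np.isnan(X[:, j])
        X[mask, j] = factor * dl   # 0, DL or DL/2 are the usual replacement choices
    return X
```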

After all measurements are done, we can build the corresponding data table or data matrix. A sample, object or pattern is described by a set of p variables, features or descriptors. All descriptors of one pattern form a 'pattern vector', and accordingly a given pattern i can be seen as a vector $\mathbf{x}_i$ whose components are $x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{ip}$ in the vectorial space defined by the features. In matricial form, pattern vectors are row vectors. If we have n patterns, we can build a data matrix $X_{n \times p}$ by assembling the different row pattern vectors. A change in perspective is also possible: a given feature j can be seen as a column vector $\mathbf{x}_j$ with components $x_{1j}, x_{2j}, \ldots, x_{ij}, \ldots, x_{nj}$ in the vectorial space defined by the patterns. We can also construct the data matrix by assembling the different feature column vectors. Accordingly, the data matrix can be considered as describing the patterns in terms of features or *vice versa*. This leads to two main classes of chemometric techniques, called Q-mode and R-mode respectively. R-mode techniques are concerned with the relationships among the features of the experiment and examine the interplay between the columns of the data matrix. A starting point for R-mode procedures is the covariance matrix of the mean-centered variables, $C = X^{T}X$, whose elements are given by

$$c_{jk} = \frac{1}{n}\sum_{l=1}^{n}\left(x_{lj} - \overline{x}_{j}\right)\left(x_{lk} - \overline{x}_{k}\right)$$

where $\overline{x}_j$ and $\overline{x}_k$ are the means of the observations on the *j*th and *k*th features.


If working with autoscaled data, the sample correlation matrix and the covariance matrix are identical. The element $r_{jk}$ of the correlation matrix R represents the cosine between each pair of column vectors and is given by $r_{jk} = c_{jk}/\sqrt{c_{jj}c_{kk}}$. The diagonal elements of R are always unity. The alternative viewpoint considers the relationships between patterns, the Q-mode technique. This way normally starts with a matrix of distances between the objects in the n-dimensional pattern space to study the clustering of samples. Typical metric measurements are Euclidean, Minkowski, Manhattan, Hamming, Tanimoto and Mahalanobis distances (Varmuza, 1980).
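The R-mode quantities above are straightforward to compute; the sketch below follows the 1/n convention of the element formula for $c_{jk}$ and is only an illustration (the function name is ours).

```python
import numpy as np


def r_mode_matrices(X):
    """Covariance matrix C of the mean-centred columns (1/n normalisation,
    as in the c_jk formula above) and the correlation matrix R with
    r_jk = c_jk / sqrt(c_jj * c_kk)."""
    Xc = X - X.mean(axis=0)        # mean-centre each feature (column)
    n = X.shape[0]
    C = (Xc.T @ Xc) / n
    d = np.sqrt(np.diag(C))
    R = C / np.outer(d, d)         # cosine between pairs of column vectors
    return C, R
```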

#### **3. Inspect data matrix**

Once the data matrix has been built, it should be fully examined in order to ensure the suitable application of Supervised Learning methodology. A typical undesirable issue is the existence of missing data. Holes in the data matrix must be avoided; however some measurements may not have been recorded or are impossible to obtain experimentally due to insufficient sample amounts or due to high costs. Besides, data can be missing due to various malfunctions of the instruments, or responses can be outside the instrument range.

As stated above, most chemometric techniques of data analysis do not allow for data gaps, and therefore different methods have been applied for handling missing values in the data matrix. Aside from the extreme situations of casewise deletion or mean substitution, the use of iterative algorithms (IA) is a promising tool. Each iteration consists of two steps. The first step performs estimation of model parameters just as if there were no missing data. The second step finds the conditional expectation of the missing elements given the observed data and the current estimated parameters. The detailed procedure depends on the particular application. The typical IA used in Principal Component Analysis can be summarized as (Walczak & Massart, 2001):

1. Fill in missing elements with their initial estimates (expected values, calculated as the mean of the corresponding row's and column's means)
2. Perform singular value decomposition of the complete data set
3. Reconstruct X with the predefined number of factors
4. Replace the missing elements with the predicted values and go to step 2 until convergence.
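As a minimal sketch of steps 1-4, assuming the missing entries are coded as NaN (the function name, the fixed number of factors and the convergence tolerance are ours, not from the cited work):

```python
import numpy as np


def iterative_pca_impute(X, n_factors=2, tol=1e-8, max_iter=500):
    """Impute NaN entries of X by iterating SVD reconstruction (steps 1-4)."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initial estimates = mean of the corresponding row and column means.
    col_means = np.nanmean(X, axis=0)
    row_means = np.nanmean(X, axis=1)
    init = 0.5 * (row_means[:, None] + col_means[None, :])
    X[missing] = init[missing]
    for _ in range(max_iter):
        # Step 2: singular value decomposition of the completed data set.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        # Step 3: reconstruct X with the predefined number of factors.
        X_hat = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]
        # Step 4: replace the missing elements with the predicted values.
        change = np.max(np.abs(X[missing] - X_hat[missing])) if missing.any() else 0.0
        X[missing] = X_hat[missing]
        if change < tol:
            break
    return X
```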


Replacement of missing values or censored data with any value is always risky since this can substantially change the correlation in the data. It is possible to deal with both missing values and outliers simultaneously (Stanimirova et al., 2007). An excellent revision dealing with zeros and missing values in compositional data sets using non-parametric imputation has been performed by Martin-Fernández et al. (2003).

On the other hand, a number of chemometric procedures are based on the assumption of normality of features. Accordingly, features should be assessed for normality. The wellknown Kolmogorov-Smirnov, Shapiro-Wilks and Lilliefors tests are often used (González,


Some simple presentations such as box-and-whisker plots help in the visual identification of outliers and other unusual characteristics of the data set. The box-and-whisker plot, accompanied by a numerical scale, is a graphical representation of the five-number summary of the data set (Miller & Miller, 2005), in which the data set is described by its extremes, its lower and upper quartiles and the median, and which shows at a glance the spread and the symmetry of the data set. Box-and-whisker plots may reveal suspicious patterns that should be tested for outliers. Abnormal data can mislead chemometric techniques and lead to wrong conclusions, especially when outliers are present in the training set. Univariate outlier tests such as the Dean and Dixon (1951) or Grubbs (1969) assays are not suitable; instead, multivariate criteria for outlier detection are more advisable. The techniques based on the Mahalanobis distance (Gemperline & Boyer, 1995) and the hat matrix leverage (Hoaglin & Welsch, 1978) have often been used for decades. The hat matrix $H = X(X^{T}X)^{-1}X^{T}$ has diagonal values $h_{ii}$ called leverage values. Patterns having leverage values higher than 2p/n are commonly considered outliers. However, these methods are unreliable for multivariate outlier detection. Numerous methods for outlier detection have been based on the singular value decomposition or Principal Component Analysis (PCA) (Jollife, 2002). Soft Independent Modelling of Class Analogy (SIMCA) has also been applied to outlier detection (Mertens et al., 1994). Once outliers have been detected, researchers usually remove them from the data set, but outliers could instead be corrected before applying the definitive mathematical procedures by using robust algorithms (Daszykowski et al., 2007). Robust methods give better results, especially some improved algorithms such as resampling by the half-means (RHM) and the smallest half-volume (SHV) (Egan & Morgan, 1998).
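A sketch of the leverage rule mentioned above (the 2p/n cut-off) might look as follows; the pseudo-inverse is used only to guard against a rank-deficient $X^{T}X$, and the function name is ours.

```python
import numpy as np


def leverage_outliers(X):
    """Flag patterns whose hat-matrix leverage h_ii exceeds 2p/n."""
    n, p = X.shape
    # H = X (X^T X)^{-1} X^T ; pinv guards against a rank-deficient X^T X.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    leverage = np.diag(H)
    return np.where(leverage > 2.0 * p / n)[0], leverage
```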

Within the field of food authentication, the wrong conclusions are, however, mostly due to data sets that do not keep all aspects of the food characterisation (partial or skewed data set) or do merge data measured with different techniques (Aparicio & Aparicio-Ruiz, 2002). A classical example is the assessment of the geographical origin of Italian virgin olive oils by Artificial Neural Networks (ANN). The differences between oils were mainly due to the use of different chromatographic columns (packed columns against capillary columns) when quantifying the free fatty acid (FFA) profile of the oils from Southern and Northern Italy respectively (Zupan et al. , 1994). The neural network thus mostly learned to recognise the chromatographic columns.

#### **4. Data pre-treatment**

Data transformation (scaling) can be applied either for statistical reasons, to optimise the analysis, or for chemical reasons. Raw compositional data are expressed in concentration units that can differ by orders of magnitude (e.g., percentage, ppm or ppb), so the features with the largest absolute values are likely to dominate and influence the rule development and the classification process. Thus, for statistical needs, a transformation of the raw data can be applied to make the feature values comparable. Autoscaling and column range scaling are the most common transformations (Sharaf et al., 1986). In the autoscaling or Z-transformation, raw data are transformed according to


$$\mathbf{x}\_{ij}^{'} = \frac{\mathbf{x}\_{ij} - \overline{\mathbf{x}}\_{j}}{s\_{j}} \text{ with } s\_{j} = \sqrt{\frac{\sum\_{i=1}^{n} (\mathbf{x}\_{ij} - \overline{\mathbf{x}}\_{j})^{2}}{n - 1}} \tag{1}$$

and can be seen as a parametric scaling by column standardisation, leading to $\overline{x}'_{j} = 0$ and $s'_{j} = 1$.

Column range scaling or minimax scaling involves the following transformation

$$\mathbf{x}\_{ij}^{'} = \frac{\mathbf{x}\_{ij} - \min\_{j} \left( \mathbf{x}\_{ij} \right)}{\max\_{j} \left( \mathbf{x}\_{ij} \right) - \min\_{j} \left( \mathbf{x}\_{ij} \right)} \tag{2}$$

Now the transformed data verify $0 \le x'_{ij} \le 1$. Range scaling can be considered as an interpolation of the data within the interval (0,1). It is a non-parametric scaling but is sensitive to outliers. When the data contain outliers and preprocessing is necessary, one should consider a robust way of data preprocessing, in the context of the robust Soft Independent Modelling of Class Analogy (SIMCA) method (Daszykowski et al., 2007).
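Both transformations of Eqs. (1) and (2) are one-liners on a data matrix; the following sketch is merely illustrative (the function names are ours).

```python
import numpy as np


def autoscale(X):
    """Z-transformation of Eq. (1): centre each column and divide by its
    standard deviation (n - 1 denominator)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def range_scale(X):
    """Minimax scaling of Eq. (2): interpolate each column into [0, 1]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)
```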

The second kind of transformations is done for chemical reasons and comprises the so-called "constant row-sum" and "normalization variable" transformations (Johnson & Ehrlich, 2002). Dealing with compositional data, concentrations can vary widely due to dilution away from a source. In the case of contaminated sediment investigations, for example, concentrations may decrease exponentially away from the effluent pipe. However, if the relative proportions of the individual analytes remain relatively constant, then we would infer a single-source scenario coupled with dilution far away from the source. Thus, a transformation is needed to normalize concentration/dilution effects. Commonly this is done by transforming each value to a fractional ratio or percentage, where each concentration value is divided by the total concentration of the sample. This is called the row profile or constant row-sum transformation because the sum of analyte concentrations in each sample (across rows) equals unity or 100%:

$$\mathbf{x}\_{ij}^{'} = \frac{\mathbf{x}\_{ij}}{\sum\_{j=1}^{p} \mathbf{x}\_{ij}} \tag{3}$$

leading to $\sum_{j=1}^{p} x'_{ij} = 1$ (or 100%). An alternative is to normalize the data with respect to a single species or compound set as a reference congener, the normalization variable. This transformation involves setting the value of the normalization feature to unity, and the values of all other features to some proportion of 1.0, such that their ratios with respect to the normalization feature remain the same as in the original metric.
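A short sketch of the two chemically motivated transformations (the constant row-sum closure of Eq. (3) and normalization to a reference variable); the function names are ours.

```python
import numpy as np


def row_profile(X):
    """Constant row-sum transformation of Eq. (3): each sample sums to 1."""
    return X / X.sum(axis=1, keepdims=True)


def normalize_to_reference(X, ref_col):
    """Divide every feature by the chosen normalization variable so that the
    reference feature becomes 1 and all ratios to it are preserved."""
    return X / X[:, [ref_col]]
```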

When n > p, the rank of the data matrix is p (if all features are independent). Thus, autoscaling and minimax transformations do not change the data dimensionality, because these treatments do not induce any bound between the features.


Row profiles, instead, build a relationship between the feature scores (the constant row-sum) that decreases the data dimensionality by 1, and the rank of the data matrix is then p-1. Accordingly, the patterns fall on a hypersurface in the feature space, and it is advisable to remove one feature to avoid problems involving matrix inversion when the rank of the matrix is less than p.

As a final remark, it should be realized that when using some Supervised Learning techniques like SIMCA, the scaling of the data set is carried out only over the samples belonging to the same class (separate scaling). This is due to the fundamentals of the methodology itself and has a beneficial effect on the classification (Derde et al., 1982).

#### **5. Feature selection and extraction**

Irrelevant features are very expensive ones, because they contribute only noise to the chemical information; even more expensive may be simply wrong features. Accordingly, they should be eliminated in order to avoid disturbing the classification. In almost all chemical applications of pattern recognition the number of original raw features is too large, and a reduction of the dimensionality is necessary. Features which are essential for classification purposes are often called intrinsic features. A common practice to avoid redundant information consists of computing the correlation matrix of features, R. Pairs of highly correlated features can either be combined or one of them deleted. Researchers should be aware that the number of independent descriptors or features, p, must be much smaller than the number of patterns, n. Otherwise, we can build a classification rule that separates even randomly selected classes of the training set (Varmuza, 1980). This assumes that the set of features is linearly independent (actually, it is the basis of the vectorial space) and that the number of features is the dimensionality of the vectorial space. Accordingly, the true dimensionality should be evaluated, for instance from an eigenanalysis (PCA) of the data matrix, by extracting the proper number of factors, which corresponds to the true dimensionality (d) of the space. The most efficient criteria for extracting the proper number of underlying factors are based on the Malinowski indicator function (Malinowski, 2002) and Wold's cross-validation procedure (Wold, 1978). For most classification methods, a ratio n/d > 3 is advised and > 10 is desirable. However, PCA-based methods like SIMCA or Partial Least Squares Discriminant Analysis (PLS-DA) can be applied without problem when p >> n. Even in these instances there are suitable methods for selecting a subset of features and building a final model based on it.
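As an illustration of the two screening ideas just described (correlation-based redundancy removal and estimating the true dimensionality from an eigenanalysis), here is a rough sketch; the 0.95 correlation threshold and the explained-variance cut-off are arbitrary stand-ins for the Malinowski or cross-validation criteria cited above, and the function names are ours.

```python
import numpy as np


def correlated_pairs(X, threshold=0.95):
    """Pairs (j, k) of features whose absolute correlation exceeds threshold;
    one member of each pair may be removed or the pair combined."""
    R = np.corrcoef(X, rowvar=False)
    rows, cols = np.where(np.triu(np.abs(R), k=1) > threshold)
    return list(zip(rows, cols))


def effective_dimensionality(X, var_explained=0.95):
    """Rough estimate of the true dimensionality d from an eigenanalysis (PCA)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratio, var_explained) + 1)
```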

Therefore, when feature selection is advisable, weighting methods determine the importance of the scaled features for a certain classification problem. Consider a pattern vector $\vec{x}\_i = (x\_{i1}, x\_{i2}, \ldots, x\_{ip})$. Assuming that the data matrix X can be partitioned into a number Q of classes, let $\vec{x}\_i^{(C)}$ be a pattern vector belonging to class C. The averaged patterns $\vec{m}$ and $\vec{m}^{(C)}$ represent the general mean vector and the C-class mean, according to:

$$
\vec{m} = \frac{\sum\_{i=1}^{n} \vec{x}\_i}{n} \quad \text{and} \quad \vec{m}^{(C)} = \frac{\sum\_{i=1}^{n(C)} \vec{x}\_i^{(C)}}{n(C)} \tag{4}
$$

When they are applied to a selected feature j we have


$$m\_j = \frac{\sum\_{i=1}^n x\_{ij}}{n} \quad \text{and} \quad m\_j^{(\text{C})} = \frac{\sum\_{i=1}^{n(\text{C})} x\_{ij}^{(\text{C})}}{n(\text{C})} \tag{5}$$

Where n(C) is the number of patterns of class C.

Accordingly, we can explore the inter-class scatter matrix as well as the intra-class scatter matrix. The total scatter matrix can be obtained as

$$T = \sum\_{i=1}^{n} (\vec{\boldsymbol{x}}\_i - \vec{\boldsymbol{m}})(\vec{\boldsymbol{x}}\_i - \vec{\boldsymbol{m}})^T \tag{6}$$

and its elements as

$$T\_{jk} = \sum\_{i=1}^{n} (\mathbf{x}\_{ij} - m\_j)(\mathbf{x}\_{ik} - m\_k) \tag{7}$$

The within classes scatter matrix, together with its element is given by

$$\begin{aligned} \mathcal{W} &= \sum\_{\mathbb{C}=1}^{Q} \sum\_{i=1}^{n(\mathbb{C})} (\bar{\mathbf{x}}\_{i}^{\text{(C)}} - \bar{m}^{\text{(C)}}) (\bar{\mathbf{x}}\_{i}^{\text{(C)}} - \bar{m}^{\text{(C)}})^{T} \\ \mathcal{W}\_{jk} &= \sum\_{\mathbb{C}=1}^{Q} \sum\_{i=1}^{n(\mathbb{C})} (\mathbf{x}\_{ij}^{\text{(C)}} - m\_{j}^{\text{(C)}}) (\mathbf{x}\_{ik}^{\text{(C)}} - m\_{k}^{\text{(C)}}) \end{aligned} \tag{8}$$

And the between classes matrix and element,

$$\begin{aligned} B &= \sum\_{\mathbf{C}=1}^{Q} n(\mathbf{C}) (\vec{m}^{\langle \mathbf{C} \rangle} - \vec{m}) (\vec{m}^{\langle \mathbf{C} \rangle} - \vec{m})^{T} \\ B\_{jk} &= \sum\_{\mathbf{C}=1}^{Q} n(\mathbf{C}) (m\_{j}^{\langle \mathbf{C} \rangle} - m\_{j}) (m\_{k}^{\langle \mathbf{C} \rangle} - m\_{k}) \end{aligned} \tag{9}$$

For a case involving two classes 1 and 2 and one feature j we have the following:

$$\begin{aligned} T\_{jj} &= \sum\_{i=1}^{n} (x\_{ij} - m\_j)^2 \\ W\_{jj} &= \sum\_{i=1}^{n(1)} (x\_{ij}^{(1)} - m\_j^{(1)})^2 + \sum\_{i=1}^{n(2)} (x\_{ij}^{(2)} - m\_j^{(2)})^2 \\ B\_{jj} &= n(1)(m\_j^{(1)} - m\_j)^2 + n(2)(m\_j^{(2)} - m\_j)^2 \end{aligned} \tag{10}$$
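The scatter matrices of eqs. (6)-(9) translate directly into code. The sketch below is a straightforward NumPy rendering (our own helper, not a library routine); it also exposes the identity T = W + B, a convenient numerical check.

```python
import numpy as np

def scatter_matrices(X, y):
    """Total (T), within-class (W) and between-class (B) scatter matrices, eqs. (6)-(9).
    X is the n x p data matrix and y holds the class label of each pattern."""
    m = X.mean(axis=0)                                   # general mean vector, eq. (4)
    T = (X - m).T @ (X - m)                              # eq. (6)
    W = np.zeros_like(T)
    B = np.zeros_like(T)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                             # class mean, eq. (5)
        W += (Xc - mc).T @ (Xc - mc)                     # eq. (8)
        B += len(Xc) * np.outer(mc - m, mc - m)          # eq. (9)
    return T, W, B

# For any partition of X into classes, np.allclose(T, W + B) returns True.
```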

Weights for the features used in Supervised Learning techniques can then be extracted from their relative importance in discriminating classes pairwise. The largest weight corresponds to the most important feature. The most common weighting factors are:

 Variance weights (VW) (Kowalski & Bender, 1972):

$$VW\_j = \frac{B\_{jj}}{W\_{jj}}$$

 Fisher weights (FW) (Duda et al., 2000):

$$FW\_j = \frac{(m\_j^{(1)} - m\_j^{(2)})^2}{(s\_j^{(1)})^2 + (s\_j^{(2)})^2}$$

 Coomans weights (g) (Coomans et al., 1978):


$$g\_j = \frac{\left| m\_j^{(1)} - m\_j^{(2)} \right|}{s\_j^{(1)} + s\_j^{(2)}} \quad \text{with} \quad s\_j^{(C)} = \sqrt{\frac{\sum\_{i=1}^{n(C)} (x\_{ij}^{(C)} - m\_j^{(C)})^2}{n(C)}}$$

A multi-group criterion is the so-called Wilks' Λ or McCabe U statistic (McCabe, 1975). This is a general statistic used as a measure for testing the difference among group centroids. All classes are assumed to have homogeneous variance-covariance matrices and the statistic is defined as

$$\Lambda = \frac{\det W}{\det T} = \frac{SSW}{SST} = \frac{SSW}{SSB + SSW} \tag{11}$$

Where SSW, SSB and SST refer to the sums of squares corresponding to the scatter matrices W, B and T, respectively, as defined above. Remembering that the ratio SSB/SSW is a monotone function of the coefficient of canonical correlation η² (with Λ = 1 − η²), when η² → 1 for intrinsic features, Λ → 0 and the difference among the centroids is more significant. Before calculation of the statistic, the data should be autoscaled. This latter criterion, as well as the largest values of Rao's distance or the Mahalanobis distance, is generally used in Stepwise Discriminant Analysis (Coomans et al., 1979). Certain Supervised Learning techniques enable feature selection according to their own philosophy. Thus, for instance, SIMCA tests the intrinsic features according to the values of two indices called discriminating power and modelling power (Kvalheim & Karstang, 1992). Using ANNs for variable selection is attractive since one can globally adapt the variable selector together with the classifier by using the so-called "pruning" facilities. Pruning is a heuristic method for feature selection that builds networks which do not use certain variables as inputs. Thus, various combinations of input features can be added and removed, building new networks for each (Maier et al., 1998).
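Under the assumption that the reconstructed Fisher-weight formula above is the intended one, a minimal sketch of the univariate weights and of Wilks' Λ for a two-class problem could look as follows (function and variable names are illustrative only):

```python
import numpy as np

def feature_weights(x1, x2):
    """Variance, Fisher and Coomans weights plus Wilks' lambda for one feature j,
    given the values x1 and x2 of that feature in class 1 and class 2."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(), x2.mean()
    m = np.concatenate([x1, x2]).mean()
    s1 = np.sqrt(((x1 - m1) ** 2).sum() / n1)                 # s_j^(1) as defined above
    s2 = np.sqrt(((x2 - m2) ** 2).sum() / n2)                 # s_j^(2)
    W_jj = ((x1 - m1) ** 2).sum() + ((x2 - m2) ** 2).sum()    # eq. (10)
    B_jj = n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2            # eq. (10)
    VW = B_jj / W_jj                                          # variance weight
    FW = (m1 - m2) ** 2 / (s1 ** 2 + s2 ** 2)                 # Fisher weight
    g = abs(m1 - m2) / (s1 + s2)                              # Coomans weight
    wilks = W_jj / (W_jj + B_jj)                              # eq. (11), single-feature case
    return VW, FW, g, wilks
```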

Genetic algorithms are also very useful for feature selection in fast methods such as PLS (Leardi & Lupiañez, 1998).

#### **6. Development of the decision rule**

In order to focus the commonly used Supervised Learning techniques of pattern recognition we have selected the following methods: K-Nearest Neighbours (KNN) (Silverman & Jones, 1989), Linear Discriminant Analysis (LDA) (Coomans et al., 1979), Canonical Variate Analysis (CVA) (Cole & Phelps, 1979), Soft Independent Modelling of Class Analogy (SIMCA) (Wold, 1976), Unequal dispersed classes (UNEQ) (Derde & Massart, 1986), PLS-DA (Stahle & Wold, 1987), Procrustes Discriminant Analysis (PDA) (González-Arjona et al., 2001), and methods based on ANN such as Multi-Layer Perceptrons (MLP) (Zupan & Gasteiger, 1993; Bishop, 2000), Supervised Kohonen Networks (Melssen et al., 2006),

Kohonen Class-Modelling (KCM) (Marini et al., 2005), and Probabilistic Neural Networks (PNN) (Streit & Luginbuhl, 1994). Recently, new special classification techniques have arisen. A procedure called Classification And Influence Matrix Analysis (CAIMAN) has been introduced by Todeschini *et al* (2007). The method is based on the leverage matrix: it models each class by means of the class dispersion matrix and calculates the leverage of each sample with respect to each class model space. About two decades ago, another revolutionary classification (and regression) technique based on statistical learning theory and kernel latent variables was proposed: Support Vector Machines (SVM) (Vapnik, 1998; Abe, 2005; Burges, 1998). The purpose of SVM is to separate the classes in a vectorial space independently of the probabilistic distribution of the pattern vectors in the data set (Berrueta et al., 2007). This separation is performed with the particular hyperplane which maximizes a quantity called the margin. The margin is the distance from a hyperplane separating the classes to the nearest point in the data set (Pardo & Sberveglieri, 2005). The training pattern vectors closest to the separation boundary are called *support vectors.* When dealing with a non-linear boundary, the kernel method is applied. The key idea of the kernel method is a transformation of the original vectorial space (input space) into a high-dimensional Hilbert space (feature space), in which the classes can be separated linearly. The main advantages of SVM against its most direct competitor, ANN, are the easy avoidance of overfitting by using a penalty parameter and the finding of a deterministic global minimum, against the non-deterministic local minimum attained with ANN.

Some of the mentioned methods are equivalent. Let us consider some couples: CVA and LDA, and PLS-DA and PDA. CVA attempts to find linear combinations of variables from each set that exhibit maximum correlation. These may be referred to as canonical variates, and the data can be displayed as a scatterplot of one against the other. The problem of maximizing the correlation can be formulated as an eigenanalysis problem, with the largest eigenvalue providing the maximized correlation and the eigenvectors giving the canonical variates. Loadings of the original features in the canonical variates and cumulative proportions of the eigenvalues are interpreted, partly by analogy with PCA. Note that if one set of features consists of dummy variables giving group indicators, then CVA is mathematically identical to LDA (González-Arjona et al., 2006). PLS-DA finds latent variables in the feature space which have a maximum covariance with the y variable. PDA may be considered equivalent to PLS-DA. The only difference is that in PDA the eigenvectors are obtained from the covariance matrix Z<sup>T</sup>Z instead of X<sup>T</sup>X, with Z = Y<sup>T</sup>X, where Y is the membership target matrix constructed with ones and zeros: for a three-class problem, sample labels are 001, 010 and 100. Accordingly, we can consider CVA equivalent to LDA and PLS-DA equivalent to PDA.
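A small sketch may make the PDA construction concrete: the membership matrix Y is built from one-hot labels such as 001, 010 and 100, and the eigenanalysis is carried out on Z<sup>T</sup>Z with Z = Y<sup>T</sup>X. This is only a schematic illustration of that step, not an implementation of PDA or PLS-DA; all names are ours.

```python
import numpy as np

def membership_matrix(labels, classes):
    """One-hot class membership matrix Y (n x Q), rows like 001, 010, 100."""
    return np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])

rng = np.random.default_rng(0)
X = rng.random((9, 4))                      # toy data: 9 patterns, 4 features
labels = list("AAABBBCCC")
Y = membership_matrix(labels, classes=["A", "B", "C"])

Z = Y.T @ X                                 # Q x p matrix used by PDA
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)  # eigenvectors of Z^T Z instead of X^T X
```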

Researchers should be aware of applying the proper methods according to the nature and goals of the chemical problem. As Daszykowski and Walczak pointed out in their excellent survey (Daszykowski & Walczak, 2006), in many applications unsupervised methods such as PCA are used for classification purposes instead of the supervised approach. If the data set is well structured, then the PCA-scores plot can reveal groupings of patterns with different origin, although the lack of such groups in the PCA space does not necessarily mean that there is no statistical difference between these samples. PCA by definition maximizes data variance, but the main variance cannot necessarily be associated with the studied effect (for instance, sample origin). Evidently, PCA can be used for exploration, compression and visualization of data trends, but it cannot be used as a Supervised Learning classification method.

On the other hand, according to the nature of the chemical problem, some supervised techniques perform better than others, because of their own fundamentals and scope. In order to consider the different possibilities, four paradigms can be envisaged:

1. *Parametric/non-parametric techniques*: This first distinction can be made between techniques that take account of the information on the population distribution. Non-parametric techniques such as KNN, ANN, CAIMAN and SVM make no assumption on the population distribution, while parametric methods (LDA, SIMCA, UNEQ, PLS-DA) are based on the information of the distribution functions. LDA and UNEQ are based on the assumption that the population distributions are multivariate normally distributed. SIMCA is a parametric method that constructs a PCA model for each class separately and assumes that the residuals are normally distributed. PLS-DA is also a parametric technique because the prediction of class memberships is performed by means of a model that can be formulated as a regression equation of the Y matrix (class membership codes) against the X matrix (González-Arjona et al., 1999).

2. *Discriminating (hard)/Class-Modelling (soft) techniques*: Pure classification, discriminating or hard classification techniques are said to apply for the first level of Pattern Recognition, where objects are classified into either of a number of defined classes (Albano et al., 1978). These methods operate by dividing the hyperspace into as many regions as the number of classes so that, if a sample falls in the region of space corresponding to a particular category, it is classified as belonging to that category. These kinds of methods include LDA, KNN, PLS-DA, MLP, PNN and SVM. On the other hand, Class-Modelling techniques build frontiers between each class and the rest of the universe. The decision rule for a given class is a class box that envelopes the position of the class in the pattern space. So, three kinds of classification are possible: (i) an object is assigned to a category if it is situated inside the boundaries of only one class box, (ii) an object can be inside the boundaries (overlapping region) of more than one class box, or (iii) an object is considered to be an outlier for that class if it falls outside the class box. These are the features to be covered by methods designed for the so-called second level of Pattern Recognition: the first level plus the possibility of outliers and multicategory objects. Thus, typical class modelling techniques are SIMCA and UNEQ, as well as some modified kinds of ANN such as KCM. The CAIMAN method is developed in different options: D-CAIMAN is a discriminating classification method and M-CAIMAN is a class modelling one.

3. *Deterministic/Probabilistic techniques*: A deterministic method classifies an object in one and only one of the training classes and the degree of reliability of this decision is not measured. Probabilistic methods provide an estimate of the reliability of the classification decision. KNN, MLP, SVM and CAIMAN are deterministic. Other techniques, including some kinds of ANN, are probabilistic (e.g., PNN, where a Bayesian decision is implemented).

4. *Linear/Non-Linear separation boundaries*: Here our attention is focused on the mathematical form of the decision boundary. Typical non-linear classification techniques are based on ANN and SVM, specially devoted to classification problems of a non-linear nature; a minimal kernel illustration is sketched after this list. It is remarkable that the CAIMAN method seems not to suffer from non-linear class separability problems.



#### **7. Validation of the decision rule**

A very important issue is improper model validation. This pitfall appears even in very simple cases, such as the fitting of a series of data points by using a polynomial function. If we use a parsimonious fit, where the number of points is higher than the number of polynomial coefficients, the fit learns the generalities of the data set. An overparametrized fit, where the number of points becomes equal to the number of polynomial coefficients, learns idiosyncrasies and leads to overtraining or overfitting. Thus, a complex fitting function may fit the noise, not just the signal. Overfitting is a Damocles' sword that hangs over any attempt to model the classification rule. We are interested in an intermediate behaviour: a model which is powerful enough to represent the underlying structure of the data (generalities), but not so powerful that it faithfully models the noise (idiosyncrasies) associated with the data. This balance is known as the bias-variance tradeoff. The bias-variance tradeoff is most likely to become a problem when we have relatively few data points. In the opposite case, there is no danger of overfitting, as the noise associated with any single data point plays an immaterial role in the overall fit.

If we transfer the problem of fitting a polynomial to data into the use of other functions, such as the discriminant functions of canonical variates issued from LDA, the number of discriminant functions will be p (the number of features) or Q-1 (Q is the number of classes), whichever is smaller. As a rule of thumb (Defernez & Kemsley, 1997), the onset of overfitting should be strongly suspected when the dimensionality d > (n − Q)/3. One of the simplest and most widely used means of preventing overfitting is to split the known data set into two sets: the training set and the validation, evaluation, prediction or test set.

Commonly, the known set is randomly divided into the training and validation sets, containing about P% and (100-P)% of the samples of every class. Typical values are 75-25% or even 50-50% for the training and validation sets. The classification performance is computed on average; thus, the random generation of training and validation sets is repeated a number of times, 10 times for instance. Once the classification rule is developed, some workers consider as validation parameters the recalling efficiency (rate of training samples correctly classified by the rule) and, especially, the prediction ability (rate of evaluation samples correctly classified by the rule).

An alternative to the generation of training and validation sets is cross-validation and the bootstrapping method (Efron and Gong, 1983). In so-called k-fold cross validation, the known set is split into k subsets of approximately equal size. Then the training is performed k times, each time leaving out one of the k subsets and using only the omitted subset to predict its class membership. From all predictions, the percentage of hits gives an averaged predictive ability. A very common and simple case of cross-validation is the leave-one-out method: at any given time, only one pattern is considered and tested and the remaining patterns form the training set. Training and prediction are repeated until each pattern has been treated as a test once. This latter procedure is easily confused with jackknifing because both techniques involve omitting each pattern in turn, but cross-validation is used just for validation purposes whereas the jackknife is applied in order to estimate the bias of a statistic.
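A plain k-fold cross-validation loop, written out with NumPy only, may help fix the idea; the 1-nearest-neighbour classifier is included merely to make the sketch self-contained, and every name here is our own (leave-one-out corresponds to k = n).

```python
import numpy as np

def one_nn(X_train, y_train, X_test):
    """Minimal 1-nearest-neighbour classifier used only for illustration."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d, axis=1)]

def k_fold_accuracy(X, y, classify=one_nn, k=5, seed=0):
    """Split the known set into k subsets, train on k-1 of them, predict the omitted one,
    and average the percentage of hits over the k rounds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    hits = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        hits += (classify(X[train], y[train], X[test]) == y[test]).sum()
    return hits / len(X)
```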


In bootstrapping, we repeatedly analyze subsamples, instead of subsets of the known set. Each subsample is a random sample with replacement from the full sample (known set). Bootstrapping seems to perform better than cross-validation in many instances (Efron, 1983).

However, the performance rates obtained for validating the decision rule could be misleading because they do not consider the number of false positives and false negatives for each class. These two concepts provide a deeper knowledge of the classes' space. Accordingly, it seems more advisable to use the terms sensitivity (*SENS*) and specificity (*SPEC*) (González-Arjona et al., 2006) for validating the decision rule. The *SENS* of a class corresponds to the rate of evaluation objects belonging to the class that are correctly classified, and the *SPEC* of a class corresponds to the rate of evaluation objects not belonging to the class that are correctly considered as belonging to the other classes. This may be explained in terms of the first and second kinds of risk associated with prediction. The first kind of error (*α*) corresponds to the probability of erroneously rejecting a member of the class as a non-member (rate of false negatives, FN). The second kind of error (*β*) corresponds to the probability of erroneously classifying a non-member of the class as a member (rate of false positives, FP). Accordingly, for a given class A, and setting $n\_A$ as the number of members of class A, $\overline{n}\_A$ as the number of non-members of class A, $\{n\_A\}$ as the number of members of class A correctly classified as "belonging to class A" and $\{\overline{n}\_A\}$ as the number of non-members of class A classified as "not belonging to class A", we have (Yang et al., 2005):

$$\begin{aligned} TP &= \{ n\_A \} & FP &= \overline{n}\_A - \{ \overline{n}\_A \} \\ TN &= \{ \overline{n}\_A \} & FN &= n\_A - \{ n\_A \} \end{aligned} \tag{12}$$

TP and TN being the number of True Positive and True Negative members of the considered class. Accordingly,

$$\begin{aligned} SENS &= \frac{\{n\_A\}}{n\_A} = 1 - \alpha = 1 - \frac{\text{FN}}{n\_A} = \frac{\text{TP}}{\text{TP} + \text{FN}}\\ SPEC &= \frac{\{\overline{n}\_A\}}{\overline{n}\_A} = 1 - \beta = 1 - \frac{\text{FP}}{\overline{n}\_A} = \frac{\text{TN}}{\text{TN} + \text{FP}} \end{aligned} \tag{13}$$

It is clear that values close to unity for both parameters indicate a successful validation performance.

With these parameters it can be built the called *confusion matrix* for class A:

$$\prescript{C}{}{M}\_A = \begin{bmatrix} T\mathcal{N} & FP \\ FN & TP \end{bmatrix} \tag{14}$$
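Equations (12)-(14) are easy to compute once the predictions for the evaluation set are available. The helper below (our own naming, not a library call) returns SENS, SPEC and the confusion matrix of one class:

```python
import numpy as np

def class_validation(y_true, y_pred, target):
    """Sensitivity, specificity and the 2 x 2 confusion matrix of eqs. (12)-(14) for one class."""
    member = (y_true == target)
    assigned = (y_pred == target)
    TP = np.sum(member & assigned)         # {n_A}
    FN = np.sum(member & ~assigned)        # n_A - {n_A}
    TN = np.sum(~member & ~assigned)       # {n_A-bar}
    FP = np.sum(~member & assigned)        # n_A-bar - {n_A-bar}
    sens = TP / (TP + FN)                  # eq. (13), 1 - alpha
    spec = TN / (TN + FP)                  # eq. (13), 1 - beta
    cm = np.array([[TN, FP],
                   [FN, TP]])              # eq. (14)
    return sens, spec, cm
```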

As has been outlined, the common validation procedure consists of dividing the known set into two subsets, namely the training and the validation set. However, the validation procedure has to be considered with more caution in the case of some kinds of ANN, such as MLP, because they suffer particular overfitting damage. The MLP consists of formal neurons and connections (weights) between them. As is well known, the neurons in an MLP are commonly arranged in three layers: an input layer, one hidden layer (sometimes plus a bias neuron)


and an output layer. The number of hidden nodes in an MLP indicates the complexity of the relationship in a way very similar to the fitting of a polynomial to a data set. Too many connections carry the risk of the network specializing in the training noise and giving poor prediction ability. Accordingly, a first action should be to minimize the number of neurons of the hidden layer. Some authors (Andrea & Kalayeh, 1991) have proposed the parameter ρ, which plays a major role in determining the best architecture:

$$\rho = \frac{\text{Number of data points in the training set}}{\text{Sum of the number of connections in the network}} \tag{15}$$

In order to avoid overfitting it is recommended that 1 ≤ ρ ≤ 2.2.

Besides, the overfitting problem can be minimized by monitoring the performance of the network during training by using an extra verification set different from the training set. This verification set is needed in order to stop the training process before the ANN learns idiosyncrasies present in the training data that lead to overfitting (González, 2007).
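A back-of-the-envelope helper for the ρ criterion of eq. (15) is given below. The way the connections are counted (fully connected layers plus optional bias weights) is our own bookkeeping and should be adapted to the exact network architecture in use.

```python
def mlp_rho(n_train, n_inputs, n_hidden, n_outputs, bias=True):
    """Ratio of eq. (15): training points over adjustable connections of a one-hidden-layer MLP."""
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    if bias:
        connections += n_hidden + n_outputs   # one bias weight per hidden and output neuron
    return n_train / connections

# Example: 100 training patterns, 8 features, 5 hidden neurons, 3 output classes
print(mlp_rho(100, 8, 5, 3))   # about 1.6, inside the recommended window
```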

#### **8. Concluding remarks**

The selection of the supervised learning technique depends on the nature of the particular problem. If we have a data set composed of a given number of classes only, and the rule is going to be used on test samples that we know may belong only to one of the previously established classes, then we can select a discriminating technique such as LDA, PLS-DA, SVM or some kind of discriminating ANN (MLP or PNN). Otherwise, class modelling techniques such as SIMCA, UNEQ or KCM are useful. Class modelling tools offer at least two main advantages: to identify samples which do not fall in any of the examined categories (and therefore can be either simply outlying observations or members of a new class not considered in the known set) and to take into account samples that can simultaneously belong to more than one class (multiclass patterns).

If the idiosyncrasy of the problem suggests that the boundaries could be of non-linear nature, then the use of SVM or ANN is the best choice.

In cases where the number of features is higher than the number of samples (p > n), a previous or simultaneous step dealing with feature selection is needed when non-PCA-based techniques are used (KNN, LDA, ANN, UNEQ). PCA-based methods such as SIMCA and PLS-DA can be applied without the need for feature selection. This characteristic is very interesting beyond compositional analysis, when samples are characterized by a spectrum, as in spectrometric methods (FT-IR, FT-Raman, NMR...). A different behaviour of these two methods with respect to the number of FP and FN has been noticed (Dahlberg et al., 1997). SIMCA is focused on class specificities, and hence it detects strangers with high accuracy (only when the model set does not contain outliers; otherwise, a robust SIMCA model can be used), but it sometimes fails to recognize its own members if the class is not homogeneous enough or the training set is not large enough. PLS-DA, on the contrary, deals with an implicitly closed universe (since the Y variables have a constant sum), so that it ignores the possibility of strangers. However, this has the advantage of making the method more robust to class inhomogeneities, since what matters most are the class differences.

In compositional data, as pointed out by Berrueta et al. (2007), the main problem is class overlap, but with a suitable feature selection and adequate sample size, good classification performances can be achieved. In general, non-linear methods such as ANN or SVM are rarely needed and most classification problems can be solved using linear techniques (LDA, CVA, PLS-DA).

Sometimes, several different types of techniques can be applied to the same data set. Classification methods are numerous, and the main problem is then to select the most suitable one, especially when dealing with quantitative criteria like prediction ability or misclassification percentage. In order to carry out the comparison adequately, McNemar's test is a good choice (Roggo et al., 2003). Two classification procedures A and B are trained and the same validation set is used. The null hypothesis is that both techniques lead to the same misclassification rate. McNemar's test is based on a χ<sup>2</sup> test with one degree of freedom if the number of samples is higher than 20. The way to obtain McNemar's statistic is as follows:

$$\text{McNemar's value} = \frac{\left(\left|n\_{01} - n\_{10}\right| - 1\right)^2}{n\_{01} + n\_{10}} \tag{16}$$

with


*n*<sub>00</sub>: number of samples misclassified by both methods A and B
*n*<sub>01</sub>: number of samples misclassified by method A but not by B
*n*<sub>10</sub>: number of samples misclassified by method B but not by A
*n*<sub>11</sub>: number of samples misclassified by neither method A nor B
*n*<sub>val</sub> = *n*<sub>00</sub> + *n*<sub>01</sub> + *n*<sub>10</sub> + *n*<sub>11</sub>: number of patterns in the validation set

The critical value for a 5% significance level is 3.84. In order to get insight about this procedure, the paper of Roggo et al (2006) is very promising.
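Given the predictions of two classifiers A and B on the same validation set, the statistic of eq. (16) takes only a few lines (the function below is our own sketch, not a statistics-package routine):

```python
import numpy as np

def mcnemar_value(y_true, pred_a, pred_b):
    """McNemar's statistic of eq. (16); compare the result with 3.84 (5% level, 1 d.o.f.)."""
    wrong_a = (pred_a != y_true)
    wrong_b = (pred_b != y_true)
    n01 = np.sum(wrong_a & ~wrong_b)    # misclassified by A but not by B
    n10 = np.sum(~wrong_a & wrong_b)    # misclassified by B but not by A
    if n01 + n10 == 0:
        return 0.0                      # the two decision rules disagree on no sample
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```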

Finally, a last consideration about problems with the representativeness of the data set. As has been claimed in a published report, LDA was applied to differentiate 12 classes of oils on the basis of chromatographic data, where some classes contained only two or three members (and, besides, the model was not validated). There is no need to be an expert chemometrician to be aware that two or three samples are insufficient to draw any relevant conclusion about the class to which they belong. There are more sources of possible data variance than the number of samples used to estimate the class variability (Daszykowski & Walczak, 2006). The requirement of a sufficient number of samples for every class could be addressed by using a class modelling technique to extract the class dimensionality and considering, for instance, a number of members within three to ten times this dimensionality.

Aside from this representativity context, it should be pointed out that when the aim is to classify food products or to build a classification rule to check the authentic origin of samples, they have to be collected very carefully according to a well established sampling plan. Often not enough care is taken about this, and thus it is hardly possible to obtain accurate classification models.

#### **9. References**

Abe, S. (2005). Support vector machines for pattern classification. Springer, ISBN:1852339299, London, UK
Aitchison, J. (2003). The statistical analysis of compositional data. The Blackburn Press, ISBN:1930665784, London, UK
Albano, C.; Dunn III, W.; Edlund, U.; Johansson, E.; Norden, B.; Sjöström, M. & Wold, S. (1978). Four levels of Pattern Recognition. Analytica Chimica Acta. Vol. 103, pp. 429-443. ISSN:0003-2670
Andrea, T.A. & Kalayeh, H. (1991). Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors. Journal of Medicinal Chemistry. Vol. 34, pp. 2824-2836. ISSN:0022-2623
Aparicio, R. & Aparicio-Ruíz, R. (2002). Chemometrics as an aid in authentication, In: Oils and Fats Authentication, M. Jee (Ed.), 156-180, Blackwell Publishing and CRC Press, ISBN:1841273309, Oxford, UK and FL, USA
Bishop, C.M. (2000). Neural Networks for Pattern Recognition, Oxford University Press, ISBN:0198538642, NY, USA
Berrueta, L.A.; Alonso-Salces, R.M. & Héberger, K. (2007). Supervised pattern recognition in food analysis. Journal of Chromatography A. Vol. 1158, pp. 196-214. ISSN:0021-9673
Burges, C.J.C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. Vol. 2, pp. 121-167. ISSN:1384-5810
Chung, C.F. (1993). Estimation of covariance matrix from geochemical data with observations below detection limits. Mathematical Geology. Vol. 25, pp. 851-865. ISSN:1573-8868
Clarke, J.U. (1998). Evaluation of censored data methods to allow statistical comparisons among very small samples with below detection limits observations. Environmental Science & Technology. Vol. 32, pp. 177-183. ISSN:1520-5851
Cole, R.A. & Phelps, K. (1979). Use of canonical variate analysis in the differentiation of swede cultivars by gas-liquid chromatography of volatile hydrolysis products. Journal of the Science of Food and Agriculture. Vol. 30, pp. 669-676. ISSN:1097-0010
Coomans, D.; Broeckaert, I.; Fonckheer, M.; Massart, D.L. & Blocks, P. (1978). The application of linear discriminant analysis in the diagnosis of thyroid diseases. Analytica Chimica Acta. Vol. 103, pp. 409-415. ISSN:0003-2670
Coomans, D.; Massart, D.L. & Kaufman, L. (1979). Optimization by statistical linear discriminant analysis in analytical chemistry. Analytica Chimica Acta. Vol. 112, pp. 97-122. ISSN:0003-2670
Dahlberg, D.B.; Lee, S.M.; Wenger, S.J. & Vargo, J.A. (1997). Classification of vegetable oils by FT-IR. Applied Spectroscopy. Vol. 51, pp. 1118-1124. ISSN:0003-7028
Daszykowski, M. & Walczak, B. (2006). Use and abuse of chemometrics in chromatography. Trends in Analytical Chemistry. Vol. 25, pp. 1081-1096. ISSN:0165-9936
Daszykowski, M.; Kaczmarek, K.; Stanimirova, I.; Vander Heyden, Y. & Walczak, B. (2007). Robust SIMCA-bounding influence of outliers. Chemometrics and Intelligent Laboratory Systems. Vol. 87, pp. 95-103. ISSN:0169-7439
Dean, R.B. & Dixon, W.J. (1951). Simplified statistics for small number of observations. Analytical Chemistry. Vol. 23, pp. 636-638. ISSN:0003-2700
Defernez, M. & Kemsley, E.K. (1997). The use and misuse of chemometrics for treating classification problems. Trends in Analytical Chemistry. Vol. 16, pp. 216-221. ISSN:0165-9936
Derde, M.P.; Coomans, D. & Massart, D.L. (1982). Effect of scaling on class modelling with the SIMCA method. Analytica Chimica Acta. Vol. 141, pp. 187-192. ISSN:0003-2670
Derde, M.P. & Massart, D.L. (1986). UNEQ: A class modelling supervised pattern recognition technique. Microchimica Acta. Vol. 2, pp. 139-152. ISSN:0026-3672
Duda, R.O.; Hart, P.E. & Stork, D.G. (2000). Pattern classification. 2nd edition. Wiley, ISBN:0471056693, NY, USA
Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross validation. Journal of the American Statistical Association. Vol. 78, pp. 316-331. ISSN:0162-1459
Efron, B. & Gong, G. (1983). A leisurely look at the bootstrap, the jacknife and cross validation. The American Statistician. Vol. 37, pp. 36-48. ISSN:0003-1305
Egan, W.J. & Morgan, S.L. (1998). Outlier detection in multivariate analytical chemical data. Analytical Chemistry. Vol. 70, pp. 2372-2379. ISSN:0003-2700
Egozcue, J.J.; Pawlowsky-Glahn, V.; Mateu-Figueros, G. & Barcelo-Vidal, C. (2003). Isometric logratio transformation for compositional data analysis. Mathematical Geology. Vol. 35, pp. 279-300. ISSN:1573-8868
Gemperline, P.J. & Boyer, N.R. (1995). Classification of near-infrared spectra using wavelength distances: Comparisons to the Mahalanobis distance and Residual Variance methods. Analytical Chemistry. Vol. 67, pp. 160-166. ISSN:0003-2700
González, A.G. (2007). Use and misuse of supervised pattern recognition methods for interpreting compositional data. Journal of Chromatography A. Vol. 1158, pp. 215-225. ISSN:0021-9673
González, A.G. & Herrador, M.A. (2007). A practical guide to analytical method validation, including measurement uncertainty and accuracy profiles. Trends in Analytical Chemistry. Vol. 26, pp. 227-237. ISSN:0165-9936
González-Arjona, D.; López-Pérez, G. & González, A.G. (1999). Performing Procrustes discriminant analysis with HOLMES. Talanta. Vol. 49, pp. 189-197. ISSN:0039-9140
González-Arjona, D.; López-Pérez, G. & González, A.G. (2001). Holmes, a program for performing Procrustes Transformations. Chemometrics and Intelligent Laboratory Systems. Vol. 57, pp. 133-137. ISSN:0169-7439
González-Arjona, D.; López-Pérez, G. & González, A.G. (2006). Supervised pattern recognition procedures for discrimination of whiskeys from Gas chromatography/Mass spectrometry congener analysis. Journal of Agricultural and Food Chemistry. Vol. 54, pp. 1982-1989. ISSN:0021-8561
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics. Vol. 11, pp. 1-21. ISSN:0040-1706
Helsel, D.R. (1990). Less than obvious: Statistical treatment of data below the detection limit. Environmental Science & Technology. Vol. 24, pp. 1766-1774. ISSN:1520-5851
Hoaglin, D.C. & Welsch, R.E. (1978). The hat matrix in regression and ANOVA. The American Statistician. Vol. 32, pp. 17-22. ISSN:0003-1305
Holger, R.M.; Dandy, G.C. & Burch, M.D. (1998). Use of artificial neural networks for modelling cyanobacteria Anabaena spp. in the river Murray, South Australia. Ecological Modelling. Vol. 105, pp. 257-272. ISSN:0304-3800
Jollife, I.T. (2002). Principal Component Analysis. 2nd edition, Springer, ISBN:0387954422, NY, USA
Johnson, G.W. & Ehrlich, R. (2002). State of the Art report on multivariate chemometric methods in Environmental Forensics. Environmental Forensics. Vol. 3, pp. 59-79. ISSN:1527-5930
Kryger, L. (1981). Interpretation of analytical chemical information by pattern recognition methods-a survey. Talanta. Vol. 28, pp. 871-887. ISSN:0039-9140
Kowalski, B.R. & Bender, C.F. (1972). Pattern recognition. A powerful approach to interpreting chemical data. Journal of the American Chemical Society. Vol. 94, pp. 5632-5639. ISSN:0002-7863
Kvalheim, O.M. & Karstang, T.V. (1992). SIMCA-Classification by means of disjoint cross validated principal component models, In: Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, R.G. Brereton (Ed.), 209-245, Elsevier, ISBN:0444897844, Amsterdam, Netherlands
Leardi, R. & Lupiañez, A. (1998). Genetic algorithms applied to feature selection in PLS regression: how and when to use them. Chemometrics and Intelligent Laboratory Systems. Vol. 41, pp. 195-207. ISSN:0169-7439
Malinowski, E.R. (2002). Factor Analysis in Chemistry. Wiley, ISBN:0471134791, NY, USA
Marini, F.; Zupan, J. & Magrí, A.L. (2005). Class modelling using Kohonen artificial neural networks. Analytica Chimica Acta. Vol. 544, pp. 306-314. ISSN:0003-2670
Martín-Fernández, J.A.; Barceló-Vidal, C. & Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Mathematical Geology. Vol. 35, pp. 253-278. ISSN:1573-8868
McCabe, G.P. (1975). Computations for variable selection in discriminant analysis. Technometrics. Vol. 17, pp. 103-109. ISSN:0040-1706
Melssen, W.; Wehrens, R. & Buydens, L. (2006). Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems. Vol. 83, pp. 99-113. ISSN:0169-7439
Mertens, B.; Thompson, M. & Fearn, T. (1994). Principal component outlier detection and SIMCA: a synthesis. Analyst. Vol. 119, pp. 2777-2784. ISSN:0003-2654
Miller, J.N. & Miller, J.C. (2005). Statistics and Chemometrics for Analytical Chemistry. 4th edition. Prentice-Hall, Pearson. ISBN:0131291920. Harlow, UK
Naes, T.; Isaksson, T.; Fearn, T. & Davies, T. (2004). A user-friendly guide to multivariate calibration and classification. NIR Publications, ISBN:0952866625, Chichester, UK
Pardo, M. & Sberveglieri, G. (2005). Classification of electronic nose data with support vector machines. Sensors and Actuators. Vol. 107, pp. 730-737. ISSN:0925-4005
Pretsch, E. & Wilkins, C.L. (2006). Use and abuse of Chemometrics. Trends in Analytical Chemistry. Vol. 25, p. 1045. ISSN:0165-9936
Rock, B.A. (1985). An introduction to Chemometrics, 130th Meeting of the ACS Rubber Division. October 1985. Available from http://home.neo.rr.com/catbar/chemo/int\_chem.html
Roggo, Y.; Duponchel, L. & Huvenne, J.P. (2003). Comparison of supervised pattern recognition methods with McNemar's statistical test: Application to qualitative analysis of sugar beet by near-infrared spectroscopy. Analytica Chimica Acta. Vol. 477, pp. 187-200. ISSN:0003-2670
Sharaf, M.A.; Illman, D.A. & Kowalski, B.R. (1986). Chemometrics. Wiley, ISBN:0471831069, NY, USA
Silverman, B.W. & Jones, M.C. (1989). E. Fix and J.L. Hodges (1951): An important contribution to non parametric discriminant analysis and density estimation. International Statistical Review. Vol. 57, pp. 233-247. ISSN:0306-7734
So, S.S. & Richards, W.G. (1992). Application of Neural Networks: Quantitative structure-activity relationships of the derivatives of 2,4-diamino-5-(substituted-benzyl) pyrimidines as DHFR inhibitors. Journal of Medicinal Chemistry. Vol. 35, pp. 3201-3207. ISSN:0022-2623
Stahle, L. & Wold, S. (1987). Partial least squares analysis with cross validation for the two class problem: A Monte-Carlo study. Journal of Chemometrics. Vol. 1, pp. 185-196. ISSN:1099-128X
Stanimirova, I.; Daszykowski, M. & Walczak, B. (2007). Dealing with missing values and outliers in principal component analysis. Talanta. Vol. 72, pp. 172-178. ISSN:0039-9140
Streit, R.L. & Luginbuhl, T.E. (1994). Maximum likelihood training of probabilistic neural networks. IEEE Transactions on Neural Networks. Vol. 5, pp. 764-783. ISSN:1045-9227
Todeschini, R.; Ballabio, D.; Consonni, V.; Mauri, A. & Pavan, M. (2007). CAIMAN (Classification And Influence Matrix Analysis): A new approach to the classification based on leverage-scale functions. Chemometrics and Intelligent Laboratory Systems. Vol. 87, pp. 3-17. ISSN:0169-7439
Vapnik, V.N. (1998). Statistical learning theory. Wiley, ISBN:0471030031, NY, USA
Varmuza, K. (1980). Pattern recognition in chemistry. Springer, ISBN:0387102736, Berlin, Germany
Varmuza, K. & Filzmoser, P. (2009). Introduction to multivariate statistical analysis in chemometrics, CRC Press, Taylor & Francis Group, ISBN:14005975, Boca Ratón, FL, USA
Walczak, B. & Massart, D.L. (2001). Dealing with missing data: Part I and Part II. Chemometrics and Intelligent Laboratory Systems. Vol. 58, pp. 15-27 and pp. 29-42. ISSN:0169-7439
Wold, S. (1976). Pattern recognition by means of disjoint principal component models. Pattern Recognition. Vol. 8, pp. 127-139. ISSN:0031-3203
Wold, S. (1978). Cross validatory estimation of the number of components in factor and principal components models. Technometrics. Vol. 20, pp. 397-405. ISSN:0040-1706


**3** 

**Analysis of Chemical Processes, Determination** 

This chapter is intended to demonstrate some recent approaches to the quantitative determination of chemical processes based on the quantitative analysis of experimental spectrophotometric measurements. In this chapter we will discuss kinetic processes, equilibrium processes and also processes that include a combination of kinetic and

We also emphasise the advantage of 'global' multivariate (multiwavelength) data analysis which has the advantage of allowing the robust determination of more complex mechanisms than single wavelength analysis and also has the benefit of yielding the spectra

Rather than dwell on the mathematical derivation of the complex numerical algorithms and a repetition of the fundamentals of non-linear regression methods and least squares fitting which are available from a wide variety of sources (Martell and Motekaitis 1988; Polster and Lachmann 1989; Gans 1992; Press, Vetterling et al. 1995; Maeder and Neuhold 2007), we aim to show the experimentalist how to obtain the results they are interested, using purpose designed global analysis software and a variety of worked examples. We will be using ReactLab, a suite of versatile and powerful reaction modelling and analysis tools developed by the authors. Other academic and commercial applications exist for multivariate and related types of analysis and the reader is encouraged to explore these for comparative

Any spectroscopic technique is ideal for the analysis of chemical processes as there is no interference in the underlying chemistry by the measurement technique. This is in sharp contrast to say chromatographic analysis or other separation methods which are totally unsuitable for the analysis of dynamic equilibrium systems. Such methods are also of very limited use for kinetic studies which often are too fast on the chromatographic time scale of

purposes. All offer different features and benefits but will not be discussed here.

**2. Spectrophotometry, the ideal technique for process analysis** 

**1. Introduction** 

equilibrium steps.

of all the participating species.

**of the Reaction Mechanism and Fitting** 

**of Equilibrium and Rate Constants** 

*Department of Chemistry, University of Newcastle, Australia* 

Marcel Maeder and Peter King

*Jplus Consulting Ltd, Perth,* 

*Australia* 


## **Analysis of Chemical Processes, Determination of the Reaction Mechanism and Fitting of Equilibrium and Rate Constants**

Marcel Maeder and Peter King

*Department of Chemistry, University of Newcastle, Australia*
*Jplus Consulting Ltd, Perth, Australia*

#### **1. Introduction**


This chapter is intended to demonstrate some recent approaches to the quantitative determination of chemical processes based on the quantitative analysis of experimental spectrophotometric measurements. In this chapter we will discuss kinetic processes, equilibrium processes and also processes that include a combination of kinetic and equilibrium steps.

We also emphasise the advantage of 'global' multivariate (multiwavelength) data analysis which has the advantage of allowing the robust determination of more complex mechanisms than single wavelength analysis and also has the benefit of yielding the spectra of all the participating species.

Rather than dwell on the mathematical derivation of the complex numerical algorithms and a repetition of the fundamentals of non-linear regression methods and least squares fitting which are available from a wide variety of sources (Martell and Motekaitis 1988; Polster and Lachmann 1989; Gans 1992; Press, Vetterling et al. 1995; Maeder and Neuhold 2007), we aim to show the experimentalist how to obtain the results they are interested in, using purpose-designed global analysis software and a variety of worked examples. We will be using ReactLab, a suite of versatile and powerful reaction modelling and analysis tools developed by the authors. Other academic and commercial applications exist for multivariate and related types of analysis and the reader is encouraged to explore these for comparative purposes. All offer different features and benefits but will not be discussed here.

#### **2. Spectrophotometry, the ideal technique for process analysis**

Any spectroscopic technique is ideal for the analysis of chemical processes as there is no interference in the underlying chemistry by the measurement technique. This is in sharp contrast to say chromatographic analysis or other separation methods which are totally unsuitable for the analysis of dynamic equilibrium systems. Such methods are also of very limited use for kinetic studies which often are too fast on the chromatographic time scale of typically tens of minutes to hours (except where reactions are first quenched and the intermediates stabilised). In contrast most forms of spectroscopy provide a completely noninvasive snapshot of a sample's composition at a single instant.

Amongst the different spectroscopies routinely available to the chemist, light absorption spectrophotometry in the UV-Visible (UV/Vis) is most common for several reasons: instruments are relatively inexpensive and accurate, they provide stable referenced signals as they are usually split or double beam instruments, there is a simple relationship between concentration and the measured absorbance signal (Beer-Lambert's law) and many compounds absorb somewhere in the accessible wavelength region. As a consequence there is a considerable amount of software available for the analysis of spectrophotometric data. This is the case both for kinetic and equilibrium investigations. NMR spectroscopy is a powerful method for structural investigations but it is less commonly used for quantitative analytical purposes. A theoretically very powerful alternative to UV/Vis absorption spectroscopy is FT-IR spectroscopy. The richness of IR spectra is very attractive as there is much more information contained in an IR spectrum compared with a relatively structureless UV/Vis spectrum. The main disadvantage is the lack of long term stability as FT-IR instruments are single beam instruments. Other difficulties include solvent absorption and the lack of non-absorbing cell materials, particularly for aqueous solutions. However, attenuated total reflection or ATR is a promising novel measurement technique in the IR. Near-IR spectroscopy is very similar to UV/Vis spectroscopy and is covered by the present discussions.

Of course fluorescence detection is a very sensitive and important tool particularly in kinetics studies and can yield important mechanistic information where intermediates do not possess chromophores and are therefore colourless or are studied at very low concentrations. In the main, fluorescence studies are carried out at a single emission wavelength or adopting the total fluorescence approach (using cut-off filters), so there is no wavelength discrimination in the data. Whilst this type of measurement can be analysed by the methods described below, it is essentially equivalent to analysing single wavelength absorption data. We will in the following discussion concentrate on the general case of processing multiwavelength measurements.

#### **3. The experiment, structure of the data**

For kinetic investigations the absorption of the reacting solution is measured as a function of reaction time; for equilibrium investigations the absorption is recorded as a function of the reagent addition or another independent variable such as pH. Absorption readings can of course be taken at a single wavelength but with modern instrumentation it is routine and advantageous to record complete spectra vs. time or reagent addition. This is particularly prevalent with the use of photodiode array (PDA) based spectrophotometers and online detectors.

In the case of kinetics, depending on the rate of a chemical reaction the mixing of the reagents that undergo the reaction has to be done fast using a stopped-flow instrument or it can be done manually for slower reactions in the cuvette of a standard UV-Vis spectrometer with suitably triggered spectral data acquisition. A series of spectra are collected at time intervals following the mixing event to cover the reaction time of interest. The measured spectra change as the reaction proceeds from reagents to products (Wilkins 1991; Espenson 1995).


For equilibrium investigations the spectra of a series of pre-mixed and equilibrated solutions have to be recorded (Martell and Motekaitis 1988; Polster and Lachmann 1989). This is most commonly done as a titration where small amounts of a reagent are added stepwise to the solution under investigation. Titrations can be done in the cuvette, requiring internal stirring after each addition, prior to the absorption measurement, or the solutions can be mixed externally with transfer of the equilibrated solutions into the cuvette performed manually or using a flow cell and automatic pumping. In an alternative configuration optical probes can be coupled to the optical path in some spectrometers and placed into the solution contained in an external titration vessel (Norman and Maeder 2006). Often the pHs of the equilibrated titration solutions are recorded together with the absorption spectra where protonation equilibria are a feature of the mechanism.

For both kinetic and equilibrium investigations the measurement data can be arranged in a data matrix **D** which contains row-wise the recorded spectra as a function of time or reagent addition. The number of columns of **D** is the number of wavelengths, *nlam*, over which the spectra are taken. For single wavelength data the matrix reduces to a single column (vector). The number of rows, *nspectra*, corresponds to the number of spectra recorded during the process (one at each time interval for kinetics or reagent addition for an equilibrium titration). The dimensions of **D** thus are *nspectra*×*nlam*. For spectra taken on a mechanical scanning instrument, the number of wavelengths can be 1 to typically some 10 or 20 but for diode array instruments it can easily be in excess of 1000 depending on the solid state detector pixel resolution (typically these provide a resolution progression of 256, 512 and 1024 pixels). The number of spectra taken is typically much larger on a stopped-flow instrument equipped with a fast diode array detector with a typical minimum spectrum acquisition time of the order of a millisecond. Frequently a logarithmic time base is an option which enables both fast and slower events to be resolved in a single kinetic experiment. A graphical representation of a data matrix **D** is given in Figure 1.

Fig. 1. Graphical representation of a data matrix **D**, the spectra are arranged as the rows.

For both kinetic and equilibrium investigations, we obtain a series of spectra each of which represents the solution at one particular moment during the process. The spectra are taken as a function of time or reagent addition.
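As a concrete illustration of this arrangement, the short Python sketch below (the dimensions and values are invented for illustration and are not taken from any of the chapter's data sets) stacks a set of recorded spectra row-wise into a data matrix **D** of size *nspectra*×*nlam*:

```python
import numpy as np

# Illustrative dimensions: 50 spectra recorded during a process,
# each measured at 256 wavelengths (e.g. a diode array detector).
nspectra, nlam = 50, 256

# Each recorded spectrum is a row vector of length nlam;
# here they are simply simulated as random absorbance readings.
spectra = [np.random.rand(nlam) for _ in range(nspectra)]

# Arrange the spectra row-wise into the data matrix D (nspectra x nlam).
D = np.vstack(spectra)
print(D.shape)            # (50, 256)

# Single wavelength data is the special case nlam = 1,
# i.e. D reduces to a single column (vector).
D_single = D[:, [0]]
print(D_single.shape)     # (50, 1)
```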

#### **4. Information to be gained from the measurements**

The purpose of collecting this type of data is to determine the chemical reaction mechanism that describes the underlying process in terms of identifiable steps together with the associated key parameters; the rates and/or equilibrium constants which define the interconversions and stabilities of the various species. This may initially be a purely academic exercise to characterise a novel chemical reaction for publication purposes but ultimately defines the behaviour of the participating species for any future research into this or related chemistry as well as being the foundation for commercially important applications e.g. drug binding interactions in pharmaceutical development or reaction optimisation in industrial processes.

The objective is therefore to find the chemical model which best fits the data and validate and refine this model with subsequent experiments under other conditions. The clear benefit of multi-wavelength measurements is that the model must satisfy (fit) the data at all measurement wavelengths simultaneously and this significantly helps the accurate determination of multiple parameters and also allows determination of the individual spectra of the participating species.

#### **5. Beer-Lambert's law**

Before we can start the possible ways of extracting the useful parameters from the measured data set, the rate constants in the case of kinetics, the equilibrium constants in the case of equilibria, we need to further investigate the structure of the data matrix **D**. According to Beer-Lambert's law for multicomponent systems, the total absorption at any particular wavelength is the sum over all individual contributions of all absorbing species at this wavelength. It is best to write this as an equation:

$$D(i,j) = \sum_{k=1}^{ncomp} C(i,k) \times A(k,j) \tag{1}$$

where:

D(i,j): absorption of the i-th solution at wavelength j
C(i,k): concentration of the k-th component in the i-th solution
A(k,j): molar absorptivity of the k-th component at the j-th wavelength
*ncomp*: number of components in the system under investigation.

Thus equation (1) represents a system of *i*×*j* equations with many unknowns, i.e. all elements of **C** (*nspectra*×*ncomp*) and all elements of **A** (*ncomp*×*nlam*).

It is extremely useful to realise that the structure of Beer-Lambert's law allows the writing of Equation (1) in a very elegant matrix notation, Equation (2) and Figure 2:

$$\mathbf{D} = \mathbf{C} \times \mathbf{A} \tag{2}$$

Fig. 2. Beer-Lambert's law, Equation (1) in matrix notation.


The matrix **D**(*nspectra*×*nlam*) is the product of a matrix of concentrations **C**(*nspectra*×*ncomp*) and a matrix **A**(*ncomp*×*nlam*). **C** contains as columns the concentration profiles of the reacting components and the matrix **A** contains, as rows, their molar absorption spectra.
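The following minimal Python sketch, again with invented concentrations and spectra, verifies numerically that the matrix product of Equation (2) reproduces the element-wise sum of Equation (1):

```python
import numpy as np

nspectra, ncomp, nlam = 5, 2, 4   # tiny illustrative dimensions

# C: concentration profiles of the components (columns),
# A: their molar absorption spectra (rows). Values are arbitrary.
rng = np.random.default_rng(0)
C = rng.random((nspectra, ncomp))
A = rng.random((ncomp, nlam))

# Matrix form of Beer-Lambert's law, Equation (2): D = C x A
D = C @ A

# Element-wise form, Equation (1): D(i,j) = sum_k C(i,k) * A(k,j)
D_elementwise = np.zeros((nspectra, nlam))
for i in range(nspectra):
    for j in range(nlam):
        D_elementwise[i, j] = sum(C[i, k] * A[k, j] for k in range(ncomp))

print(np.allclose(D, D_elementwise))   # True: both forms agree
```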

Equations (1) and (2) and Figure 2 represent the ideal case of perfect absorption readings without any experimental noise. This of course is not realistic and both equations have to be augmented by an error term, R(i,j), which is the difference between the ideal value and its measured counterpart, equation (3) and Figure 3.

$$D(i,j) = \sum_{k=1}^{ncomp} C(i,k) \times A(k,j) + R(i,j) \qquad\text{or, in matrix notation,}\qquad \mathbf{D} = \mathbf{C} \times \mathbf{A} + \mathbf{R} \tag{3}$$

Fig. 3. Beer-Lambert's law including the residuals.

The goal of the fitting is to determine that set of matrices **C** and **A** for which the sum over all the squares of the residuals, *ssq*, is minimal,

$$ssq = \sum_{i=1}^{nspectra} \sum_{j=1}^{nlam} R(i,j)^2 \tag{4}$$

At first sight this looks like a very daunting task. However, as we will see, it is manageable.

Ideally the final square sum achieved should be numerically equal to the sum of the squares of the Gaussian noise in the measurement – usually instrumental in origin. At this point the fit cannot be further improved, though this is not a guarantee that the model is the correct one.
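Continuing the same kind of toy example (simulated data, not from the chapter), the residual matrix of equation (3) and the sum of squares of equation (4) are obtained as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
nspectra, ncomp, nlam = 5, 2, 4

C = rng.random((nspectra, ncomp))
A = rng.random((ncomp, nlam))

# Simulated measurement: ideal Beer-Lambert data plus Gaussian noise.
noise = 0.01 * rng.standard_normal((nspectra, nlam))
D = C @ A + noise

# Residual matrix, equation (3): R = D - C x A
R = D - C @ A

# Sum of squared residuals, equation (4).
ssq = np.sum(R ** 2)

# For the correct model, ssq should approach the sum of squares of the noise.
print(ssq, np.sum(noise ** 2))
```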

#### **6. The chemical model**

The first and central step of any data analysis is the computation of the matrix **C** of concentration profiles based on the proposed chemical model and the associated parameters such as, but not exclusively, the rate and or equilibrium constants. Initially these key parameters may be only rough estimates of the true values.

So far the explanations are valid for kinetic and equilibrium studies. The difference between these two types of investigation lies in the different computations required for the calculation of the concentration profiles in the matrix **C**.


#### **7. Kinetics**

The chemical model for a kinetic investigation is a set of reaction equations which describe the process under investigation. Consider as an example the basic enzymatic reaction scheme

$$\text{E} + \text{S} \xrightleftharpoons[k_{-1}]{k_{+1}} \text{ES} \xrightarrow{k_{2}} \text{E} + \text{P} \tag{5}$$

An enzyme *E* reacts rapidly and reversibly with the substrate *S* to form an enzyme substrate complex *ES*. This is followed by the 1st order chemical conversion of the substrate and release of product. The free enzyme is then available to undergo another catalytic cycle.

Before proceeding to a ReactLab based mechanistic analysis it is informative to briefly outline the classical approach to the quantitative analysis of this and similar basic enzyme mechanisms. The reader is referred to the many kinetics textbooks available for a more detailed description of these methods. The scheme in equation (5) was proposed by Michaelis and Menten in 1913 to aid in the interpretation of kinetic behaviour of enzyme-substrate reactions (Menten and Michaelis 1913). This model of the catalytic process was the basis for an analysis of measured initial rates (*v*) as a function of initial substrate concentration in order to determine the constants KM (the Michaelis constant) and Vmax that characterise the reaction. At low [S], *v* increases linearly, but as [S] increases the rise in *v* slows and ultimately reaches a limiting value Vmax.

Analysis was based on the derived Michaelis Menten formula:

$$v = \frac{[\text{E}]_0 \, [\text{S}] \, k_{cat}}{K_M + [\text{S}]} \tag{6}$$

where Vmax = kcat[E]0, and KM is equal to the substrate concentration at which *v* = ½ Vmax. The key to this derivation is that the enzyme substrate complex *ES* is in dynamic equilibrium with free *E* and *S* and the catalytic step proceeds with a first order rate constant kcat. This 'turnover number' kcat is represented by k2 in the scheme in equation (5).
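Equation (6) is easily evaluated numerically; the small sketch below, with purely illustrative parameter values that are not taken from the chapter, shows the characteristic saturation of *v* towards Vmax = kcat[E]0 as [S] increases:

```python
def michaelis_menten_rate(S, E0, kcat, KM):
    """Initial rate v according to equation (6)."""
    return E0 * S * kcat / (KM + S)

# Illustrative parameters (hypothetical values).
E0, kcat, KM = 1e-4, 100.0, 0.1     # M, 1/s, M

for S in [0.01, 0.1, 1.0, 10.0]:    # substrate concentrations in M
    v = michaelis_menten_rate(S, E0, kcat, KM)
    print(f"[S] = {S:5.2f} M   v = {v:.3e} M/s   Vmax = {E0 * kcat:.3e} M/s")
```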

It can be shown that under conditions where k2 << k-1 then KM is in fact equal to the equilibrium dissociation constant K1,

$$K_1 = \frac{[\text{ES}]}{[\text{E}]\,[\text{S}]} = \frac{k_{+1}}{k_{-1}} \tag{7}$$

Importantly, however, the parameter KM is not always equivalent to this fundamental equilibrium constant K1 when this constraint (k2 << k-1) doesn't apply.

Furthermore, though the Michaelis-Menten scheme can be extended to cover more complex mechanisms with additional intermediates, the KM and kcat parameters then become even more complex combinations of individual rate and equilibrium constants. The kcat and KM parameters determined by these classical approaches are therefore not the fundamental constants defining the mechanism and significant effort is required to determine the true underlying equilibrium and microscopic rate constants.

In contrast direct analysis using ReactLab to fit the core mechanism to suitable data delivers the true underlying rate and equilibrium constants in a wholly generic way that can be applied without assumptions and also to more complex models.


This involves the modelling of the entire mechanism to deliver the matrix **C** comprising the concentration profiles of all the participating species. The reaction scheme in equation (5) defines a set of ordinary differential equations, ODE's, which need to be solved or integrated (Maeder and Neuhold 2007). Reaction schemes that only consist of first order reactions can be integrated analytically in which case the concentration can be calculated directly at any point using the resulting explicit function. Most other schemes, containing one or more second order reactions, require numerical integration. Numerical integration is usually done with variable step-size Runge-Kutta algorithms, unless the system is 'stiff' (comprising both very fast and slow reactions) for which special stiff solvers, such as Gear and Bulirsch-Stoer algorithms are available (Press, Vetterling et al. 1995).

Integration, explicit or numerical, requires the knowledge of the initial conditions, in the case of kinetics the initial concentrations of all interacting species. For the above example, equation

$$\text{E} + \text{S} \xrightleftharpoons[k_{-1}]{k_{+1}} \text{ES} \xrightarrow{k_{2}} \text{E} + \text{P} \tag{5}$$

and using the rate constants (k+1 = 10³ M⁻¹sec⁻¹, k-1 = 10² sec⁻¹, k2 = 10² sec⁻¹) and initial concentrations ([*S*]0 = 1 M, [*E*]0 = 10⁻⁴ M), the resulting concentration profiles generated by numerical integration and used to populate the columns of matrix **C** are shown in Figure 4.

Fig. 4. Concentration profiles for the enzymatic reaction of equation (5); an expanded concentration axis is used in the right panel.

The transformation of the substrate into the product follows approximately zero-th order kinetics for most of the reaction whilst the substrate is in excess and all the enzyme sites are populated. Later in the reaction the substrate is exhausted and free enzyme released. The expanded plot in the right hand panel displays more clearly the small concentrations for the enzyme and the enzyme-substrate complex.
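As an illustration of such a numerical integration (a minimal sketch using SciPy's general-purpose stiff solver rather than ReactLab; only the rate constants and initial concentrations quoted above are taken from the text), the scheme of equation (5) can be integrated as follows to generate concentration profiles analogous to Figure 4:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Rate constants and initial concentrations quoted in the text.
k_plus1, k_minus1, k2 = 1.0e3, 1.0e2, 1.0e2      # M^-1 s^-1, s^-1, s^-1
E0, S0 = 1.0e-4, 1.0                              # M

def rhs(t, y):
    """ODE system for E + S <=> ES -> E + P, equation (5)."""
    E, S, ES, P = y
    v_assoc = k_plus1 * E * S      # formation of ES
    v_dissoc = k_minus1 * ES       # back reaction
    v_cat = k2 * ES                # catalytic step
    return [-v_assoc + v_dissoc + v_cat,   # dE/dt
            -v_assoc + v_dissoc,           # dS/dt
             v_assoc - v_dissoc - v_cat,   # dES/dt
             v_cat]                        # dP/dt

t_eval = np.linspace(0.0, 200.0, 201)
sol = solve_ivp(rhs, (0.0, 200.0), [E0, S0, 0.0, 0.0],
                method="BDF", t_eval=t_eval, rtol=1e-8, atol=1e-12)

# Columns of the concentration matrix C: E, S, ES, P versus time.
C = sol.y.T
print(C.shape)    # (201, 4)
print(C[-1])      # final concentrations; the substrate is largely converted to product
```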

#### **8. Equilibria**

The chemical model for an equilibrium process is similar to the model of a kinetic process, only now there are exclusively equilibrium interactions, e.g.


$$\begin{aligned} \text{A} + \text{H} &\xrightleftharpoons{K_{AH}} \text{AH} \\ \text{AH} + \text{H} &\xrightleftharpoons{K_{AH_2}} \text{AH}_2 \\ \text{B} + \text{H} &\xrightleftharpoons{K_{BH}} \text{BH} \end{aligned} \tag{8}$$

The chemistry in this example comprises the protonation equilibria of the di-protic acid *AH*2 and the mono-protic acid *BH*. The key difference now is that the steps are defined in terms of instantaneous stability or equilibrium constants, and the fast processes of the attainment of the equilibria are not observed.

Equilibrium investigations require a titration, which consists of the preparation of a series of solutions with different but known total component concentrations. In equilibrium studies we distinguish between components and species. Components are the building blocks; in the example (8) they are *A*, *B* and *H*; species are all the different molecules that are formed from the components during the titration, in the example they are *A*, *AH*, *AH*2, *B*, *BH*, *H* and *OH*. Note, the components are also species.

Instead of utilising numerical integration to compute the concentration profiles of the individual species as we did with kinetic time courses we instead use an iterative Newton-Raphson algorithm to determine the speciation based on the total component concentrations for each sample and the estimated equilibrium constants (Maeder and Neuhold 2007).

For a titration of 10 ml of a solution with total component concentrations [*A*]tot = 0.1 M, [*B*]tot = 0.06 M and [*H*]tot = 0.4 M with 5 ml of 1.0 M *NaOH* the concentration profiles of Figure 5 result. The protonation constants are log(KAH) = 9, log(KAH2) = 3 and log(KBH) = 4.

Fig. 5. Concentration profiles for the titration of an acidified solution of *AH*2 and *BH* with *NaOH*.
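The sketch below illustrates the idea of such a speciation calculation for the system of equation (8), using the total concentrations and protonation constants quoted above. It is deliberately simplified: the free component concentrations of *A* and *B* are expressed analytically in terms of [H+] and only the proton mass balance is solved, by one-dimensional root finding, whereas ReactLab employs a full multi-dimensional Newton-Raphson iteration over all components:

```python
import numpy as np
from scipy.optimize import brentq

# Stepwise protonation constants quoted in the text, and the ionic product of water.
K_AH, K_AH2, K_BH = 10.0**9, 10.0**3, 10.0**4
Kw = 1.0e-14

def speciation(A_tot, B_tot, H_tot):
    """Free species concentrations for given total (component) concentrations."""
    def proton_balance(pH):
        H = 10.0 ** (-pH)
        A = A_tot / (1.0 + K_AH * H + K_AH * K_AH2 * H**2)
        B = B_tot / (1.0 + K_BH * H)
        AH, AH2, BH = K_AH * H * A, K_AH * K_AH2 * H**2 * A, K_BH * H * B
        # total H bound and free, minus OH, must equal the H component total
        return H + AH + 2.0 * AH2 + BH - Kw / H - H_tot

    pH = brentq(proton_balance, 0.0, 14.0)      # bracketing root search on the pH axis
    H = 10.0 ** (-pH)
    A = A_tot / (1.0 + K_AH * H + K_AH * K_AH2 * H**2)
    B = B_tot / (1.0 + K_BH * H)
    return {"pH": pH, "A": A, "AH": K_AH * H * A, "AH2": K_AH * K_AH2 * H**2 * A,
            "B": B, "BH": K_BH * H * B}

# Titration of 10 ml ([A]tot = 0.1 M, [B]tot = 0.06 M, [H]tot = 0.4 M)
# with up to 5 ml of 1.0 M NaOH, in 0.5 ml steps (dilution included).
for v_base in np.arange(0.0, 5.001, 0.5):       # ml of NaOH added
    V = 10.0 + v_base                           # total volume in ml
    totals = (0.1 * 10.0 / V, 0.06 * 10.0 / V, (0.4 * 10.0 - 1.0 * v_base) / V)
    print(f"{v_base:4.1f} ml NaOH -> pH = {speciation(*totals)['pH']:6.3f}")
```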

#### **8.1 Kinetics with coupled protonation equilibria**

A significant recent development is the incorporation of instantaneous equilibria into kinetic analyses. The careful combination of numerical integration computations alongside the Newton-Raphson speciation calculations has made this possible (Maeder, Neuhold et al. 2002). This development has made the modelling of significantly more complex and realistic mechanisms possible. An example is the complex formation between ammonia and *Ni2+* in aqueous solution as represented in equation (9).

$$\begin{aligned} \text{Ni}^{2+} + \text{NH}_3 &\xrightleftharpoons[k_{-ML}]{k_{ML}} \text{Ni(NH}_3)^{2+} \\ \text{Ni(NH}_3)^{2+} + \text{NH}_3 &\xrightleftharpoons[k_{-ML_2}]{k_{ML_2}} \text{Ni(NH}_3)_2^{2+} \\ \text{NH}_3 + \text{H}^+ &\xrightleftharpoons{\log K_{LH}} \text{NH}_4^+ \end{aligned} \tag{9}$$

*Ni2+* is interacting with *NH3* to form the 1:1 and subsequently the 1:2 complexes. Importantly the ammonia is also involved in a protonation equilibrium. As a result the pH changes during the reaction and the rates of the complex formation reactions appear to change. The classical approach to this situation is to add buffers that approximately maintain constant pH and thus also the protonation equilibrium. Since buffers often interfere with the process under investigation the possibility of avoiding them is advantageous. This has only been made possible by this more sophisticated method of mechanistic analysis.

Fig. 6. The concentration profiles for the complex species in the reaction of *Ni2+* and *NH3*; also displayed is the pH of the reacting solution.

The concentration profiles for the complex species are shown in Figure 6. The patterns for a 'normal' consecutive reaction are distorted as the reaction slows down with the drop in pH from 9.5 to 8.2. The initial concentrations for this reaction are [*Ni*2+]0=0.1 M, [*NH*3]0=0.25 M and [*H*+]0=0.1 M.

#### **9. Parameters**


Any parameter that is used to calculate the matrix **C** of concentrations is potentially a parameter that can be fitted. Obvious parameters are the rate constants k and the equilibrium constants K; other less obvious parameters are the initial concentrations in kinetics or the total concentrations in equilibrium titrations. Concentration determinations are of course very common in equilibrium studies (quantitative titrations); for several reasons concentrations are not often fitted in kinetic studies. For first order reactions the concentrations are not defined at all unless there is additional spectroscopic information, i.e. molar absorptivities. For second order reactions they are in principle defined but only very poorly and thus kinetic determination is not a robust analytical technique in this case.

The parameters defining **C** are non-linear parameters and cannot be fitted explicitly, they need to be computed iteratively. Estimates are provided, a matrix **C** constructed and this is compared to the measurement according to the steps that follow below. Once this is complete it is possible to calculate shifts in these parameter estimates in a way that will improve the fit (i.e. reduce the square sum) when a new **C** is computed. This iterative improvement of the non-linear parameters is the basis of the non-linear regression algorithm at the heart of most fitting programs.

#### **10. Calculation of the absorption spectra**

The relationship between the matrix **C** and the measurement is based on equation (3). The matrix **A** contains the molar absorptivity for each species at each measured wavelength. All these molar absorptivities are unknown and thus also parameters to be determined. When spectra are collected the number of these parameters can be very large, but fortunately they are linear parameters and can be dealt with differently to the non-linear parameters discussed above.

Once the concentration profiles have been calculated, the matrix **A** of absorption spectra is computed. This is a linear least-squares calculation with an explicit solution

$$\mathbf{A} = \mathbf{C}^{+}\mathbf{D} \tag{10}$$

**C**<sup>+</sup> is the pseudo-inverse of the concentration matrix **C**; it can be calculated as (**C**<sup>t</sup>**C**)<sup>-1</sup>**C**<sup>t</sup>, or better using a numerically superior algorithm (Press, Vetterling et al. 1995).
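In a matrix-oriented environment this linear step is a one-liner; the sketch below (with simulated **C** and **D**, not chapter data) computes **A** both via a least-squares routine and via the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
nspectra, ncomp, nlam = 60, 3, 120

# Simulated "true" concentration profiles and spectra (arbitrary values).
C_true = rng.random((nspectra, ncomp))
A_true = rng.random((ncomp, nlam))
D = C_true @ A_true + 0.001 * rng.standard_normal((nspectra, nlam))

# Linear least-squares step of equation (10): A = C+ D.
# lstsq avoids forming (C^t C)^-1 C^t explicitly, as recommended in the text.
A_fit, *_ = np.linalg.lstsq(C_true, D, rcond=None)

# The same result via the Moore-Penrose pseudo-inverse (SVD based).
A_pinv = np.linalg.pinv(C_true) @ D

print(np.allclose(A_fit, A_pinv, atol=1e-8))   # True
print(np.max(np.abs(A_fit - A_true)))          # small: the spectra are recovered
```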

#### **11. Non-linear regression: fitting of the non-linear parameters**

Fitting of the parameters requires the software to systematically vary all non-linear parameters, the rate and equilibrium constants as well as others such as initial concentrations, with the aim of minimising the sum of squares over all residuals, as defined in equation (4).

There are several algorithms for this task. The simplex algorithm is relatively easy to program and features robust convergence, at the price of slow computation times, particularly when many parameters are fitted. Significantly faster is the Newton-Gauss algorithm; additionally it delivers error estimates for the parameters, and with implementation of the Marquardt algorithm it is also very robust (Gans 1992; Maeder and Neuhold 2007).

As mentioned earlier non-linear regression is an iterative process and, provided the initial parameter estimates are not too poor and the model is not under-determined by the data, will converge to a unique minimum yielding the best fit parameters. With more complex models it is often necessary to fix certain parameters (either rate constants, equilibrium constants or complete spectra) particularly if they are known through independent investigations and most fitting applications will allow this type of constraint to be applied.
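The overall fitting loop can be sketched compactly: for each trial value of the non-linear parameters the model generates **C**, the linear parameters **A** are eliminated by the least-squares step of equation (10), and the residuals are returned to the optimiser. The Python example below does this for a simple first-order A→B reaction with invented data and a single rate constant, using SciPy's Levenberg-Marquardt implementation; it is a minimal sketch of the principle, not ReactLab's implementation:

```python
import numpy as np
from scipy.optimize import least_squares

# --- simulate a measurement for a simple first-order reaction A -> B ---
rng = np.random.default_rng(3)
t = np.linspace(0.0, 10.0, 50)          # reaction times
wl = np.linspace(400.0, 600.0, 80)      # wavelengths (nm)
k_true, A0 = 0.7, 1.0e-3                # "true" rate constant, initial concentration

def conc_profiles(k):
    """Concentration matrix C for A -> B with rate constant k (explicit solution)."""
    cA = A0 * np.exp(-k * t)
    return np.column_stack([cA, A0 - cA])

spec_A = 2.0e4 * np.exp(-0.5 * ((wl - 450.0) / 30.0) ** 2)   # invented spectra
spec_B = 1.5e4 * np.exp(-0.5 * ((wl - 550.0) / 40.0) ** 2)
D = conc_profiles(k_true) @ np.vstack([spec_A, spec_B])
D += 1e-4 * rng.standard_normal(D.shape)                     # measurement noise

# --- fitting: only k is non-linear, the spectra are eliminated linearly ---
def residuals(params):
    k = params[0]
    C = conc_profiles(k)                        # step 1: model -> C
    A, *_ = np.linalg.lstsq(C, D, rcond=None)   # step 2: linear step, A = C+ D
    return (D - C @ A).ravel()                  # step 3: residual matrix R

fit = least_squares(residuals, x0=[0.1], method="lm")   # Levenberg-Marquardt
print(f"fitted k = {fit.x[0]:.4f} (true value {k_true})")
print(f"ssq      = {np.sum(fit.fun ** 2):.3e}")
```

Fixing a parameter, as mentioned above, simply amounts to excluding it from the vector handed to the optimiser.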

#### **12. ReactLab analysis tools**


ReactLab™ (Jplus Consulting Ltd) is a suite of software which is designed to carry out the fitting of reaction models to either kinetic or equilibrium multiwavelength (or single wavelength) data sets. All the core calculations described above are handled internally and the user simply provides the experimental measurements and a reaction scheme that is to be fitted to the data. A range of relevant supporting options is available, as well as comprehensive graphical tools for visualising the data and the result of the analysis. To facilitate this, all data, models and results are provided in pre-formatted Excel workbooks to allow post processing of results or customised plots to be added after the main analysis is complete.

#### **13. Representing the chemical model**

As has been discussed, at the root of the analysis is the generation of the species concentration matrix **C.** 

Fitting a proposed reaction mechanism, or part of it, to the data therefore requires the determination of **C** from a reaction scheme preferably as would be written by a chemist. The ReactLab representation of the Ni2+/NH3 complexation mechanism of Equation (9) is shown in Figure 7. Forward arrows, >, are used to represent rate constants and the equal sign, =, represents an instantaneous equilibrium, e.g. a protonation equilibrium.


Fig. 7. The ReactLab definition of the mechanism in Equation (7) with rate and equilibrium constants used to compute the concentration profiles of Figure 6.

To get from this scheme to our intermediate matrix **C** involves a number of key computational steps requiring firstly the dissection of the mechanism into its fundamental mathematical building blocks. Many analysis tools require the user to take this initial step manually, and therefore to understand some fairly sophisticated underlying mathematical principles. Whilst this is no bad thing, it does complicate the overall process and provides a significant barrier to trying and becoming familiar with this type of direct data fitting using numerical integration based algorithms. A significant advance has been the development of model editors and translators which carry out this process transparently. These can be found in a variety of data fitting applications including ReactLab.
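As a toy illustration of what such a translator does (this is not ReactLab's actual parser), a scheme written in the '>'/'=' notation of Figure 7 can be dissected into a species list and a set of steps that a speciation or ODE builder could consume:

```python
def parse_scheme(lines):
    """Very small sketch of a mechanism 'compiler': '>' marks a kinetic step, '=' an equilibrium."""
    species, steps = [], []
    for line in lines:
        arrow = '>' if '>' in line else '='
        lhs, rhs = (side.strip() for side in line.split(arrow))
        reactants = [s.strip() for s in lhs.split('+')]
        products = [s.strip() for s in rhs.split('+')]
        for name in reactants + products:
            if name not in species:
                species.append(name)
        steps.append({'type': 'kinetic' if arrow == '>' else 'equilibrium',
                      'reactants': reactants, 'products': products})
    return species, steps

species, steps = parse_scheme(['A > B', 'B > C'])
print(species)    # ['A', 'B', 'C']
print(steps[0])   # {'type': 'kinetic', 'reactants': ['A'], 'products': ['B']}
```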

#### **14. Examples**

In the following we will guide the reader through the steps required for the successful analysis of a number of example data sets. This section consists of two examples each from kinetics and equilibrium studies.


Example 1: Consecutive reaction scheme $A \xrightarrow{k_1} B \xrightarrow{k_2} C$

The data set comprises a collection of spectra measured as a function of time. These are arranged as rows of the 'Data' worksheet in Figure 8, the first spectrum in the cells D6:X6, the second in D7:X7, and so on. For each spectrum the measurement time is required and these times are collected in column C. The vector of wavelengths at which the spectra were acquired is stored in row 5 above the spectra. The two inserted figures display the data as a function of time and of wavelength.

Fig. 8. The data arranged in an Excel spreadsheet; the figure on the left displays the kinetic traces at all wavelengths, the figure on the right displays the measured spectra.

Prior to the fitting, the chemical reaction model on which the analysis will be based needs to be defined. As mentioned above, ReactLab and other modern programs incorporate a model translator that allows the definition in a natural chemistry language and which subsequently translates it automatically into the internal coefficient information that allows the automatic construction of the mathematical expressions required by the numerical and speciation algorithms. Note that for each reaction an initial guess for the rate constant has to be supplied. The ReactLab model for this reaction is shown in Figure 9.


Fig. 9. The definition of the chemical model for the consecutive reaction scheme $A \xrightarrow{k_1} B \xrightarrow{k_2} C$.

The 'compiler' recognises that there are 3 reacting species, *A*, *B*, *C*, and 2 rate constants. For the initial concentrations the appropriate values have to be supplied by the user; in the example [*A*]init = 0.001 M, while [*B*]init and [*C*]init are zero. Further, the spectral status of each species needs to be defined; in the example all 3 species are 'colored', i.e. they absorb in the wavelength range of the data, see Figure 10. The alternative 'non-absorbing' indicates that the species does not absorb in the measured range. Advanced packages including ReactLab also allow the implementation of 'known' spectra which need to be introduced elsewhere in the workbook.



Fig. 10. For the consecutive reaction scheme there are 3 reaction species for which initial concentrations need to be given.

The program is now in a position to first calculate the concentration of all species as a function of time and subsequently their absorption spectra. The results for the present initial guesses for the rate constants are displayed in Figure 11.

Fig. 11. The concentration profiles and absorption spectra, as calculated with initial guesses for the rate constants shown in Figure 9.
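For this first-order consecutive scheme the concentration profiles of Figure 11 can also be written in closed form (valid for k1 ≠ k2); a small sketch with placeholder rate constants and the initial concentration used above:

```python
import numpy as np

def abc_profiles(t, k1, k2, a0=1e-3):
    """Closed-form concentration profiles for A -> B -> C (first order, k1 != k2)."""
    a = a0 * np.exp(-k1 * t)
    b = a0 * k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
    c = a0 - a - b                      # mass balance closes the profile for C
    return np.column_stack([a, b, c])

t = np.linspace(0.0, 10.0, 100)
C = abc_profiles(t, k1=1.0, k2=0.2)     # one column per species A, B, C
```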

The concentration profiles indicate a fast first reaction *A*>*B* and a much slower subsequent reaction *B*>*C*. However, the calculated, partially negative absorption spectra clearly demonstrate that there is 'something wrong', that the initial guesses for the rate constants are obviously not correct. In this example the deviations are not too severe indicating the model itself is plausible.

Clicking the Fit button initiates the iterative improvement of the parameters and after a few iterations the 'perfect' results are evident. This of course is supportive of the validity of the model itself. If the scheme is wrong and cannot account for the detail in the data, a good fit will be unobtainable. The ReactLab GUI at the end of the fit is given in Figure 12.

On the other hand an over complex model has to be carefully avoided as any data can usually be fitted with enough parameters (including artefacts!). Occam's razor should be assiduously applied accepting the simplest model that fits the data as the most likely.

Of course it is also a risk that a model is underdetermined by the data. Put simply, the information in the measurement is not sufficient to deliver a unique solution, and the program will not converge properly but will usually oscillate, delivering one of an infinite number of solutions (usually combinations of rates). Whilst this does not imply the model is incorrect, further work will be required to determine and fix key parameters or spectra in order to resolve the problem.


Fig. 12. The ReactLab GUI displaying the progress of *ssq*, the residuals, the absorption spectra and the concentration profiles after the fitting.


Example 2: Kinetic analysis of the reaction of Ellman's reagent (*DTNB*) and thioglycerol, *RS-*.

This example illustrates the direct fitting of a simplified model, followed by the correct and more complex model, to a data set collected using a PDA detector on a stopped-flow instrument (data courtesy of TgK Scientific Ltd, UK).

Ellman's reagent, 5,5'-Dithio-bis(2-nitrobenzoic acid) or *DTNB*, is a commercially available reagent for quantifying thiols both in pure and biological samples and for measuring the number of thiol groups on proteins. The reaction yields a colored thiolate ion, *RS-TNB*, which absorbs at 412 nm and can be used to quantify the original thiol concentration. In this particular case the reaction with thioglycerol, *RS-*, leads to a 2-step disulphide exchange reaction and is particularly suited for establishing the dead-time of stopped-flow instruments (Paul, Kirschner et al. 1979). The reaction is represented in Figure 13. The model in the ReactLab definition is given in Figure 16.


Fig. 13. The 2-step reaction of *DTNB* with a thiolate, *RS-*.

A 3-D representation of the spectra measured at different time intervals for a total of 1.5 sec. on a stopped-flow instrument is shown in Figure 14.

Fig. 14. Spectral changes resulting from the reactions of Ellman's reagent (*DTNB*) and thioglycerol (*RS-*).

In the experiment a large excess of thioglycerol was used and thus the two second order reactions can be approximated with a two-step pseudo first order sequential mechanism. We therefore first attempt to fit this biphasic reaction with a simple consecutive reaction scheme with three colored species, $A \xrightarrow{k_1} B \xrightarrow{k_2} C$ (Figure 15). The fitted rates are 75 sec<sup>-1</sup> and 3.9 sec<sup>-1</sup>.


Fig. 15. A simple consecutive reaction model with best fit rates and spectra.

The problems with this fit are two-fold. First, we know the reactions are second order; second, the intermediate spectrum for species *B*, with its two peaks, is implausible.

The complete model as defined in ReactLab, together with the fit at 412 nm, is displayed in Figure 16. Whilst the square sum is not significantly improved, the spectra are now correct according to literature sources, and the corresponding rates for the 2 steps are 1.57 × 10<sup>4</sup> ± 6 M<sup>-1</sup>sec<sup>-1</sup> and 780 ± 2 M<sup>-1</sup>sec<sup>-1</sup>.

Fig. 16. Fitting the complete mechanism to the data yields the correct rates. The right panel displays the quality of the fit at 412 nm on a logarithmic time axis.

The concentration profiles and spectra resulting from a fit to the correct model are now as in Figure 17.


Fig. 17. Concentration profiles and molar absorption spectra for the analysis based on the complete reaction scheme.

This example does serve to demonstrate that good fits can be obtained with an incorrect or simplistic model and that some insight and care is required to establish the correct mechanism and obtain genuine parameters. What is certainly true is that the second model could only be fitted because of the numerical integration of the more complex second order mechanisms. This was a trivial change of model configuration in ReactLab and could not have been achieved using classical analysis approaches. Secondly, the importance of dealing with whole spectra is highlighted in that the spectra resulting from the fit provide important insight into the underlying chemistry and must make sense in this respect. Single wavelength kinetic analysis has no such indirect reinforcement.

By way of a final comment on this example: we noted that the data was collected under pseudo first order conditions, i.e. with one reagent in excess. This ubiquitous approach was essential to enable the determination of second order rate constants from a first order fit by classical analysis using explicit functions (usually sums of exponentials). In the pseudo first order simplification a 2nd order rate constant is calculated from the observed pseudo first order rate constant.
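For reference, the pseudo first order simplification amounts to the relation below. The back-calculated excess thiolate concentration (about 5 mM from the rates quoted above) is only an illustrative consistency check, not a value stated in the text.

$$k_{\text{obs}} = k_{\text{2nd}}\,[RS^-]_{\text{excess}} \;\;\Rightarrow\;\; [RS^-]_{\text{excess}} = \frac{k_{\text{obs}}}{k_{\text{2nd}}} \approx \frac{75\ \text{sec}^{-1}}{1.57\times 10^{4}\ \text{M}^{-1}\text{sec}^{-1}} \approx 4.8\ \text{mM}, \qquad \frac{3.9\ \text{sec}^{-1}}{780\ \text{M}^{-1}\text{sec}^{-1}} \approx 5.0\ \text{mM}$$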


Numerical integration methods eliminate the need for this constraint and therefore any requirement to work under pseudo first order conditions (or indeed the comparable constraint of keeping the reactant concentrations equal).

Example 3: Equilibrium investigation, concentration determination of a diprotic and a strong acid

Titrations can be used for the determination of equilibrium constants and, in this respect, the analysis is very similar to a kinetic investigation. Titrations are also an important analytical tool for the determination of concentrations; in real terms this is probably the more common application.

Let us consider a titration of a solution of the diprotic acid *AH*2 in the presence of an excess of a strong acid, e.g. *HCl*. The equilibrium model only includes the diprotic acid as the strong acid is always completely dissociated, see the ReactLab model in Fig. 18.


Fig. 18. The model for a diprotic acid that undergoes two protonation equilibria.

The components are *A* and *H*, the species are *A*, *AH*, *AH*2, *H*, *OH*. For the components the total concentrations have to be known in each solution during the titration. They are collected in the columns E and F of a spreadsheet. The measured spectra are collected from the cell N7 on, see Figure 19.


Fig. 19. Data entry for a titration, the crucial total concentrations are stored in the columns E and F, the spectra to the right (incomplete in this Figure).
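Behind these spreadsheet entries, each titration point requires solving the component mass balances for the free concentrations, from which all species concentrations follow. The sketch below does this for the diprotic-acid model with assumed formation constants and totals (the numerical values are placeholders, not those of the example).

```python
import numpy as np
from scipy.optimize import fsolve

logb1, logb2 = 9.0, 15.5            # assumed overall formation constants for AH and AH2
Kw = 1e-14                          # ionic product of water
A_tot, H_tot = 1.0e-3, 2.5e-3       # assumed total (analytical) concentrations

def mass_balance(logc):
    a, h = 10.0 ** logc             # free concentrations of the components A and H
    ah, ah2 = 10.0 ** logb1 * a * h, 10.0 ** logb2 * a * h ** 2
    oh = Kw / h
    return [a + ah + ah2 - A_tot,                 # total A
            h + ah + 2 * ah2 - oh - H_tot]        # total (analytical) H

a_free, h_free = 10.0 ** fsolve(mass_balance, np.log10([A_tot / 2, 1e-4]))
print(f"free [A] = {a_free:.3e} M, pH = {-np.log10(h_free):.2f}")
```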

In the example, 10 ml of a solution containing *A* and *H* are titrated with 0.1 ml aliquots of base. The concentrations [*A*]tot and [*H*]tot are computed from the volumes and concentrations of the original solution in the 'beaker' and in the 'burette'. These concentrations are stored in the main sheet in the rows 37 and 38 as seen in Figure 20.
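The dilution bookkeeping is straightforward; a short sketch with placeholder stock concentrations (not the values of the example) shows how the totals could be updated after each 0.1 ml addition of base:

```python
import numpy as np

V0 = 10.0                                  # initial volume in the 'beaker' (ml)
A_beaker, H_beaker = 1.0e-3, 2.5e-3        # assumed component totals in the beaker (M)
c_base = 0.1                               # assumed NaOH concentration in the burette (M)

v_added = np.arange(0.0, 2.0001, 0.1)      # added volume of base (ml), 0.1 ml aliquots
V = V0 + v_added
A_tot = A_beaker * V0 / V                  # component A is only diluted
H_tot = (H_beaker * V0 - c_base * v_added) / V   # added base lowers the analytical H total
```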

The total concentrations of *A* and *H* in the 'beaker' are defined in the cells C37, D37, the same component concentrations in the burette solution in the row below. Often these concentrations are known and then there is nothing to be added. In the example the component concentrations in the 'beaker' are to be fitted. This is achieved by defining them as auxiliary parameters in the cells K7, K8; the contents of cells C37, D37, are now references to the auxiliary parameters which are fitted.


Fig. 20. Model entry and information on concentrations and spectral status.


Fitting results in values for the concentrations and their error estimates, Figure 21.


Fig. 21. The result of the fitting of the concentrations, complete with error analysis.

Example 4: Equilibrium interaction between *Cu2+* and *PHE* (1,9-Bis(2-hydroxyphenyl)-2,5,8-triazanonane)

In this example we demonstrate the analysis of 'pH-metered' titrations, a mode of titration that is commonly used for the investigation of equilibria in aqueous solution. In this kind of titration the independent variable is the pH, rather than the added volume of reagent as has been the case in all previous examples. As before, the measured spectra are the dependent variables. An important rationale for 'pH-metered' titrations is the fact that it is often difficult to completely exclude CO2 during the whole titration. Unknown amounts of CO2 result in the addition of unknown amounts of acid via formation of carbonate species. In 'pH-metered' titrations the effect of this impurity is minimal as long as none of the carbonate species interfere with the process under investigation; the effect in the 'default mode' can be much more pronounced. The price to pay for that advantage is more complex data acquisition, as the pH has to be measured and recorded together with the spectra after each addition of reagent.

The example is the titration of *PHE*, 1,9-Bis(2-hydroxyphenyl)-2,5,8-triazanonane, with *Cu2+* in aqueous solution (Gampp, Haspra et al. 1984). The structure of the ligand is shown below. It forms several complexes: *ML*, where the ligand is penta-coordinated, presumably via all three secondary amine groups as well as the deprotonated phenolates; and two partially protonated species *MLH* and *MLH2*, in which one or both of the phenolates are protonated and most likely not or only very weakly coordinated.

In this titration a solution of 7.23×10<sup>-4</sup> M *Cu2+* and 1.60×10<sup>-3</sup> M *PHE* with an excess of *HCl* was titrated with a total of approx. 750 μL of *NaOH* solution. After each addition of the base the pH and the spectrum were measured. The total concentrations of metal and ligand are entered for each sample in the **'Data'** worksheet. Note that the column for the total concentration of the protons is left empty: the measured pH in column M defines the free proton concentration, which in turn is used to compute all species concentrations in conjunction with the provided total concentrations of, in this case, the metal ion and the ligand, see Figure 22.


Fig. 22. Only the total concentrations of the metal and ligand are required, the column for the protons is left empty. Column M contains the pH values and the entry 'pH' in cell M6 to indicate a 'pH-metered' titration.
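In a 'pH-metered' data set the measured pH column therefore replaces the proton mass balance; the conversion is a one-liner (activity effects are ignored in this sketch and the pH values are placeholders):

```python
import numpy as np

pH = np.array([3.2, 4.1, 5.0, 6.3, 7.8])   # placeholder pH readings, one per spectrum
h_free = 10.0 ** (-pH)                      # free [H+] used directly in the speciation step
```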

The measurement is displayed in Figure 23, each curve is the measured absorption at one particular wavelength.


Fig. 23. The measurement, here displayed as a series of titration curves at the different wavelengths.

The ligand *PHE* has five protonation constants which have to be determined independently. The successive logK values are 10.06, 10.41, 9.09, 7.94 and 4.18. Note that in the model the protonations are defined as overall stabilities, see Figure 24.
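If the stepwise protonation constants are defined in the usual way, the overall stabilities used in the model are simply cumulative sums of the successive logK values; a one-line check:

```python
import numpy as np

logK = np.array([10.06, 10.41, 9.09, 7.94, 4.18])   # successive logK values quoted above
logbeta = np.cumsum(logK)                            # overall stabilities (log beta)
print(logbeta)                                       # [10.06 20.47 29.56 37.5  41.68]
```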

The results of the analysis are summarised in Figure 24 and Figure 25. Again the protonation equilibria for the complex species are defined as formation constants; the logK values for the protonation equilibria *ML* + *H* ⇌ *MLH* and *MLH* + *H* ⇌ *MLH2* are 8.42 and 3.92.


Fig. 24. The fitted equilibrium constants for the formation of the *ML*, *MLH* and *MLH2* complexes.

The concentration profiles are represented in two different modes: the left part has the measured pH as the x-axis and only the metal species are shown, while the right part shows all species concentrations as a function of the added volume of base. This figure reveals that a substantial excess of acid has been added to the initial solution and the first 0.2 mL of base are used to neutralise this excess.


Fig. 25. The concentration profiles of the complexes as a function of pH and all species as a function of the volume of added base.

#### **15. Conclusion**

Data fitting is a well-established method that has been extensively used for the analysis of chemical processes since the beginnings of instrumental methods. Generally, increasing sophistication of instrumentation has inspired parallel developments in numerical methods for appropriate analysis of ever better and more plentiful data. This chapter concentrates on spectroscopic methods for the investigation of chemical processes; it details the structure of the data and the principles of model based data fitting. It is rounded off by a collection of typical and illustrative examples using a modern data analysis software package. The aim is to demonstrate the power and effortlessness of modern data analysis of typical data sets.

#### **16. References**

Espenson, J. H. (1995). *Chemical Kinetics and Reaction Mechanisms*. New York, McGraw-Hill.

Gampp, H., D. Haspra, et al. (1984). "Copper(II) Complexes with Linear Pentadentate Chelators." *Inorganic Chemistry* 23: 3724-4730.

Gans, P. (1992). *Data Fitting in the Chemical Sciences*. Wiley.

Maeder, M. and Y.-M. Neuhold (2007). *Practical Data Analysis in Chemistry*. Amsterdam, Elsevier.

Maeder, M., Y. M. Neuhold, et al. (2002). "Analysis of reactions in aqueous solution at non-constant pH: No more buffers?" *Phys. Chem. Chem. Phys.* 5: 2836-2841.

Martell, A. E. and R. J. Motekaitis (1988). *The Determination and Use of Stability Constants*. New York, VCH.

Michaelis, L. and M. L. Menten (1913). "Die Kinetik der Invertinwirkung." *Biochem. Z.* 49: 333-369.

Norman, S. and M. Maeder (2006). "Model-Based Analysis for Kinetic and Equilibrium Investigations." *Critical Reviews in Analytical Chemistry* 36: 199-209.

Paul, C., K. Kirschner, et al. (1979). "Calibration of Stopped-Flow Spectrophotometers Using a Two-Step Disulfide Exchange Reaction." *Analytical Biochemistry* 101: 442-448.

Polster, J. and H. Lachmann (1989). *Spectrometric Titrations: Analysis of Chemical Equilibria*. Weinheim, VCH.

Press, W. H., W. T. Vetterling, et al. (1995). *Numerical Recipes in C*. Cambridge, Cambridge University Press.

Wilkins, R. G. (1991). *Kinetics and Mechanism of Reactions of Transition Metal Complexes*. Weinheim, VCH.


### **Exploratory Data Analysis with Latent Subspace Models**

José Camacho *Department of Signal Theory, Telematics and Communication, University of Granada, Granada Spain*

#### **1. Introduction**



Exploratory Data Analysis (EDA) has been employed for decades in many research fields, including social sciences, psychology, education, medicine, chemometrics and related fields (1) (2). EDA is both a data analysis philosophy and a set of tools (3). Nevertheless, while the philosophy has essentially remained the same, the tools are in constant evolution. The application of EDA to current problems is challenging due to the large scale of the data sets involved. For instance, genomics data sets can have up to a million variables (5). There is a clear interest in developing EDA methods to manage these scales of data while taking advantage of *the basic importance of simply looking at data* (3).

In data sets with a large number of variables, collinear data and missing values, projection models based on latent structures, such as Principal Component Analysis (PCA) (6) (7) (1) and Partial Least Squares (PLS) (8) (9) (10), are valuable tools within EDA. Projection models and the set of tools used in combination simplify the analysis of complex data sets, pointing out special observations (outliers), clusters of similar observations, groups of related variables, and crossed relationships between specific observations and variables. All this information is of paramount importance to improve data knowledge.

EDA based on projection models has been successfully applied in the area of chemometrics and industrial process analysis. In this chapter, several standard tools for EDA with projection models, namely score plots, loading plots and biplots, are reviewed and their limitations are elucidated. Two recently proposed tools are introduced to overcome these limitations. The first of them, named Missing-data methods for Exploratory Data Analysis or MEDA for short (11), is used to investigate the relationships between variables in projection subspaces. The second one is an extension of MEDA, named observation-based MEDA or *o*MEDA (33), to discover the relationships between observations and variables. The EDA approach based on PCA/PLS with scores and loading plots, MEDA and *o*MEDA is illustrated with several real examples from the chemometrics field.

This chapter is organized as follows. Section 2 briefly discusses the importance of subspace models and score plots to explore the data distribution. Section 3 is devoted to the investigation of the relationship among variables in a data set. Section 4 studies the relationship between observations and variables in latent subspaces. Section 5 presents an EDA case study of Quantitative Structure-Activity Relationship (QSAR) modelling and section 6 proposes some concluding remarks. Examples and Figures were computed using the MATLAB programming environment, with the PLS-Toolbox (32) and home-made software. A MATLAB toolbox with the tools employed in this chapter is available at http://wdb.ugr.es/ josecamacho/.



#### **2. Patterns in the data distribution**

The distribution of the observations in a data set contains relevant information for data understanding. For instance, in an industrial process, one outlier may represent an abnormal situation which affects the process variables to a large extent. Studying this observation with more detail, one may be able to identify if it is the result of a process upset or, very commonly, a sensor failure. Also, clusters of observations may represent different operation points. Outliers, clusters and trends in the data may be indicative of the degree of control in the process and of assignable sources of variation. The identification of these sources of variation may lead to the reduction of the variance in the process with the consequent reduction of costs.

The distribution of the observations can be visualized using scatter plots. For obvious reasons, scatter plots are limited to three dimensions at most, and typically to two dimensions. Therefore, the direct observation of the data distribution in data sets with several tens, hundreds or even thousands of variables is not possible. One can always construct scatter plots for selected pairs or triplets of variables, but this is an overwhelming and often misleading approach. Projection models overcome this problem. PCA and PLS can be used straightforwardly to visualize the distribution of the data in the latent subspace, considering only a few latent variables (LVs) which contain most of the variability of interest. Scatter plots of the scores corresponding to the LVs, the so-called score plots, are used for this purpose.
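A score plot of the first two principal components takes only a few lines; the sketch below (with a random stand-in for the data matrix) mean-centres the data, computes the scores via the singular value decomposition and labels the axes with the explained variance, in the style of Figure 1.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(0).standard_normal((100, 100))   # observations x variables (stand-in)
Xc = X - X.mean(axis=0)                                     # mean-centring
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                                   # scores on the first two PCs
explained = 100 * s**2 / np.sum(s**2)                       # explained variance per PC (%)

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel(f"Scores on PC 1 ({explained[0]:.2f}%)")
plt.ylabel(f"Scores on PC 2 ({explained[1]:.2f}%)")
plt.show()
```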

Score plots are well known and accepted in the chemometric field. Although simple to understand, score plots are paramount for EDA. The following example may be illustrative of this. In Figure 1, three simulated data sets of the same size (100 × 100) are compared. Data simulation was performed using the technique named Approximation of a DIstribution for a given COVariance matrix (15), or ADICOV for short. Using this technique, the same covariance structure was simulated for the three data sets but with different distributions: the first data set presents a multi-normal distribution in the latent subspace, the second one presents a severe outlier and the third one presents a pair of clusters. If the scatter plot of the observations in the plane spanned by the first two variables is depicted (first row of Figure 1), the data sets seem to be almost identical. Therefore, unless an extensive exploration is performed, the three data sets may be thought to come from a similar data generation procedure. However, if a PCA model for each data set is fitted and the score plots corresponding to the first 2 PCs are shown (second row of Figure 1), differences among the three data sets are made apparent: in the second data set there is one outlier (right side of Figure 1(e)) and in the third data set there are two clusters of observations. As already discussed, the capability to find these details is paramount for data understanding, since outliers and clusters are very informative of the underlying phenomena. Most of the time these details are also apparent in the original variables, but finding them may be tedious work. Score plots after PCA modelling are perfectly suited to discover large deviations among the observations, avoiding the overwhelming task of visualizing each possible pair of original variables. Also, score plots in regression models such as PLS are paramount for model interpretation prior to prediction.


Fig. 1. Experiment with three simulated data sets of dimension 100 × 100. Data simulation was performed using the ADICOV technique (15). In the first row of figures, the scatter plots corresponding to the first two variables in the data sets are shown. In the second row of figures, the scatter plots (score plots) corresponding to the first two PCs in the data sets are shown.

#### **3. Relationships among variables**

PCA has often been employed to explore the relationships among variables in a data set (19; 20). Nevertheless, it is generally accepted that Factor Analysis (FA) is better suited than PCA to study these relationships (1; 7). This is because FA algorithms are designed to distinguish between shared and unique variability. The shared variability, the so-called communalities in the FA community, reflects the common factors–common variability–among observable variables. The unique variability is only present in one observable variable. The common factors make up the relationship structure in the data. PCA makes no distinction between shared and unique variability and therefore it is not suited to find the structure in the data.

When either PCA or FA are used for data understanding, a two step procedure is typically followed (1; 7). Firstly, the model is calibrated from the available data. Secondly, the model is rotated to obtain a so-called simple structure. The second step is aimed at obtaining loading vectors with as many loadings close to 0 as possible. That way, the loading vectors are easier to interpret. It is generally accepted that oblique transformations are preferred to the simpler orthogonal transformations (19; 20), although in many situations the results are similar (1).
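One of the most common orthogonal rotations, varimax, can be sketched in a few lines of NumPy using the classical SVD-based iteration; this is a generic illustration, not the PLS-Toolbox routine used in the chapter.

```python
import numpy as np

def varimax(P, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate a loading matrix P (variables x components) towards a simple structure."""
    p, k = P.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        L = P @ R
        u, s, vt = np.linalg.svd(
            P.T @ (L**3 - (gamma / p) * L @ np.diag(np.diag(L.T @ L))))
        R = u @ vt                         # updated rotation matrix
        var_new = s.sum()
        if var_old != 0 and var_new / var_old < 1 + tol:
            break
        var_old = var_new
    return P @ R                           # rotated loadings

P = np.linalg.svd(np.random.default_rng(0).standard_normal((100, 10)),
                  full_matrices=False)[2][:2].T   # loadings of the first 2 PCs (stand-in data)
P_rot = varimax(P)
```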

The limitation of PCA to detect common factors and the application of rotation methods will be illustrated using the pipelines artificial examples (14). Data for each pipeline are simulated according to the following equalities:


$$F12 = F1 + F2$$

$$F123 = F1 + F2 + F3$$

where F1, F2 and F3 represent liquid flows which are generated independently at random following a normal distribution with 0 mean and standard deviation 1. Measurement noise at a 30% level is added to each of the five variables in a pipeline, generated at random following a normal distribution with 0 mean:

$$\tilde{x}\_{i} = (x\_{i} + \sqrt{0.3} \cdot n) / \sqrt{1.3}$$

where *x̃<sub>i</sub>* is the contaminated variable, *x<sub>i</sub>* the noise-free variable and *n* the noise generated. This simulation structure generates blocks of five variables with three common factors: the common factor F1, present in the observed variables F1, F12 and F123; the common factor F2, present in the observed variables F2, F12 and F123; and the common factor F3, present in F3 and F123. Data sets of any size can be obtained by combining the variables from different pipelines. In the present example, a data set with 100 observations from two pipelines for which data are independently generated is considered. Thus, the size of the data set is 100 × 10 and the variability is built from 6 common factors.
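This simulation is straightforward to reproduce. The sketch below (Python with numpy; function and variable names are just illustrative choices) generates the 100 × 10 data set with two independently simulated pipelines following the equalities and the noise model above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pipeline(n_obs, rng):
    """Five observed variables of one pipeline: F1, F2, F3, F12 = F1+F2, F123 = F1+F2+F3."""
    F1, F2, F3 = rng.normal(size=(3, n_obs))
    clean = np.column_stack([F1, F2, F3, F1 + F2, F1 + F2 + F3])
    noise = rng.normal(size=clean.shape)
    # add 30% of measurement noise and rescale, as in the equation above
    return (clean + np.sqrt(0.3) * noise) / np.sqrt(1.3)

# two independently generated pipelines -> 100 x 10 data set built from 6 common factors
X = np.hstack([simulate_pipeline(100, rng), simulate_pipeline(100, rng)])
```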

Fig. 2. Loading plots of the PCA model fitted from the data in the two pipelines example: (a) original loadings and (b) loadings after varimax rotation.

Figure 2 shows the loading plots of the PCA model of the data before and after rotation. Loading plots are interpreted so that close variables, provided they are far enough from the origin of coordinates, are considered to be correlated. This interpretation is not always correct. In Figure 2(a), the first component separates the variables corresponding to the two pipelines. The second component captures variance of most variables, especially those in the second pipeline. The two PCs capture variability corresponding to most common factors in the data at the same time, which complicates the interpretation. As already discussed, PCA is focused on variance, without making the distinction between unique and shared variance. The result is that the common factors are not aligned with the PCs. Thus, one single component reflects several common factors and the same common factor may be reflected in several components. As a consequence, variables with high and similar loadings in the same subset of components are not necessarily correlated, since they may present very different loadings in other components. Because of this, inspecting only a pair of components may lead to incorrect conclusions. A good interpretation would require inspecting and interrelating all pairs of components with relevant information, something which may be challenging in many situations. This problem affects the interpretation and it is the reason why FA is generally preferred to PCA.

Figure 2(b) shows the resulting loadings after applying one of the most widely used rotation methods: the varimax rotation. Now, the variables corresponding to each pipeline are grouped towards one of the loading vectors. This highlights the fact that there are two main and orthogonal sources of variability, each one representing the variability in a pipeline. Also, in the first component the variables collected from pipeline 2 present low loadings, whereas in the second component the variables collected from pipeline 1 present low loadings. This is the result of applying the notion of simple structure, with most of the loadings rotated towards 0. The interpretation is simplified as a consequence of improving the alignment of the components with the common factors. This is especially useful in data sets with many variables.
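A rotation along the lines of Figure 2(b) can be obtained with a standard varimax routine. The sketch below (Python/numpy; the varimax function is a common textbook implementation, not taken from any particular toolbox, and X is the pipelines matrix simulated above) rotates the loadings of a 2-PC PCA model:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loadings matrix (variables x components)."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ R

# X is the 100 x 10 pipelines data set from the previous sketch
Xc = X - X.mean(axis=0)                 # mean-center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P2 = Vt[:2].T                           # loadings of the first 2 PCs (10 x 2)
P2_rot = varimax(P2)                    # rotated loadings, as in Figure 2(b)
```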

Although FA and rotation methods may improve the interpretation, they still present severe limitations. The derivation of the structure in the data from a loading plot is not straightforward. On the other hand, the rotated model depends greatly on the normalization of the data and the number of PCs (1; 21). To avoid this, several alternative approaches to rotation have been suggested. What these approaches have in common is that they seek a trade-off between variance explained and model simplicity (1). Nevertheless, imposing a simple structure also has drawbacks. Reference (11) shows that, when simplicity is pursued, there is a potential risk of simplifying even the true relationships in the data set, missing part of the data structure. Thus, the indirect improvement of data interpretation obtained by imposing a simple structure may also yield misleading results in certain situations.

#### **3.1 MEDA**


MEDA is designed to find the true relationships in the data. Therefore, it is an alternative to rotation methods or, in general, to the simple structure approach. A main advantage of MEDA is that, unlike rotation or FA methods, it is applied over any projection subspace without actually modifying it. The benefit is twofold. Firstly, MEDA is straightforwardly applied in any subspace of interest: PCA (maximizing variance), PLS (maximizing correlations) and any other. By contrast, FA methods are typically based on complicated algorithms, several of which have not been extended to regression. Secondly, MEDA is also useful for model interpretation, since common factors and components are easily interrelated. This is quite useful, for instance, in the selection of the number of components.

MEDA is based on the missing-value estimation capability of projection models (22–27). The MEDA approach is depicted in Figure 3. Firstly, a projection model is fitted from the calibration *N* × *M* matrix **X** (and optionally **Y**). Then, for each variable *m*, the matrix **X**<sup>*m*</sup> is constructed, which is an *N* × *M* matrix filled with zeros except in the *m*-th column, where it contains the *m*-th column of matrix **X**. Using **X**<sup>*m*</sup> and the model, the scores are estimated with a missing data method. The known data regression (KDR) method (22; 25) is suggested at this point. From the scores, the original data is reconstructed and the estimation error computed. The variability of the estimation error is compared to that of the original data according to the following index of goodness of prediction:


$$q\_{A,(m,l)}^2 = 1 - \frac{||\hat{\mathbf{e}}\_{A,(l)}||^2}{||\mathbf{x}\_{(l)}||^2}, \quad \forall l \neq m. \tag{1}$$

where **ê**<sub>*A*,(*l*)</sub> corresponds to the estimation error for the *l*-th variable and **x**<sub>(*l*)</sub> is its actual value. The closer the value of the index is to 1, the more related variables *m* and *l* are. After all the indices corresponding to each pair of variables are computed, the matrix **Q**<sup>2</sup><sub>*A*</sub> is formed so that *q*<sup>2</sup><sub>*A*,(*m*,*l*)</sub> is located at row *m* and column *l*. For interpretation, when the number of variables is large, matrix **Q**<sup>2</sup><sub>*A*</sub> can be shown as a color map. Also, a threshold can be applied to **Q**<sup>2</sup><sub>*A*</sub> so that elements over this threshold are set to 1 and elements below the threshold are set to 0.

Fig. 3. MEDA technique: (1) model calibration, (2) introduction of missing data, (3) missing data imputation, (4) error computation, (5) computation of matrix **Q**<sup>2</sup><sub>*A*</sub>.

The procedure depicted in Figure 3 is the original and more general MEDA algorithm. Nevertheless, provided KDR is the missing data estimation technique, matrix **Q**<sup>2</sup><sub>*A*</sub> can be computed from cross-product matrices following a more direct procedure. The value corresponding to the element in the *m*-th row and *l*-th column of matrix **Q**<sup>2</sup><sub>*A*</sub> in MEDA is equal to:

$$q\_{A,(m,l)}^2 = \frac{2 \cdot \mathcal{S}\_{ml} \cdot \mathcal{S}\_{ml^A} - (\mathcal{S}\_{ml^A})^2}{\mathcal{S}\_{mm} \cdot \mathcal{S}\_{ll}}. \tag{2}$$

where *S*<sub>*lm*</sub> stands for the cross-product of variables **x**<sub>*l*</sub> and **x**<sub>*m*</sub>, i.e. *S*<sub>*lm*</sub> = **x**<sub>*l*</sub><sup>T</sup> · **x**<sub>*m*</sub>, and *S*<sub>*lm*<sup>*A*</sup></sub> stands for the cross-product of variables **x**<sub>*l*</sub> and **x**<sup>*A*</sup><sub>*m*</sub>, with **x**<sup>*A*</sup><sub>*m*</sub> the projection of **x**<sub>*m*</sub> onto the model subspace in coordinates of the original space. Thus, *S*<sub>*lm*</sub> is the element in the *l*-th row and *m*-th column of the cross-product matrix **XX** = **X**<sup>T</sup> · **X**, and *S*<sub>*lm*<sup>*A*</sup></sub> corresponds to the element in the *l*-th row and *m*-th column of matrix **XX** · **P**<sub>*A*</sub> · **P**<sub>*A*</sub><sup>t</sup> in PCA and of matrix **XX** · **R**<sub>*A*</sub> · **P**<sub>*A*</sub><sup>t</sup> in PLS, with:

$$\mathbf{R}\_A = \mathbf{W}\_A \cdot (\mathbf{P}\_A^T \cdot \mathbf{W}\_A)^{-1}. \tag{3}$$

The relationship between the MEDA algorithm and cross-product matrices was first pointed out by Arteaga (28), and it can also be derived from the original MEDA paper (11). Equation (2) represents a direct and fast procedure to compute MEDA, similar in nature to the algorithms for model fitting from cross-product matrices, namely the eigendecomposition (ED) for PCA and the kernel algorithms (29–31) for PLS.
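Equations (2) and (3) translate directly into a few lines of code. The following sketch (Python/numpy, written from the equations above rather than taken from any released MEDA implementation) computes the Q2 matrix for a PCA or PLS model; X is assumed to be the mean-centered (or auto-scaled) calibration matrix:

```python
import numpy as np

def meda(X, P, R=None):
    """MEDA matrix Q2_A of eq. (2). X: N x M preprocessed data; P: loadings of the
    A components (M x A). For PLS, pass R = W (P^T W)^{-1} (eq. (3)); for PCA, R = P."""
    if R is None:
        R = P
    XX = X.T @ X                                   # cross-product matrix X^T X
    XXA = XX @ R @ P.T                             # S^A terms: XX P_A P_A^t (PCA) or XX R_A P_A^t (PLS)
    denom = np.outer(np.diag(XX), np.diag(XX))     # S_mm * S_ll
    Q2 = (2.0 * XX * XXA - XXA ** 2) / denom
    np.fill_diagonal(Q2, 1.0)                      # convention: the index is only defined for l != m
    return Q2

# Example with the two-pipelines data set simulated in Section 3 (X assumed available):
Xc = X - X.mean(axis=0)
loadings = np.linalg.svd(Xc, full_matrices=False)[2].T    # columns are PCA loadings
Q2_2pcs = meda(Xc, loadings[:, :2])                        # structure as in Figure 4(a)
Q2_6pcs = meda(Xc, loadings[:, :6])                        # structure as in Figure 4(b)
```

Shown as a color map (for instance with matplotlib's imshow), these matrices reproduce plots like those in Figure 4.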

In Figure 4(a), the MEDA matrix corresponding to the 2-PC PCA model of the example in the previous section, the two independent pipelines, is shown. The structure in the data is elucidated from this matrix. The separation between the two pipelines is reflected in the fact that the upper-right and lower-left quadrants are close to zero. The relationships among variables corresponding to factors F1 and F2 are also apparent in both pipelines. Since the variability corresponding to factor F3 is barely captured by the first 2 PCs, it is not reflected in the matrix for either pipeline. Nevertheless, if 6 PCs are selected (Figure 4(b)), the complete structure in the data is clearly found.

Fig. 4. MEDA matrices from the PCA model fitted from the data in the two pipelines example: (a) 2 PCs and (b) 6 PCs.

MEDA improves the interpretation of both the data set and the fitted model without actually pursuing a simple structure. The result is that MEDA has better properties than rotation methods: it is more accurate and its performance does not deteriorate when the number of PCs is overestimated. Also, the output of MEDA does not depend on the normalization of the loadings, as rotated models do, and it is not limited to subspaces with two or three components at most. A comparison of MEDA with rotation methods is beyond the scope of this chapter; please refer to (11) for such a comparison and for a more algorithmic description of MEDA.

#### **3.2 Loading plots and MEDA**

The limitations of loading plots and the application of MEDA were introduced with the pipelines artificial data set. This is further illustrated in this section with two examples provided with the PLS-toolbox (32): the Wine data set, which is used in the documentation of the cited software to show the capability of PCA for improving data understanding, and the PLSdata data set, which is used to introduce regression models, including PLS. The reading of the analysis and discussion of both data sets in (32) is recommended.

Fig. 5. Loading plot of the first 2 PCs (46.03% and 32.11% of the variance) from the Wine data set provided with the PLS-toolbox (32).

As suggested in (32), two PCs are selected for the PCA model of the Wine data set. The corresponding loading plot is shown in Figure 5. According to the reference, this plot shows that variables HeartD and LifeEx are negatively correlated, this correlation being captured in the first component. Also, "wine is somewhat positively correlated with life expectancy, likewise, liquor is somewhat positively correlated with heart disease". Finally, beer, wine and liquor form a triangle in the figure, which "suggests that countries tend to trade one of these vices for others, but the sum of the three tends to remain constant". Notice that although these conclusions are interesting, some of them are not so evidently shown by the plot. For instance, Liquor is almost as close to HeartD as to Wine. Is Liquor correlated to Wine as it is to HeartD?

MEDA can be used to improve the interpretation of loading plots. In Figure 6(a), the MEDA matrix for the first PC is shown. It confirms the negative correlation between HeartD and LifeEx, and the weaker positive correlations between HeartD and Liquor and between LifeEx and Wine. Notice that these three relationships are three different common factors. Nevertheless, they all manifest in the same component, making the interpretation with loading plots more complex. The MEDA matrix for the second PC in Figure 6(b) shows the relationship between the three types of drinks. The fact that the second PC captures this relationship was not clear in the loading plot. Furthermore, the MEDA matrix shows that Wine and Liquor are not correlated, answering the question in the previous paragraph. Finally, this absence of correlation refutes the idea that countries tend to trade wine for liquor or vice versa, although this effect may be true for beer.

In the PLSdata data set, the aim is to obtain a regression model that relates 20 temperatures measured in a Slurry-Fed Ceramic Melter (SFCM) with the level of molten glass. The x-block contains 20 variables which correspond to temperatures collected in two vertical thermowells. Variables 1 to 10 are taken from the bottom to the top in thermowell 1, and variables 11 to 20 from the bottom to the top in thermowell 2. The data set includes 300 training observations and 200 test observations. This same data set was used to illustrate MEDA with PCA in (11), with the temperatures and the level of molten glass together in the same block of data.<sup>1</sup>

Fig. 6. MEDA matrices of the first PCs from the Wine data set provided with the PLS-toolbox (32): (a) first PC, (b) second PC, (c) first 2 PCs.

Fig. 7. Loading plots from the PLS model in the Slurry-Fed Ceramic Melter data set: (a) LV1 (81.55%) vs LV2 (5.18%), (b) LV1 vs LV3 (2.06%), (c) LV2 vs LV3.

<sup>1</sup> There are some results of the analysis in (11) which are not coherent with those reported here, since the data sets used do not contain the same observations.


Fig. 8. MEDA matrices from the PLS model in the Slurry-Fed Ceramic Melter data set: (a) first three LVs and (b) fourth LV.

Following the recommendations in (32), a 3-LV PLS model was fitted from a mean-centered x-block. The loading plots corresponding to the three possible pairs of LVs are shown in Figure 7. The first LV captures the predictive variance in the temperatures, with higher loadings for higher indices of the temperatures in each thermowell. This holds except for temperatures 10 and 20, which present their predictive variance in the second and third LVs, the variance in 10 being between three and four times higher than that in 20. The third LV also seems to discriminate between the two thermowells. For instance, in Figure 7(c) most temperatures of the first thermowell are in the upper part of the figure, whereas most temperatures of the second thermowell are in the lower part. Nevertheless, it should be noted that the third LV only captures 2% of the variability in the x-block and 1% of the variability in the y-block.

The information gained with the loading plots can be complemented with that in MEDA, which brings a much clearer picture of the structure in the data. The MEDA matrix for the PLS model with 3 LVs is shown in Figure 8(a). There is a clear auto-correlation effect in the temperatures, so that closer sensors are more correlated. This holds except for temperatures 10 and 20. Also, the corresponding temperatures in both thermowells are correlated, including 10 and 20. Finally, the temperatures at the bottom contain almost no predictive information on the level of molten glass. In (32), the predictive error by cross-validation is used to identify the number of LVs. Four LVs attain the minimum predictive error, but 3 LVs are selected since the fourth LV does not contribute much to the reduction of this error. In Figure 8(b), the contribution of this fourth LV is shown. It is capturing the predictive variability from the third to the fifth temperature sensors in the first thermowell, which are correlated. The corresponding variability in the second thermowell is already captured by the first 3 LVs. This is an example of the capability of MEDA for model interpretation, which can be very useful in the determination of the number of LVs. In this case, and depending on the application of the model, the fourth LV may be added to the model in order to balance the variance captured in both thermowells, even if the improvement in prediction performance is not high.
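For the PLS case, the same MEDA computation only needs the weights, loadings and rotations of the fitted model. A possible sketch with scikit-learn (the SFCM temperatures X and the molten-glass level y are assumed to be loaded from the PLS-toolbox data and are not reproduced here):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# X (N x 20 temperatures) and y (level of molten glass) are assumed to be available.
pls = PLSRegression(n_components=3, scale=False).fit(X, y)
P = pls.x_loadings_                     # P_A, 20 x 3
R = pls.x_rotations_                    # R_A = W_A (P_A^T W_A)^{-1}, eq. (3)

Xc = X - X.mean(axis=0)                 # PLSRegression mean-centers the data internally
XX = Xc.T @ Xc
XXA = XX @ R @ P.T
Q2_pls = (2.0 * XX * XXA - XXA ** 2) / np.outer(np.diag(XX), np.diag(XX))   # eq. (2)
```

Displayed as a color map, Q2_pls corresponds to the kind of matrix shown in Figure 8(a).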

The information provided by MEDA can also be useful for variable selection. In this example, temperatures 1-2 and 11-12 do not contribute to a relevant degree to the regression model. As shown in Table 1, if those variables are not used in the model, its prediction performance remains the same. Also, considering the correlation among thermowells, one may be tempted to use only one of the thermowells for prediction, reducing the associated costs of maintaining two thermowells. If this is done, only 8 predictor variables are used and the prediction performance is reduced, but not to a large extent. Correlated variables in a prediction model help to better discriminate between true structure and noise. For instance, in this example, when only the sensors of one thermowell are used, the PLS model captures more x-block variance and less y-block variance. This shows that more specific, noisy variance in the x-block is being captured. Using both thermowells reduces this effect. Another example of variable selection with MEDA will be presented in Section 5.

| Variables | Complete model | [3 : 10 13 : 20] | [3 : 10] | [13 : 20] |
|---|---|---|---|---|
| LVs | 3 | 3 | 3 | 3 |
| X-block variance (%) | 88.79 | 89.84 | 96.36 | 96.93 |
| Y-block variance (%) | 87.89 | 87.78 | 84.61 | 83.76 |
| RMSEC | 0.1035 | 0.1039 | 0.1166 | 0.1198 |
| RMSECV | 0.1098 | 0.1098 | 0.1253 | 0.1271 |
| RMSEP | 0.1396 | 0.1394 | 0.1522 | 0.1631 |

Table 1. Comparison of four PLS models in the Slurry-Fed Ceramic Melter data set. The variance captured in both blocks of data and the Root Mean Square Error of Calibration (RMSEC), Cross-validation (RMSECV) and Prediction (RMSEP) are compared.

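A comparison in the spirit of Table 1 can be assembled with standard tools. The sketch below (Python with scikit-learn; X_train, y_train, X_test and y_test are assumed to hold the 300 training and 200 test observations, and the 10-fold cross-validation scheme is an arbitrary choice, so the exact figures will differ from Table 1):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.ravel(y_true) - np.ravel(y_pred)) ** 2)))

# X_train, y_train, X_test, y_test are assumed to be available (300 + 200 observations).
subsets = {
    "complete model": list(range(20)),
    "[3:10 13:20]": list(range(2, 10)) + list(range(12, 20)),   # drop sensors 1-2 and 11-12
    "[3:10]": list(range(2, 10)),                               # thermowell 1 only
    "[13:20]": list(range(12, 20)),                             # thermowell 2 only
}

for name, cols in subsets.items():
    pls = PLSRegression(n_components=3, scale=False)
    pls.fit(X_train[:, cols], y_train)
    fitted = pls.predict(X_train[:, cols])
    cv_pred = cross_val_predict(pls, X_train[:, cols], y_train, cv=10)
    test_pred = pls.predict(X_test[:, cols])
    print(f"{name}: RMSEC={rmse(y_train, fitted):.4f} "
          f"RMSECV={rmse(y_train, cv_pred):.4f} RMSEP={rmse(y_test, test_pred):.4f}")
```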

#### **3.3 Correlation matrices and MEDA**

There is a close similarity between MEDA and correlation matrices. In this regard, equation (2) simplifies the interpretation of the MEDA procedure. The MEDA index combines the original variance with the model subspace variance in *S*<sub>*ml*</sub> and *S*<sub>*ml*<sup>*A*</sup></sub>. Also, the denominator of the index in eq. (2) is the original variance. Thus, those pairs of variables where a high amount of the total variance of one of them can be recovered from the other are highlighted. This is convenient for data interpretation, since only factors of high variance are highlighted. On the other hand, it is easy to see that when the number of LVs, *A*, equals the rank of **X**, then **Q**<sup>2</sup><sub>*A*</sub> is equal to the element-wise squared correlation matrix of **X**, **C**<sup>2</sup> (11). This can be observed in the following element-wise equality:

$$q\_{Rank(X),(m,l)}^2 = \frac{S\_{ml}^2}{S\_{mm} \cdot S\_{ll}} = c\_{(m,l)}^2. \tag{4}$$

This equivalence shows that matrix **Q**<sup>2</sup><sub>*A*</sub> has a structure similar to that of the element-wise squared correlation matrix. To elaborate on this similarity, the correlation matrix can easily be extended to the notion of latent subspace. The correlation matrix in the latent subspace, **C**<sub>*A*</sub>, can be defined as the correlation matrix of the reconstruction of **X** with the first *A* LVs. Thus, **C**<sub>*A*</sub> = **P**<sub>*A*</sub> · **P**<sub>*A*</sub><sup>t</sup> · **C** · **P**<sub>*A*</sub> · **P**<sub>*A*</sub><sup>t</sup> in PCA and **C**<sub>*A*</sub> = **P**<sub>*A*</sub> · **R**<sub>*A*</sub><sup>t</sup> · **C** · **R**<sub>*A*</sub> · **P**<sub>*A*</sub><sup>t</sup> in PLS. If the elements of **C**<sub>*A*</sub> are then squared, the element-wise squared correlation in the latent subspace, denoted **C**<sup>2</sup><sub>*A*</sub>, is obtained. Strictly speaking, each element of **C**<sup>2</sup><sub>*A*</sub> is defined as:

$$c\_{A,(m,l)}^2 = \frac{\mathcal{S}\_{m^A l^A}^2}{\mathcal{S}\_{m^A m^A} \cdot \mathcal{S}\_{l^A l^A}}. \tag{5}$$


However, for the same reason explained before, if **C**<sup>2</sup><sub>*A*</sub> is aimed at data interpretation, the denominator should be the original variance:

$$c\_{A,(m,l)}^2 = \frac{\mathcal{S}\_{m^A l^A}^2}{\mathcal{S}\_{mm} \cdot \mathcal{S}\_{ll}}. \tag{6}$$

If this equation is compared to equation (2), we can see that the main difference between MEDA and the projected and element-wise squared correlation matrix is the combination of original and projected variance in the numerator of the former. This combination is paramount for interpretation. Figure 9 illustrates this. The example of the pipelines is used again, but in this case ten pipelines and only 20 observations are considered, yielding data of dimension 20 × 50. Two data sets are simulated. In the first one, the pipelines are correlated. As a consequence, the data present three common factors, represented by the three biggest eigenvalues in Figure 9(a). In the second one, each pipeline is independently generated, yielding a more distributed variance in the eigenvalues (Figure 9(b)). For matrices **Q**<sup>2</sup><sub>*A*</sub> and **C**<sup>2</sup><sub>*A*</sub> to infer the structure in the data, they should have large values in the elements which represent real structural information (common factors) and low values in the rest of the elements. Since in both data sets it is known a priori which elements in the matrices represent actual common factors and which do not, the mean values for the two groups of elements in matrices **Q**<sup>2</sup><sub>*A*</sub> and **C**<sup>2</sup><sub>*A*</sub> can be computed. The ratio of these means, computed by dividing the mean of the elements with common factors by the mean of the elements without common factors, is a measure of the discrimination capability between structure and noise of each matrix. The higher this index is, the better the discrimination capability. This ratio is shown in Figures 9(c) and 9(d) for different numbers of PCs. **Q**<sup>2</sup><sub>*A*</sub> outperforms **C**<sup>2</sup><sub>*A*</sub> until all relevant eigenvalues are incorporated into the model. Also, **Q**<sup>2</sup><sub>*A*</sub> presents maximum discrimination capability for a reduced number of components. Notice that both alternative definitions of **C**<sup>2</sup><sub>*A*</sub> in equations (5) and (6) give exactly the same result, though equation (6) is preferred for visual interpretation.

Fig. 9. Comparison of MEDA and the projected and element-wise squared covariance matrix in the identification of the data structure from the PCA model fitted from the data in the ten pipelines example: (a) and (b) show the eigenvalues when the pipelines are correlated and independent, respectively, and (c) and (d) show the ratio between the mean of the elements with common factors and the mean of the elements without common factors in the matrices.
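The comparison of Figure 9 amounts to computing both matrices from the same PCA model and summarizing them with the ratio described above. A minimal sketch (Python/numpy; Xc, the loadings P_A and the boolean mask marking which pairs of variables share a common factor are assumed to be built as in the ten-pipelines experiment):

```python
import numpy as np

def meda_q2(X, P):
    """MEDA matrix Q2_A (eq. 2) for a PCA model with loadings P (M x A)."""
    XX = X.T @ X
    XXA = XX @ P @ P.T
    return (2.0 * XX * XXA - XXA ** 2) / np.outer(np.diag(XX), np.diag(XX))

def c2_latent(X, P):
    """Element-wise squared correlation in the latent subspace, eq. (6), for PCA."""
    XX = X.T @ X
    XA = X @ P @ P.T                           # reconstruction with A PCs
    SA = XA.T @ XA                             # S_{m^A l^A}
    return SA ** 2 / np.outer(np.diag(XX), np.diag(XX))

def discrimination_ratio(M, mask):
    """Mean over elements with a common factor divided by the mean over the rest,
    off-diagonal elements only, as in Figure 9(c)-(d)."""
    off = ~np.eye(M.shape[0], dtype=bool)
    return float(M[mask & off].mean() / M[~mask & off].mean())

# Xc: auto-scaled 20 x 50 ten-pipelines data; P_A: loadings of the first A PCs;
# mask: boolean 50 x 50 matrix of pairs sharing a common factor (known by construction).
ratio_meda = discrimination_ratio(meda_q2(Xc, P_A), mask)
ratio_c2 = discrimination_ratio(c2_latent(Xc, P_A), mask)
```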

#### **4. Connection between observations and variables**

The most relevant issue for data understanding is probably the connection between observations and variables. It is almost useless to detect certain details in the data distribution, such as outliers or clusters, if the set of variables related to these details is not identified. Traditionally, biplots (12) have been used for this purpose. In biplots, the scatter plots of loadings and scores are combined in a single plot. Apart from relevant considerations regarding the comparability of the axes in the plot, which is also important for any scatter plot, and of the scales of scores and loadings (18), biplots may be misleading simply because of the loading plot they include. At this point, a variant of MEDA, named observation-based MEDA or *o*MEDA, can be used to unveil the connection between observations and variables without the limitations of biplots.

#### **4.1** *o***MEDA**

*o*MEDA is a variant of MEDA to connect observations and variables. Basically, *o*MEDA is a MEDA algorithm applied over a combination of the original data and a dummy variable designed to cover the observations of interest. Take the following example: a number of subsets of observations {**C**<sub>1</sub>, ..., **C**<sub>*N*</sub>} form different clusters in the score plot, which are located far from the bulk of the data, **L**. One may be interested in identifying, for instance, the variables related to the deviation of **C**<sub>1</sub> from **L** without considering the rest of the clusters. For that, a dummy variable **d** is created so that observations in **C**<sub>1</sub> are set to 1, observations in **L** are set to -1, while the remaining observations are left at 0. Also, values other than 1 and -1 can be included in the dummy variable if desired. *o*MEDA is then performed using this dummy variable.


The *o*MEDA technique is illustrated in Figure 10. Firstly, the dummy variable is designed and combined with the data set. Then, a MEDA run is performed by predicting the original variables from the dummy variable. The result is a single vector, **d**<sup>2</sup><sub>*A*</sub>, of dimension *M* × 1, with *M* the number of original variables. In practice, the *o*MEDA index is slightly different from that used in MEDA. With **d** the dummy variable, designed to compare a set of observations with value 1 (or in general positive values) with another set with value -1 (or in general negative values), the *o*MEDA index follows:

$$d\_{A,(l)}^2 = \|\mathbf{x}\_{(l)}^d\|^2 - \|\hat{\mathbf{e}}\_{A,(l)}^d\|^2 \quad \forall l. \tag{7}$$


where **x**<sup>*d*</sup><sub>(*l*)</sub> represents the values of the *l*-th variable in the original observations different from 0 in **d**, and **ê**<sup>*d*</sup><sub>*A*,(*l*)</sub> is the corresponding estimation error. The main difference between the computation of index *d*<sup>2</sup><sub>*A*,(*l*)</sub> in *o*MEDA and that of MEDA is the absence of the denominator in the former. This modification is convenient to avoid high values in *d*<sup>2</sup><sub>*A*,(*l*)</sub> when the amount of variance of a variable in the reduced set of observations of interest is very low. Once **d**<sup>2</sup><sub>*A*</sub> is computed for a given dummy variable, sign information can be added from the mean vectors of the two groups of observations considered (33).

Fig. 10. *o*MEDA technique: (1) introduction of the dummy variable, (2) model calibration, (3) introduction of missing data, (4) missing data imputation, (5) error computation, (6) computation of vector **d**<sup>2</sup><sub>*A*</sub>.

In practice, in order to avoid any modification of the PCA or PLS subspace due to the introduction of the dummy variable, the *o*MEDA algorithm is slightly more complex than the procedure shown in Figure 10. For a description of this algorithm, refer to (33). However, as in MEDA, the *o*MEDA vector can be computed in a more direct way by assuming that KDR (26) is used as the missing data estimation procedure. If this holds, the *o*MEDA vector follows:

$$d\_{A,(l)}^2 = 2 \cdot \mathbf{x}\_{(l)}^T \cdot \mathbf{D} \cdot \mathbf{x}\_{A,(l)} - \mathbf{x}\_{A,(l)}^T \cdot \mathbf{D} \cdot \mathbf{x}\_{A,(l)}, \tag{8}$$

where $\mathbf{x}_{(l)}$ represents the *l*-th variable in the *complete* set of original observations, $\mathbf{x}_{A,(l)}$ its projection in the latent subspace expressed in the coordinates of the original space, and:


$$\mathbf{D} = \frac{\mathbf{d} \cdot (\mathbf{d})^T}{||\mathbf{d}||^2}. \tag{9}$$

Finally, the equation can also be re-expressed as follows:

$$d\_{A,(l)}^2 = \frac{1}{N} \cdot (2 \cdot \Sigma\_{(l)}^d - \Sigma\_{A,(l)}^d) \cdot |\Sigma\_{A,(l)}^d| \tag{10}$$

with $\Sigma^d_{(l)}$ and $\Sigma^d_{A,(l)}$ being the weighted sums of the elements in $\mathbf{x}_{(l)}$ and $\mathbf{x}_{A,(l)}$, respectively, according to the weights in **d**. Equation (10) has two advantages. Firstly, it presents the *o*MEDA vector as a weighted sum of values, which is easier to understand. Secondly, the sign computation is built in, due to the absolute value in the last factor. Notice also that *o*MEDA inherits the combination of total and projected variance present in MEDA.

Figure 11 shows an example of *o*MEDA. For this, a 100 × 100 data set with two clusters of observations was simulated. The distribution of the observations was designed so that the two clusters had significantly different values only in variable 10, and the data were then auto-scaled. The *o*MEDA vector clearly highlights variable 10 as the main difference between the two clusters.

Fig. 11. *o*MEDA vector of two clusters of data from the 10 PCs PCA model of a simulated data set of dimension 100 × 100. This model captures 30% of the variability. Data present two clusters in variable 10.
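As a worked illustration of Eq. (10), the following is a minimal NumPy sketch of the *o*MEDA vector for a PCA model, applied to a simulation in the spirit of Figure 11. The function name `omeda`, the random-data generation and the cluster construction are illustrative assumptions, not the published implementation.

```python
import numpy as np

def omeda(X, P, d):
    """oMEDA vector following Eq. (10).

    X : (N x M) preprocessed data matrix
    P : (M x A) loadings of the PCA model with A PCs
    d : (N,)    dummy vector, e.g. +1 / -1 / 0 weights per observation
    """
    N = X.shape[0]
    X_A = X @ P @ P.T       # projection onto the model subspace, in original coordinates
    sigma = d @ X           # weighted sums Sigma^d_(l) for every variable l
    sigma_A = d @ X_A       # weighted sums Sigma^d_A,(l)
    return (2.0 * sigma - sigma_A) * np.abs(sigma_A) / N

# Simulated example similar to Figure 11 (assumed setup)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
X[:50, 9] += 3.0                                       # two clusters differing only in variable 10
X = (X - X.mean(0)) / X.std(0)                         # auto-scaling
P = np.linalg.svd(X, full_matrices=False)[2][:10].T    # loadings of a 10-PC PCA model
d = np.r_[np.ones(50), -np.ones(50)]                   # dummy: cluster 1 vs cluster 2
print(np.argmax(np.abs(omeda(X, P, d))))               # expected to point at variable 10 (index 9)
```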

#### **4.2 Biplots vs oMEDA**

Let us return to the discussion regarding the relationship between the common factors and the components. As already commented, several common factors can be captured by the same component in a projection model. As a result, a group of variables may be located close together in a loading plot without necessarily being correlated. This is also true for the observations. Thus, two observations closely located in a score plot may be quite similar or quite different depending on their scores in the remaining LVs. However, score plots are typically employed to observe the general distribution of the observations. This exploration is aimed more at finding differences among observations than similarities. Because of this, the problem described for loading plots is not so relevant for the typical use of score plots. However, it is a problem when interpreting biplots. In biplots, deviations in the observations are related to deviations in the variables. Like loading plots, biplots may be useful for a quick overview of the data, but any conclusion should be confirmed with another technique. *o*MEDA is perfectly suited for this.


Fig. 12. Biplot of the first 2 PCs from the Wine Data set provided with the PLS-toolbox (32).

Fig. 13. Score plot of the first 2 PCs from the Wine Data set provided with the PLS-toolbox (32) and two artificial observations.

In Figure 12, the biplot of the Wine data set is presented. The distribution of the scores shows that all countries but Russia follow a trend from the Czech Republic to France, while Russia is located far from this trend. The biplot may be useful to formulate hypotheses about the variables related to the special nature of Russia or to the trend in the rest of the countries. Nevertheless, formulating such hypotheses is not straightforward. To illustrate this, in Figure 13 the scores are shown together with two artificial observations. The artificial observations were designed to lie close to Russia in the score plot of the first two PCs. In both cases, three of the five variables were left at their average value in the Wine data set and only two variables were set to approach Russia. Thus, observation W&LE only uses variables Wine and LifeEx to yield a point close to Russia in the score plot, while the other variables are set to the average. Observation L&B only uses Liquor and Beer. With this example, it can be concluded that very little can be said about Russia only by looking at the biplot.
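As a side note, artificial observations such as W&LE and L&B can be built by fixing all but two auto-scaled variables at 0 (their mean) and solving a small linear system so that the scores on the first two PCs coincide with those of a target observation. A hedged sketch follows, assuming a 5 × 2 loading matrix `P` for the Wine model and the variable order Liquor, Wine, Beer, LifeEx, HeartD; names and data loading are placeholders.

```python
import numpy as np

def artificial_observation(P, t_target, free):
    """Return a vector that is zero except in the two `free` variables,
    chosen so that its scores x @ P equal t_target (a 2x2 linear system)."""
    x = np.zeros(P.shape[0])
    x[free] = np.linalg.solve(P[free, :].T, t_target)
    return x

# t_russia = X_wine[countries.index("Russia")] @ P        # Russia's scores (assumed data)
# wle = artificial_observation(P, t_russia, free=[1, 3])  # only Wine and LifeEx are non-zero
# lb  = artificial_observation(P, t_russia, free=[0, 2])  # only Liquor and Beer are non-zero
```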


Fig. 14. oMEDA vectors of the first 2 PCs from the Wine Data set provided with the PLS-toolbox (32). Russia (a), L&B (b) and W&LE (c) compared to the rest of countries.

In Figure 14(a), the oMEDA vector to discover the differences between Russia and the rest of the countries is shown. For this, a dummy variable is built where all observations except Russia are set to -1 and Russia is set to 1. oMEDA shows that Russia has in general a lower life expectancy and more heart disease and liquor consumption than the rest of the countries. The same experiment is repeated for the artificial observations L&B and W&LE in Figures 14(b) and 14(c). oMEDA clearly distinguishes among the three observations, while in the biplot they seem to be very similar.
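For illustration, the dummy variable used for Figure 14(a) could be built as follows. The country list, the data matrix `X_wine`, the 2-PC loadings `P_wine` and the `omeda` helper from the earlier sketch are assumptions for the example, not part of the PLS-toolbox.

```python
import numpy as np

countries = ["France", "Italy", "Switz", "Austra", "Brit",
             "U.S.A.", "Japan", "Mexico", "Russia", "Czech"]      # assumed order

d_russia = np.where(np.array(countries) == "Russia", 1.0, -1.0)   # Russia vs the rest
# v_russia = omeda(X_wine, P_wine, d_russia)                      # vector of Figure 14(a)
```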

To analyze the trend shown by all countries except Russia in Figure 12, the simplest approach is to compare the most separated observations, in this case France and the Czech Republic. The oMEDA vector is shown in Figure 15(a). In this case, the dummy variable is built so that France has value 1, the Czech Republic has value -1 and the rest of the countries have value 0. Thus, positive values in the oMEDA vector identify variables with a higher value in France than in the Czech Republic, and negative values the opposite. oMEDA shows that the French consume more wine and less beer than Czech people. Also, according to the data, the former seem to be healthier.

Comparing the two most separated observations may be misleading in certain situations. Another choice is to use the capability of oMEDA to unveil the variables related to any direction in a score plot. For instance, let us analyze the trend of the countries incorporating the information in all the countries. For this, different weights are considered in the dummy variable. We can think of these weights as approximate projections of the observations onto the direction of interest. Following this approach, the weights listed in Table 2 are assigned, which approximate the projection of the countries onto the imaginary line depicted by the arrow in Figure 13. Since Russia is not in the trend, its weight is left at 0. Using these weights, the resulting oMEDA vector is shown in Figure 15(b). In this case, the analysis of the complete set of observations in the trend resembles the conclusions of the analysis of the two most separated observations.


| Country | Weight | Country | Weight |
|---------|--------|---------|--------|
| France  | 3      | Mexico  | -1     |
| Italy   | 2      | U.S.A.  | -1     |
| Switz   | 1      | Austra  | -1     |
| Japan   | 1      | Brit    | -1     |
| Russia  | 0      | Czech   | -3     |

Table 2. Weights used in the dummy variable for the oMEDA vector in Figure 15(b).
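Continuing the hypothetical snippet above, the weighted dummy variable of Table 2 simply replaces the ±1 values:

```python
# Weights from Table 2 (same assumed country order as before); Russia is left at 0
weights = {"France": 3, "Italy": 2, "Switz": 1, "Japan": 1, "Russia": 0,
           "Mexico": -1, "U.S.A.": -1, "Austra": -1, "Brit": -1, "Czech": -3}
d_trend = np.array([weights[c] for c in countries], dtype=float)
# v_trend = omeda(X_wine, P_wine, d_trend)   # oMEDA vector of Figure 15(b)
```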


Fig. 15. oMEDA vectors of the first 2 PCs from the Wine Data set provided with the PLS-toolbox (32). In (a), France and Czech Republic are compared. In (b), the trend shown in the score plot by all countries except Russia is analyzed.

Fig. 16. Measured vs predicted values of molten glass level in the PLS model with 3 LVs fitted from the Slurry-Fed Ceramic Melter (SFCM) data set in (32).

Let us return to the PLSdata data set. A PLS model relating temperatures with the level of molten glass was previously fitted. As already discussed, the data set includes 300 training observations and 200 test observations. The measured and predicted values of both sets of observations are compared in Figure 16(a). The predicted values in the test observations (inverted triangles) tend to be higher than the true values. This is also observed in Figure 16(b). The cause for this seems to be that the process has slightly moved from the operation point where the training data were collected. oMEDA can be used to identify this change of operation point by simply comparing training and test observations in the model subspace. Thus, training observations (value 1) and test observations (value -1) are compared in the subspace spanned by the first 3 LVs of the PLS model fitted only from the training data. The resulting oMEDA vector is shown in Figure 17. According to the result, and considering that the test observations have value -1 in the dummy variable, it can be concluded that the process has moved to a situation in which top temperatures are higher than during model calibration.

Fig. 17. oMEDA vector comparing training and test observations in the PLS model with 3 LVs fitted from the Slurry-Fed Ceramic Melter (SFCM) data set in (32).
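A hedged sketch of this training-versus-test comparison is given below, assuming `X_train` (300 × K), `X_test` (200 × K) and `y_train` hold the preprocessed SFCM data. The back-projection onto the model subspace uses the common PLS relation $X_A = X R P^T$, with $R$ the rotations and $P$ the X-loadings; scikit-learn attribute names are used and Eq. (10) is applied directly to the back-projected data.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=3, scale=False).fit(X_train, y_train)
R, P = pls.x_rotations_, pls.x_loadings_                  # scores T = X R ; reconstruction X_A = T P'

X_all = np.vstack([X_train, X_test]) - X_train.mean(0)    # centre with the training mean
X_A = X_all @ R @ P.T                                     # projection in original coordinates

d = np.r_[np.ones(len(X_train)), -np.ones(len(X_test))]  # training = +1, test = -1
sigma, sigma_A = d @ X_all, d @ X_A
omeda_vec = (2 * sigma - sigma_A) * np.abs(sigma_A) / len(d)   # Eq. (10), as in Figure 17
```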

#### **5. Case study: Selwood data set**

In this section, an exploratory data analysis of the Selwood data set (34) is carried out. The data set was downloaded from http://michem.disat.unimib.it/chm/download/datasets.htm. It consists of 31 antifilarial antimycin $A_1$ analogues for which 53 physicochemical descriptors were calculated for Quantitative Structure-Activity Relationship (QSAR) modelling. The set of descriptors is listed in Table 3. These descriptors are used for predicting in vitro antifilarial activity (-LOGEC50). This data set has been employed for testing variable selection methods, for instance in (35; 36), in order to find a reduced number of descriptors with the best prediction performance. Generally speaking, these variable selection methods are based on complex optimization algorithms which make use of heuristics to reduce the search space.



Fig. 18. Scores corresponding to the first LV in the PLS model with the complete set of descriptors of the Selwood dataset.

| Indices | Descriptors |
|---------|-------------|
| 1:10  | ATCH1 ATCH2 ATCH3 ATCH4 ATCH5 ATCH6 ATCH7 ATCH8 ATCH9 ATCH10 |
| 11:20 | DIPV\_X DIPV\_Y DIPV\_Z DIPMOM ESDL1 ESDL2 ESDL3 ESDL4 ESDL5 ESDL6 |
| 21:30 | ESDL7 ESDL8 ESDL9 ESDL10 NSDL1 NSDL2 NSDL3 NSDL4 NSDL5 NSDL6 |
| 31:40 | NSDL7 NSDL8 NSDL9 NSDL10 VDWVOL SURF\_A MOFI\_X MOFI\_Y MOFI\_Z PEAX\_X |
| 41:50 | PEAX\_Y PEAX\_Z MOL\_WT S8\_1DX S8\_1DY S8\_1DZ S8\_1CX S8\_1CY S8\_1CZ LOGP |
| 51:53 | M\_PNT SUM\_F SUM\_R |

Table 3. Physicochemical descriptors of the Selwood dataset.


Fig. 19. Several vectors corresponding to the first LV in the PLS model with the complete set of descriptors of the Selwood dataset.

First of all, a PLS model is calibrated between the complete set of descriptors and -LOGEC50. Leave-one-out cross-validation suggests one LV. The scores of the 31 analogues on that LV are shown in Figure 18. Four of the compounds, namely K17, J1, J19 and K18, present an abnormally low score. This deviation contributes strongly to the variance in the first LV, and the reason for it should be investigated. Two of these compounds, K17 and K18, were catalogued as outliers by (34), where the authors stated that "Chemically, these compounds are distinct from the bulk of the training set in that they have an n-alkyl side chain as opposed to a side chain of the phenoxy ether type". Since the four compounds present an abnormally low score in the first LV, the analyst would typically interpret the coefficients of that LV to try to explain this abnormality. In Figure 19, the loadings, weights and regression coefficients of the PLS model are presented together with the *o*MEDA vector. The latter identifies the variables related to the deviation of the four compounds from the rest. The *o*MEDA vector is similar, but with opposite sign, to the other vectors in several descriptors, but quite different in others. Therefore, the loadings, weights or coefficient vectors should not be used in this case for the investigation of the deviation, or otherwise one may arrive at incorrect conclusions. On the other hand, it may be worth checking whether the *o*MEDA vector is representative of the deviation in the four compounds. Performing *o*MEDA individually on each of the compounds confirms this fact (see Figure 20).
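A possible sketch of this model-building step with scikit-learn is shown below; `X` (31 × 53 auto-scaled descriptors) and `y` (-LOGEC50) are assumed to be already loaded, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

for a in range(1, 6):                                   # leave-one-out CV over the number of LVs
    y_cv = cross_val_predict(PLSRegression(n_components=a, scale=False),
                             X, y, cv=LeaveOneOut()).ravel()
    print(a, round(q2(y, y_cv), 2))                     # the text reports that 1 LV is suggested

scores = PLSRegression(n_components=1, scale=False).fit(X, y).x_scores_[:, 0]   # cf. Figure 18
```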


Fig. 20. *o*MEDA vectors corresponding to the first LV in the PLS model with the complete set of descriptors of the Selwood dataset. To compute each of the vectors, one of the four compounds K17, J1, J19 and K18 is set to 1 in the dummy variable, the other three are set to 0, and the rest of the compounds in the data set are set to -1.

The subsequent step is to search for relevant descriptors (variable selection). For this, MEDA will be employed. At this point, there are two choices. The four compounds with a low score in the first LV may be treated as outliers and separated from the rest of the data (34), or the complete data set may be modelled with a single QSAR model (35; 36). It should be noted that differences among observations in one model may not be found in a different model, so that the same observation may be an outlier or a normal observation depending on the model. Furthermore, as discussed in (35), the more general the QSAR model, i.e. the wider the set of compounds it models, the better. Therefore, the complete set of compounds will be considered in the remainder of the example. On the other hand, regarding the analysis tools used, there are different possibilities. MEDA may be applied over the PLS model relating the descriptors in the x-block to -LOGEC50 in the y-block. Alternatively, both blocks may be joined into a single block of data and MEDA applied with PCA. The second choice is generally preferred to avoid over-fitting, but typically both approaches lead to the same conclusions, as happens in the present example.

The application of MEDA requires selecting the number of PCs in the PCA model. Considering that the aim is to understand how the variability in -LOGEC50 is related to the descriptors in the data set, the Structural and Variance Information (SVI) plots are the adequate analysis tool (14). The SVI plots combine variance information with structural information to elucidate how a PCA model captures the variance of a single variable. The SVI plot of a variable *v* reveals how the following indices evolve with the addition of PCs to the PCA model: $R^2_{A,m}$, $Q^2_{A,m}$, $\alpha^A_m(i)$ and $\alpha^A$.
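The exact SVI indices are defined in (14) and are not reproduced here. As a rough, assumption-laden sketch of the variance-related part, one can trace, for a growing number of PCs, the variance of a chosen variable captured by the PCA reconstruction and a $Q^2$ obtained by imputing that variable with known data regression (KDR); for the present example, `Z` would be the auto-scaled 31 × 54 block joining the descriptors and -LOGEC50, and `v` the column index of -LOGEC50.

```python
import numpy as np

def r2_q2_per_pc(Z, v, max_pcs=15):
    """Captured variance (R2) and KDR-imputation Q2 of column v, per number of PCs."""
    Vt = np.linalg.svd(Z, full_matrices=False)[2]
    known = np.delete(np.arange(Z.shape[1]), v)
    out = []
    for a in range(1, max_pcs + 1):
        P = Vt[:a].T                                    # loadings, M x a
        Z_a = Z @ P @ P.T                               # reconstruction with a PCs
        r2 = 1 - np.sum((Z[:, v] - Z_a[:, v]) ** 2) / np.sum(Z[:, v] ** 2)
        tau = np.linalg.lstsq(P[known, :], Z[:, known].T, rcond=None)[0]   # KDR scores, a x N
        z_hat = P[v, :] @ tau                           # imputation of the "missing" column v
        q2 = 1 - np.sum((Z[:, v] - z_hat) ** 2) / np.sum(Z[:, v] ** 2)
        out.append((a, r2, q2))
    return out
```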

Fig. 21. Structural and Variance Information (SVI) plot of in vitro antifilarial activity (-LOGEC50). The data set considered combines -LOGEC50 with the complete set of descriptors of the Selwood dataset.


Figure 21 shows the SVI plot of -LOGEC50 in the PCA model with the complete set of descriptors. The plot shows that the model remains quite stable until 5-6 PCs are included. This is seen in the closeness of the circles, which represent the different instances of *α* computed on a leave-one-out cross-validation run. The main portion of variability in -LOGEC50 is captured by the second and eighth PCs. Nevertheless, it is not until the tenth PC that the missing data imputation ($Q^2$) yields a high value. For more PCs, the captured variability is only slightly augmented. Since MEDA makes use of the missing data imputation of a PCA model, $Q^2$ is a relevant index. At the same time, from equation (2) it is clear that MEDA is also influenced by the captured variance. Thus, 10 PCs are selected. In any case, it should be noted that MEDA is quite robust to overestimation of the number of PCs (11), and very similar MEDA matrices are obtained for 3 or more PCs in this example.

The MEDA matrix corresponding to the PCA model with 10 PCs from the data set which combines -LOGEC50 with the complete set of descriptors of the Selwood dataset is presented in Figure 22. For variable selection, the most relevant part of this matrix is the last column (or row), which corresponds to -LOGEC50. This vector is shown in Figure 23(a). The descriptors with a high value in this vector are the ones from which -LOGEC50 can be better predicted. Nevertheless, selecting, say, the first *n* variables with the highest values is not an adequate strategy, because the relationships among the descriptors should also be considered. Let us select the descriptor with the best prediction performance, in this case ATCH6, though ATCH1, ATCH3, ATCH7 or SUM\_F have a very similar prediction performance. The least squares regression model with ATCH6 as regressor attains a $Q^2$ equal to 0.30, more than the $Q^2$ attained by any number of LVs in the PLS model with the complete set of descriptors. If, for instance, ATCH6 and ATCH1 are used as regressors, $Q^2 = 0.26$ is obtained for least squares regression and $Q^2 = 0.31$ for PLS with 1 LV, which means an almost negligible improvement with the addition of ATCH1. The facts that the improvement is low and that the 1 LV PLS model outperforms the least squares model are caused by the correlation between ATCH6 and ATCH1, a correlation clearly pointed out in the MEDA matrix (see the element at the sixth column and first row, or the first column and sixth row). Clearly, both ATCH6 and ATCH1 are related to the same common factor in -LOGEC50. However, the variability in -LOGEC50 is the result of several sources of variability, which may be common factors with other descriptors.
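The regressor comparison described above can be reproduced in outline as follows; `X` and `y` are the assumed auto-scaled descriptor block and -LOGEC50 vector, and the column indices of ATCH6 and ATCH1 follow the ordering of Table 3 (columns 5 and 0).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2_loo(model, Xs, y):
    y_cv = cross_val_predict(model, Xs, y, cv=LeaveOneOut()).ravel()
    return 1 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)

print(q2_loo(LinearRegression(), X[:, [5]], y))       # ATCH6 alone (Q2 reported around 0.30)
print(q2_loo(LinearRegression(), X[:, [5, 0]], y))    # ATCH6 + ATCH1, least squares
print(q2_loo(PLSRegression(n_components=1, scale=False), X[:, [5, 0]], y))   # PLS with 1 LV
```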

Fig. 22. MEDA matrix of the PCA model with 10 PCs from the data set which combines the in vitro antifilarial activity (-LOGEC50) with the complete set of descriptors of the Selwood dataset.

Fig. 23. MEDA vector corresponding to the in vitro antifilarial activity (-LOGEC50) (a), comparison between this vector and that corresponding to ATCH6 (b), and MEDA vector corresponding to LOGP (c), in the PCA model with 10 PCs from the data set which combines -LOGEC50 with the complete set of descriptors of the Selwood dataset.
