**Part 2**

**Biochemistry** 

Chemometrics in Practical Applications



### **Metabolic Biomarker Identification with Few Samples**

Pietro Franceschi, Urska Vrhovsek, Fulvio Mattivi and Ron Wehrens
*IASMA Research and Innovation Centre, Via E. Mach 1, 38010 S. Michele all'Adige (TN), Italy*

#### **1. Introduction**

Biomarker selection represents a key step in bioinformatic data processing pipelines; examples range from DNA microarrays (Tusher et al., 2001; Yousef et al., 2009) to proteomics (Araki et al., 2010; Oh et al., 2011) to metabolomics (Chadeau-Hyam et al., 2010). Meaningful biological interpretation is greatly aided by the identification of a "short-list" of features – biomarkers – characterizing the main differences between several states in a biological system. In a two-class setting the biomarkers are those variables (metabolites, proteins, genes ...) that allow discrimination between the classes. A class or group tag can describe many situations: it can distinguish treated from non-treated samples, mark different varieties of the same organism, etcetera. In the following, we will – for clarity – restrict the discussion to metabolomics, and the variables will represent concentration levels of metabolites, but similar arguments hold *mutatis mutandis* for other -omics sciences, such as proteomics and transcriptomics, where the variables correspond to protein levels or expression levels, respectively.

There are several reasons why the selection of biomarker short-lists can be beneficial:

• Predictive purposes: using only a small number of biomarkers in predictive class modeling in general leads to better, i.e., more robust and more accurate, predictions.
• Interpretative purposes: it makes sense to first concentrate on those metabolites that show clear differences in levels in the different classes, since our knowledge of metabolic networks in many cases is only scratching the surface.
• Discovery purposes: the complete characterization of unknown compounds identified in untargeted experiments is time- and resource-consuming. The primary focus should thus be placed on a carefully selected group of "unknowns" to be characterized at structural and functional level.


Two fundamentally different statistical approaches to biomarker selection are possible. With the first, experimental data can be used to construct multivariate statistical models of increasing complexity and predictive power – well-known examples are Partial Least Squares Discriminant Analysis (PLS-DA) (Barker & Rayens, 2003; Kemsley, 1996; Szymanska et al., 2011) or Principal Component Linear Discriminant Analysis (PC-LDA) (Smit et al., 2007; Werf et al., 2006). Inspection of the model coefficients then should point to those variables that are important for class discrimination. As an alternative, univariate statistical tests can be applied to individual variables, treating each one independently of the others and indicating which of them show significant differences between groups (see, e.g., Guo et al. (2007); Reiner et al. (2003); Zuber & Strimmer (2009)). Multivariate techniques are potentially more powerful in pin-pointing weak differences because they take into account correlation among the variables, but the models can be too closely adapted to the experimental data, leading to poor generalization capacity. Univariate approaches, in contrast, may both miss important "weak" effects and overestimate the importance of certain variables, because correlation between variables is not taken into account.

As for many sciences with the "omics" suffix, in metabolomics the number of experimental variables usually greatly exceeds the number of objects, especially with the development of new mass-spectrometry-based technologies. In MS-based metabolomics, high resolution mass spectrometers are often coupled with high performance chromatographic techniques, like Ultra Performance Liquid Chromatography (UPLC). In these experiments, the variables, i.e., the metabolites, are represented by mass/retention-time combinations, and it is typical to have numbers of features varying from several hundreds to several thousands, depending on the experimental and analytical conditions. This increase in experimental possibilities, however, does not correspond to a proportional increase in the number of available samples, which can be limited by the availability of biological samples, by laboratory practice, in particular when complex protocols are required, and also by ethical issues, when, for example, experiments on animals have to be planned.

All these constraints produce *small sample sets*, presenting serious challenges for the statistical analysis, mainly because there is simply not enough information to model the natural biological variability. The situation is critical for multivariate approaches where the parameters of the statistical model need to be optimized (e.g., the number of components in a PLS-DA model). For this purpose, the classical approach is to use sub-sampling in combination with estimates of predictive power, like cross-validation (Stone, 1974). In extreme conditions, i.e., really small sample sizes, this sub-sampling can give rise to inconsistent sub-models, and tuning in the classical way becomes virtually impossible (in Hanczar et al. (2010), as an example, conclusions are focused on ROC-based statistics (see below), but they are equally relevant for classical error estimates like the root-mean-square error of prediction, RMSEP). Multivariate techniques can still be applied to the full data set, but it is not possible to assess the reliability of the biomarker selection pipeline, even if it is still reasonable to think that the biomarkers contribute strongly to the statistical model. In these situations, univariate methods seem the best solution, also considering that several strategies exist to determine cut-off values in *t*-test based techniques (e.g., thresholding of *p* values subjected to some form of multiple testing correction (Benjamini & Hochberg, 1995; Noble, 2009; Reiner et al., 2003)). Regardless of the statistical strategy, for the "biomarkers" extracted in these conditions there is no obvious validation possible in the statistical sense; however, the results of the experiments are extremely important in the hypothesis generation phase to plan more informative investigations.

Interestingly, there is no literature on the effect of sample size on biomarker identification in the "omics" sciences, and the objective of this contribution is to fill this gap. We focus on a two-class problem, and in particular on small data sets. In our approach, real class differences have been introduced by spiking apple extracts with selected compounds, analyzing them using UPLC-TOF mass spectrometry, and comparing the feature lists to those of unspiked apple extracts. Using these data we are able to run a comparison between two multivariate methods (PLS-DA and PC-LDA) and the univariate *t*-test, leading to at least a rough estimate of how consistent biomarker discovery can be when small sample sizes are considered. In particular, we compare the effect of sample size reduction on multivariate and univariate models on the basis of Receiver Operating Characteristics (ROC) (Brown & Davis, 2005).

#### **2. Material and methods**


#### **2.1 Biomarker Identification**

There are many strategies for identifying differentially expressed variables in two-class situations – a recent overview can be found in Saeys et al. (2007). A general approach is to construct a model with good predictive properties, and to see which variables are important in such a model. Given the low sample-to-variable ratio, however, one cannot expect to be able to fit very complicated models, and in many cases a linear model is the best one can do (Hastie et al., 2001). The oldest and best-known technique is Linear Discriminant Analysis (LDA, McLachlan (2004)). One formulation of this technique, dating back to R.A. Fisher, is to find a linear combination of variables *a* that maximizes the ratio of the between-groups sum of squares *B* to the within-groups sum of squares *W*:

$$\mathbf{a}^T \mathbf{B} \mathbf{a} / \mathbf{a}^T \mathbf{W} \mathbf{a} \tag{1}$$

That is, *a* is the direction that maximizes the separation between the classes, both by having compact classes (a small within-groups variance) and by having the class centers far apart (a large between-groups variance). Large values in *a* indicate which variables are important in the discrimination. Another formulation is to calculate the Mahalanobis distance of a new sample *x* to the class centers $\mu_i$:

$$d(\mathbf{x}, i) = \left(\mathbf{x} - \boldsymbol{\mu}_i\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_i\right) \tag{2}$$

The new sample is then assigned to the class of the closest center. This approach is equivalent to Fisher's criterion for two classes (but not for more than two classes). In this equation, **Σ** is the (estimated) pooled covariance matrix of the classes. If the Mahalanobis distance to each class center is calculated using the individual class covariance matrices, the result is Quadratic Discriminant Analysis (QDA), which, as the name suggests, no longer leads to linear class boundaries. A final formulation is to use regression with indicator variables for the class. In a two-class situation one can use, e.g., the values of −1 and 1 for the two classes; positive predictions will be assigned to class 1, and negative predictions to class −1. In many other cases, 0 and 1 are used, and the class threshold is put at 0.5. When there are more than two classes, one can use a separate column in the dependent variable for every class – if a sample belongs to that class the column should contain 1, else 0. Again, the size of the regression coefficients indicates which of the variables contribute most to the discrimination.
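These formulations can be made concrete in a few lines of numpy. The sketch below is not from the chapter: the data, the sample sizes, and the effect sizes are invented for illustration. It uses the fact that, for two classes, Fisher's direction of Equation 1 is proportional to $\Sigma^{-1}(\mu_1 - \mu_2)$, and classifies a new sample by the Mahalanobis distance of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: 5 variables, of which only the first two differ
# between the classes (all sizes here are illustrative assumptions).
n, p = 30, 5
mu1 = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
mu2 = np.zeros(p)
X1 = rng.normal(mu1, 1.0, size=(n, p))
X2 = rng.normal(mu2, 1.0, size=(n, p))

# Pooled within-class covariance estimate (Sigma in Eq. 2).
Sigma = ((n - 1) * np.cov(X1, rowvar=False) +
         (n - 1) * np.cov(X2, rowvar=False)) / (2 * n - 2)

# For two classes, Fisher's direction a (Eq. 1) is proportional to
# Sigma^{-1} (mu1 - mu2); large |a_j| marks the discriminating variables.
a = np.linalg.solve(Sigma, X1.mean(0) - X2.mean(0))
print(np.argsort(-np.abs(a))[:2])   # indices of the two strongest variables

# Mahalanobis classification (Eq. 2): assign x to the nearest class center.
def mahalanobis(x, mu, Sigma_inv):
    d = x - mu
    return d @ Sigma_inv @ d

Sigma_inv = np.linalg.inv(Sigma)
x_new = rng.normal(mu1, 1.0)        # a fresh sample drawn from class 1
d1 = mahalanobis(x_new, X1.mean(0), Sigma_inv)
d2 = mahalanobis(x_new, X2.mean(0), Sigma_inv)
print("class 1" if d1 < d2 else "class 2")
```

With only 5 variables and 60 samples the pooled covariance is well-conditioned; as the text explains next, this breaks down as soon as the number of variables exceeds the number of samples.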

For most applications in the "omics" fields, even the most simple multivariate techniques such as Linear Discriminant Analysis (LDA) cannot be applied directly. From Equation 2 it is clear that an inverse of the covariance matrix **Σ** needs to be calculated, which is impossible in cases where the number of variables exceeds the number of samples. In practice, the number of samples is nowhere near the number of variables. For QDA, the situation is even worse: to allow a stable matrix inversion, every single class should have at least as many samples as variables (and preferably quite a bit more). A common approach is to compress the information in the data into a low number of latent variables (LVs), either using PCA (leading to PC-LDA, e.g., Smit et al. (2007); Werf et al. (2006)) or PLS (which gives PLS-DA; see Barker & Rayens (2003); Kemsley (1996)), and to perform the discriminant analysis on the resulting score matrices. These are not only of low dimension, but also orthogonal, so that the matrix inversion, the calculation of **Σ**<sup>−1</sup>, can be performed very fast and reliably. Both for PC-LDA and PLS-DA, the problem is usually cast in a regression context, where again the response variable *Y* can take values of either 0 or 1. The model thus becomes:

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathcal{E} \approx \mathbf{T}\mathbf{P}^{T}\mathbf{B} + \mathcal{E} \tag{3}$$

where $\mathcal{E}$ is the matrix of residuals. Matrix *X* is decomposed into a score matrix *T* and a loading matrix *P*, both consisting of a very low number of latent variables, typically less than ten or twenty. The coefficients for the scores, $A = P^T B$, can therefore easily be calculated in the normal way of least-squares regression:

$$\mathbf{A} = (\mathbf{T}^T \mathbf{T})^{-1} \mathbf{T}^T \mathbf{Y} \tag{4}$$

which by premultiplication with *P* leads to estimates for the overall regression coefficients *B*:

$$\mathbf{B} = \mathbf{P}\mathbf{A} \tag{5}$$

These equations are the same for both PLS-DA and PC-LDA. The difference lies in the decomposition of *X*. In PC-LDA, *T* and *P* correspond to the scores and loadings, respectively, from PCA. That is, the class of the samples is completely ignored, and the only criterion is to capture as much variance as possible from *X*. In PLS-DA, on the other hand, the scores and loadings are taken from a PLS model and the decomposition of *X does* take into account class information: the first PLS components by definition explain more, often much more, variance of *Y* than the first PCA components.
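Equations 3–5 in the PC-LDA variant can be sketched in numpy as follows. This is an illustrative sketch only: the data, the spiked variables, and the choice of three LVs are assumptions, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-class data: 40 samples, 10 variables, with variables 0
# and 1 carrying the class difference (sizes and effect are assumptions).
n, p, k = 40, 10, 3                  # k = number of latent variables
y = np.repeat([0.0, 1.0], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :2] += 2.0                 # the true "biomarkers"

# Standardize to mean zero and unit variance, as assumed in the text.
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
yc = y - y.mean()

# PCA by SVD: Xs ~ T P^T, with T the scores and P the loadings (Eq. 3).
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
T = (U * s)[:, :k]                   # score matrix, n x k
P = Vt[:k].T                         # loading matrix, p x k

# Eq. 4: least-squares coefficients for the scores; the columns of T are
# orthogonal, so T^T T is diagonal and trivially invertible.
A = np.linalg.solve(T.T @ T, T.T @ yc)

# Eq. 5: back-transform to coefficients for the original variables.
B = P @ A

# Large |B_j| flags the candidate biomarkers.
print(np.argsort(-np.abs(B))[:2])
```

For PLS-DA the only change is that *T* and *P* come from a PLS decomposition of *X* against *Y* instead of from PCA; Equations 4 and 5 are applied unchanged.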

Both methods, PC-LDA as well as PLS-DA, are usually very sensitive to the choice of the number of LVs. Taking too few LVs will lead to bad predictions since important information is missed. Taking too many, the model will be too flexible and will show a phenomenon known as *overtraining*: it is more or less learning all the examples in the training set by heart, but is not able to generalize and to make good predictions for new, unseen samples. As discussed, the assessment of the optimal number of LVs is nigh impossible with small sample sets. In the case under consideration, the extent of this effect is investigated by constructing several models with increasing numbers of LVs. Using real and simulated data sets (see below), models with 1–4, 6, and 8 LVs, respectively, are compared.

A simplification of statistical modeling can be obtained by ignoring all possible correlations between variables and assuming a diagonal covariance matrix, which leads to diagonal discriminant analysis (DDA). It can be shown that using the latter for feature selection corresponds to examining regular *t*-statistics (Zuber & Strimmer, 2009), and this is the approach we will take in this paper. For each variable, the difference between the class means $\bar{x}_{1i}$ and $\bar{x}_{2i}$ is transformed into a *z*-score by dividing by the appropriate standard deviation estimate $s_i$:

$$z_i = |\bar{x}_{1i} - \bar{x}_{2i}| / s_i \tag{6}$$

Using the appropriate number of degrees of freedom, these *z*-scores can be transformed into *p* values, which have the usual interpretation of the probability under the null hypothesis of encountering an observation with a value that is at least as extreme. In biomarker identification, *p* values can be used to sort the variables in order of importance and it is also possible to decide a cut-off value to identify variables which show "significant" differences from the null hypothesis.
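A small sketch of Equation 6 and the resulting *p* values, using numpy and scipy on invented data (the sample sizes and the single spiked variable are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative two-class data: 8 samples per class, 50 variables, with
# variable 0 truly shifted between the classes.
n, p = 8, 50
X1 = rng.normal(size=(n, p))
X2 = rng.normal(size=(n, p))
X2[:, 0] += 3.0                          # the single true biomarker

# Eq. 6: per-variable z-score from the difference of the class means,
# divided by the pooled standard error of that difference.
s2 = (X1.var(0, ddof=1) + X2.var(0, ddof=1)) / 2
z = np.abs(X1.mean(0) - X2.mean(0)) / np.sqrt(s2 * (2 / n))

# Two-sided p-values under the null, with 2n - 2 degrees of freedom.
pvals = 2 * stats.t.sf(z, df=2 * n - 2)

# Rank variables by p-value: the true biomarker should come out on top.
print(int(np.argsort(pvals)[0]))
```

With equal class sizes this is exactly the pooled two-sample *t*-test, so the same *p* values can be obtained with `scipy.stats.ttest_ind`.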

Generally speaking, the absolute size of coefficients is taken as a measure for the likelihood of being a true marker: the variable with the largest coefficient, in a PLS-DA model for example, is the first biomarker candidate, the second largest the second candidate, and so on. Note that this approach assumes that all variables have been standardized, i.e., scaled to mean zero and unit variance. This is often done in metabolomics to prevent dominance of highly abundant metabolites. Statistics from a *t*-test can be treated in the same way.

#### **2.2 Quality assessment**

4 Will-be-set-by-IN-TECH

to PC-LDA, e.g. Smit et al. (2007); Werf et al. (2006)) or PLS (which gives PLS-DA; see Barker & Rayens (2003); Kemsley (1996)), and to perform the discriminant analysis on the resulting score matrices. These are not only of low dimension, but also orthogonal so that the matrix inversion, the calculation of **Σ**<sup>−</sup>1, can be performed very fast and reliably. Both for PC-LDA and PLS-DA, the problem is more often usually cast in a regression context, where again the

response variable *Y* can take values of either 0 or 1. The model thus becomes:

*Y* = *XB* + *E* ≈ *TP*<sup>T</sup>*B* + *E* (3)

where *E* is the matrix of residuals. Matrix *X* is decomposed into a score matrix *T* and a loading matrix *P*, both consisting of a very low number of latent variables (LVs), typically less than ten or twenty. The coefficients for the scores, *A* = *P*<sup>T</sup>*B*, can therefore easily be calculated in the normal way of least-squares regression:

*A* = (*T*<sup>T</sup>*T*)<sup>−1</sup>*T*<sup>T</sup>*Y* (4)

which by premultiplication with *P* leads to estimates for the overall regression coefficients *B*:

*B* = *PA* (5)

These equations are the same for both PLS-DA and PC-LDA. The difference lies in the decomposition of *X*. In PC-LDA, *T* and *P* correspond to the scores and loadings, respectively, from PCA. That is, the class of the samples is completely ignored, and the only criterion is to capture as much variance as possible from *X*. In PLS-DA, on the other hand, the scores and loadings are taken from a PLS model and the decomposition of *X* does take into account class information: the first PLS components by definition explain more, often much more, variance of *Y* than the first PCA components.

Both methods, PC-LDA as well as PLS-DA, are usually very sensitive to the choice of the number of LVs. Taking too few LVs will lead to bad predictions since important information is missed. Taking too many, the model will be too flexible and will show a phenomenon known as *overtraining*: it more or less learns all the examples in the training set by heart but is not able to generalize and make good predictions for new, unseen samples. As discussed, the assessment of the optimal number of LVs is well-nigh impossible with small sample sets. In the case under consideration, the extent of this effect is investigated by constructing several models with increasing numbers of LVs. Using real and simulated data sets (see below), models with 1–4, 6, and 8 LVs, respectively, are compared.

A simplification of statistical modeling can be obtained by ignoring all possible correlations between variables and assuming a diagonal covariance matrix, which leads to diagonal discriminant analysis (DDA). It can be shown that using the latter for feature selection corresponds to examining regular *t*-statistics (Zuber & Strimmer, 2009), and this is the approach we will take in this paper. For each variable *i*, the difference between the class means *x̄*<sub>1*i*</sub> and *x̄*<sub>2*i*</sub> is transformed into a *z*-score by dividing by the appropriate standard deviation estimate *s<sub>i</sub>*:

*z<sub>i</sub>* = |*x̄*<sub>1*i*</sub> − *x̄*<sub>2*i*</sub>|/*s<sub>i</sub>* (6)

Using the appropriate number of degrees of freedom, these *z*-scores can be transformed into *p* values, which have the usual interpretation of the probability under the null hypothesis of encountering an observation with a value that is at least as extreme. In biomarker identification, *p* values can be used to sort the variables in order of importance, and it is also possible to select biomarkers by applying an explicit cut-off level *α*.
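The score-space regression of Eqs. (3)–(5) is compact enough to sketch directly. The chapter's analyses are carried out in R; the following is a minimal NumPy illustration (the helper name `score_space_coefficients` is ours, not from the chapter's code), using a PCA decomposition of *X* as in PC-LDA:

```python
import numpy as np

def score_space_coefficients(X, Y, n_lv):
    """Estimate regression coefficients B for a 0/1 class vector Y via a
    low-dimensional decomposition X ~ T P^T (here PCA, as in PC-LDA):
    A = (T^T T)^-1 T^T Y (Eq. 4), then B = P A (Eq. 5)."""
    Xc = X - X.mean(axis=0)                    # column-centred data matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = (U * S)[:, :n_lv]                      # score matrix, n_lv latent variables
    P = Vt.T[:, :n_lv]                         # loading matrix
    A, *_ = np.linalg.lstsq(T, Y, rcond=None)  # Eq. (4): least squares on scores
    return P @ A                               # Eq. (5): back to original variables

# toy example: 6 samples, 5 variables, 0/1 class membership
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 5))
Y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
B = score_space_coefficients(X, Y, n_lv=2)
print(B.shape)   # (5,): one regression coefficient per variable
```

For PLS-DA, Eqs. (4) and (5) are unchanged; only *T* and *P* would be taken from a supervised PLS decomposition instead of the SVD used here.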

To evaluate the performance of biomarker selections one typically relies on quantities like the fraction of true positives, i.e., that fraction of the real biomarkers that is actually identified by the selection method, and the false positives – those variables that have been selected but do not correspond to real differences. Similarly, true and false negatives can be defined. These statistics can be summarized graphically in an ROC plot (Brown & Davis, 2005), where the fraction of true positives (y-axis) is plotted against the fraction of false positives (x-axis). These two characteristics are also known as the sensitivity and the (complement of) specificity. An ideal biomarker identification method would lead to a position in the top left corner: all true biomarkers would be found (the fraction of true positives would be one, or close to one) with no or only very few false positives. Gradually relaxing the selection criterion, allowing more and more variables to be considered as biomarkers, generally leads to an increase in the true positive fraction (upwards in the plot), but also to an increase in the false positive fraction (in the plot to the right). The best biomarker selection method is obviously the one that finds all biomarkers very quickly, leading to a very steep ROC curve at the beginning.

A quantitative measure of the efficiency of a method can be obtained by calculating the area under the ROC curve (AUC). A value of one (or close to one) indicates that the method does a very good job in identifying biomarkers – all true biomarkers are found almost immediately. A value of one half indicates a completely random selection (this corresponds to the diagonal in the ROC plot). Values significantly lower than one half should not occur. In many cases, the most important area in the ROC plot is the left side, which indicates the efficiency of the model in selecting the most important biomarkers. Consequently, it is common to calculate a partial area under the curve (pAUC), for instance up to twenty percent of false positives (pAUC.2). In a method with higher pAUC, the true biomarkers will be present in the first positions of the candidate biomarkers list, hence this is the quantity that will be considered in the current paper.
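To make the construction concrete, the sketch below (hypothetical helper names, NumPy assumed) builds an ROC curve by admitting variables in order of decreasing absolute score, for instance the *z*-score of Eq. (6), and integrates the curve up to a chosen false positive fraction:

```python
import numpy as np

def roc_curve(scores, is_biomarker):
    """ROC points obtained by relaxing the selection criterion, i.e.
    admitting variables in order of decreasing |score|."""
    order = np.argsort(-np.abs(scores))
    truth = np.asarray(is_biomarker, dtype=bool)[order]
    tpr = np.cumsum(truth) / truth.sum()       # sensitivity
    fpr = np.cumsum(~truth) / (~truth).sum()   # 1 - specificity
    return fpr, tpr

def pauc(fpr, tpr, up_to=0.2):
    """Partial area under the ROC curve up to a false positive fraction
    `up_to` (pAUC.2 for up_to = 0.2), via trapezoidal integration."""
    grid = np.linspace(0.0, up_to, 201)
    vals = np.interp(grid, fpr, tpr, left=0.0)
    return float(np.sum((grid[1:] - grid[:-1]) * (vals[1:] + vals[:-1]) / 2.0))

# perfect ranking: the two true biomarkers receive the largest |score|
fpr, tpr = roc_curve(np.array([5.0, 4.0, 1.0, 0.5, 0.2]),
                     [True, True, False, False, False])
print(round(pauc(fpr, tpr), 2))   # 0.2, the maximum attainable pAUC.2
```

With a perfect ranking the partial area reaches the maximum attainable value of 0.2, matching the "steep initial ROC curve" criterion described above.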

#### **2.3 Apple data set**

Twenty apples, variety Golden Delicious, were purchased at the local store. Extracts of every single fruit were prepared according to Vrhovsek et al. (2004). The core of the fruit was removed with a corer and each apple was cut into equal slices. Three slices (cortex and skin) from opposite sides of each fruit were used for the preparation of aqueous acetone extracts. The samples were homogenized in a blender (Osterizer model 847-86, speed one) in a mixture of acetone/water (70/30 w/w). Before injection, the acetone was removed by rotary evaporation; the samples were brought back to the original volume with ethanol and filtered through a 0.22 *μ*m filter (Millipore, Bedford, USA). UPLC-MS spectra were acquired on an ACQUITY - SYNAPT Q-TOF (Waters, Milford, USA) in positive and negative ion mode with the chromatographic conditions summarized in Table 1. No technical replicates were performed. Raw data were transformed to the open NetCDF format with the DataBridge built-in utility of the MassLynx software.

| | |
|---|---|
| **HPLC** | ACQUITY UPLC (Waters) |
| Column | BEH C18 1.7 *μ*m, 2.1 × 50 mm |
| Column temperature | 40 °C |
| Injection volume | 5 *μ*l |
| Solvent A | 0.1% formic acid in H<sub>2</sub>O |
| Solvent B | 0.1% formic acid in MeOH |
| Eluent flux | 0.8 ml min<sup>−1</sup> |
| Gradient | linear, from 0 to 100% of solvent B in 10 minutes; 100% B for 2 minutes; back to 100% A within 0.1 minutes; equilibration for 2.9 minutes |
| **Mass Spectrometer** | SYNAPT Q-TOF (Waters) |
| Mass range | 50-3000 Da |
| Capillary | 3 kV |
| Sampling cone | 25 V |
| Extraction cone | 3 V |
| Source temperature | 150 °C |
| Desolvation temperature | 500 °C |
| Cone gas flow | 50 L h<sup>−1</sup> |
| Desolvation gas flow | 1000 L h<sup>−1</sup> |

Table 1. Chromatographic and spectrometric conditions of the spiked-apple data set.

Class differences were introduced by spiking ten of the twenty extracts with a number of selected compounds, leaving the other ten as "untreated" controls. The majority of the spiked compounds are known to be commonly present in apples, while two of them (*trans*-resveratrol and cyanidin-3-galactoside) are not naturally present in the chosen matrix. The concentrations of the specific compounds in the pooled extract are presented in Table 2; markers were added in different concentrations to test the identification pipeline in conditions which mimic those found in a typical metabolomic experiment, where variation is usually present at different concentration levels. As an example of what the data look like, the first control sample, measured in positive mode, is shown in Figure 1. The horizontal axis shows the chromatographic dimension, and the vertical axis the mass-to-charge ratio. Circles indicate features that have been identified in this plane. In the remainder only the extracted triplets for the features, consisting of retention time, mass-to-charge ratio and intensity, will be used.

Fig. 1. Visualization of the data of the first control sample, measured in positive mode. The top of the figure shows the square root of the Total Ion Current (TIC); background colour indicates the intensity of the signal in the plane formed by the retention time and *m/z* axes. Circles indicate features found by the peak picking; the fill colour of these circles indicates the intensity of the features.

| **Compound** | mg l<sup>−1</sup> pool | Δ Conc. (mg l<sup>−1</sup>) |
|---|---|---|
| quercetin-3-galactoside (querc-3-gal) | 5.69 | 1.48 |
| quercetin | 0.006 | 0.008 |
| quercetin-3-glucoside (querc-3-glc) | 1.05 | 0.3 |
| quercetin-3-rhamnoside (querc-3-rham) | 3.64 | 3.55 |
| phloridzin | 2.92 | 2.3 |
| cyanidin-3-galactoside (cy-3-gal) | n.d. | 0.57 |
| *trans*-resveratrol | n.d. | 0.4 |

Table 2. Spiked compound summary. The difference in concentration is relative to the one measured in the pooled extract. Cyanidin-3-galactoside and *trans*-resveratrol are not normally found in Golden Delicious.

Feature extraction is performed with XCMS (Smith et al., 2006) and all statistical analyses are carried out in R (R Development Core Team, 2011). The CentWave peak-picking algorithm (Tautenhahn et al., 2008) is applied, using the following parameter settings: ppm = 20, peakwidth = c(3,15), snthresh = 2, prefilter = c(3,5). The average numbers of detected features per chromatogram are 1179 and 610 for positive and negative ion mode, respectively.


After grouping across samples, features are screened for isotopes, clusters and common adducts with in-house developed software.

Due to fragmentation occurring in the ionization source, it is common for a single neutral molecule to give rise to several ionic species. A single spiked compound can then generate several "biomarkers" in the MS peak table. Adducts, isotopes and common clusters are automatically screened out, but fragments must be included in the biomarker list, as in real metabolomic experiments no a priori knowledge can be used to distinguish molecular from fragment ions. For the apple data set, the characteristic mass/retention time pairs for all spiked metabolites were identified by manual inspection of the UPLC-MS profiles of standards. For negative ions, the following numbers of features have been associated with the spike-in compounds: querc-3-gal/querc-3-glc (1 feature), phloridzin (2 features), *trans*-resveratrol (1), querc-3-rham (1). In positive ion mode the numbers are cy-3-gal (1), *trans*-resveratrol (1), querc-3-rham (1), quercetin (1) and phloridzin (4). These features are now taken to be the "true" biomarkers and are used to construct ROC curves. The data set, as well as a more extended version including different concentrations of spiked-in compounds, is publicly available in the R package BioMark (see http://cran.r-project.org/web/packages/BioMark, Wehrens & Franceschi (2011)) and has been used to evaluate a novel stability-based biomarker selection method (Wehrens et al., 2011).

In this application, the effects of decreasing sample size are investigated by subsampling the original set of twenty samples: sample sizes of 16, 12, 8 and 6 apples, respectively, are considered. In all cases, both classes (spiked and control) have equal sizes, which is the easiest case for detecting significant differences. Results are summarized by analysis of ROC curves – to prevent effects from accidentally easy or difficult subsets, the final ROC curves are obtained by averaging the results of 100 repeated re-samplings.
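A minimal sketch of this resampling protocol (hypothetical helper, NumPy assumed; the fitting of the selection methods themselves is elided):

```python
import numpy as np

rng = np.random.default_rng(7)

def balanced_subsample(labels, n_per_class, rng):
    """Indices of a random subset containing n_per_class samples from
    each class (equal class sizes, as in the text)."""
    idx = [rng.choice(np.flatnonzero(labels == cls), size=n_per_class,
                      replace=False)
           for cls in np.unique(labels)]
    return np.concatenate(idx)

labels = np.array([0] * 10 + [1] * 10)   # 10 control and 10 spiked apples
# subset sizes of 16, 12, 8 and 6 apples -> 8, 6, 4 and 3 per class
for n in (8, 6, 4, 3):
    subsets = [balanced_subsample(labels, n, rng) for _ in range(100)]
    assert all(len(s) == 2 * n for s in subsets)
    # ... rank variables on each subset, build an ROC curve, and average
    #     the 100 curves to obtain the final, more stable estimate
```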

#### **2.4 Simulated data sets**

To assess the behaviour of biomarker selection for larger data sets, we resort to simulation. Simulated data sets have been constructed as multivariate normal distributions, using the means and covariance matrices of the experimental data: both classes (untreated and spiked) have been simulated separately. Simulations are performed for both positive and negative modes; in every simulation, one hundred data sets are created. The outcomes reported here are the averages of the results for the one hundred simulations. Data sets consisting of 10, 25, 50 and 200 biological samples per class have been synthesized.
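A minimal sketch of this simulation scheme, with NumPy assumed and `X_control` standing in for the measured feature table of one class:

```python
import numpy as np

def simulate_class(X_class, n_samples, rng):
    """Draw synthetic samples from a multivariate normal distribution
    parameterized by the empirical mean and covariance of one class."""
    mu = X_class.mean(axis=0)
    sigma = np.cov(X_class, rowvar=False)
    return rng.multivariate_normal(mu, sigma, size=n_samples)

rng = np.random.default_rng(0)
X_control = rng.normal(size=(10, 4))   # stand-in for one measured class
for n in (10, 25, 50, 200):            # samples per class, as in the text
    sim = simulate_class(X_control, n, rng)
    assert sim.shape == (n, 4)
```

Each class is simulated separately, so the class difference enters only through the two empirical means and covariances.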

#### **3. Results and discussion**

As a first step, the data are visualized using Principal Component Analysis (PCA). Since the intensities of the features can vary enormously, standardized data are used. The score plots are shown in Figure 2 for the positive ion mode and in Figure 3 for the negative mode. In both cases, control and spiked data sets are not completely separated, and the same is also true for the other PCs (not shown). This indicates that the "inherent" variability of the data set is not perturbed to a significant extent by spiking, as could be expected considering the small number of affected variables.
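This visualization step amounts to autoscaling followed by PCA; a minimal sketch (hypothetical helper, NumPy assumed):

```python
import numpy as np

def pca_scores(X, n_pc=2):
    """Principal component scores of the autoscaled data matrix, plus the
    percentage of variance explained by each retained component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_pc] * S[:n_pc]                    # coordinates in the score plot
    explained = 100 * S[:n_pc] ** 2 / np.sum(S ** 2)   # % variance per PC
    return scores, explained

rng = np.random.default_rng(3)
scores, explained = pca_scores(rng.normal(size=(20, 50)))
print(scores.shape)   # (20, 2): one point per sample in the score plot
```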

Even with this data structure, biomarker selection strategies can still perform efficiently. Figure 2 and Figure 3 also display the score plots of a PCA analysis performed considering only the top 10 variables selected by univariate *t*-testing. In these conditions, the separation


Fig. 2. PCA score plot (PC1 vs PC2) for the positive ion mode data set after standardization. In the left plot the principal components have been calculated on the full data set. In the right panel PCA analysis has been performed considering only the top 10 variables selected by a *t*-test.

between control and spiked samples is evident, thus indicating that this subset of the variables separates the two classes. Whether these ten variables contain the true biomarkers remains to be seen: especially in small data sets there may be chance correlations causing false positives, and seeing differences between the two groups in score plots after *t*-testing is in fact trivial. The score plot merely shows that the variables, selected on the basis of their discriminating power, separate the two classes. As already discussed, small data sets will in general not capture all relevant biological variability, which implies that the predictive power of statistical models based on small data sets is usually very low. To illustrate this effect, the predictive power, i.e., the fraction of correct predictions, for PC-LDA and PLS-DA models is presented in Figure 4. Four subsets of different sizes are considered as training sets, and the estimate of predictive power is based on predictions for the apples not in the training set. Again, the results are the average over 100 different subsamplings. Even though the control and spiked subsets do differ, the predictive power of the multivariate methods is comparable to random guessing, meaning that for every subset different variables will be important in the models and no consistency can be achieved. It is important to point out that this does not mean that the true biomarkers are never consistently selected upon subsetting, but rather that the most important variables in the models change from one subset to another: even with models that have no predictive power it is possible to extract relatively good lists of putative biomarkers. Obviously, with very different characteristics for the two classes there *will* be predictive power, but for realistic data sets like the one used in this paper, where differences are small, it is unwise to focus solely on prediction.

To evaluate the efficiency of the different methods as far as biomarker selection is concerned, ROC curves for the *t*-test and two-component PLS-DA and PC-LDA models are presented in Figure 5, for 3, 4, 6 and 8 biological samples per class, respectively. The ROC curves indicate that all three variable selection methods perform significantly better than random selection.

Fig. 3. PCA score plot (PC1 vs PC2) for the negative ion mode data set after standardization. In the left plot the Principal Components have been calculated on the full data set. In the right panel PCA analysis has been performed considering only the top 10 variables selected by a *t*-test.

Fig. 4. Predictive power of multivariate PLS-DA and PC-LDA on a subset of the initial data set for positive and negative ion mode. Different lines correspond to models constructed with an increasing number of LVs. The horizontal dashed line indicates random selection.

Fig. 5. ROC curves for the *t*-test and two-component PLS-DA and PC-LDA as a function of the number of samples per class.

Of the three, PC-LDA is always the least efficient, while PLS-DA and the *t*-test have a very similar performance. In absolute terms, the efficiency of the three methods increases with the number of biological samples. ROC curves for all possible conditions were constructed and the results are summarized in terms of early AUC (pAUC.2) in Figure 6, for positive and negative ion mode, respectively. From these figures it is possible to extract some clear trends:

1. The performance of the methods improves with an increasing number of samples per class.
2. The performance of PLS-DA is not particularly sensitive to the number of components.
3. PC-LDA does not show top-class performance in any of the conditions considered.
4. The performance of PC-LDA is very much dependent on the number of components.
5. Multivariate approaches do not show a definitive advantage over univariate *t*-testing.


As expected, the performances of all the methods in terms of biomarker identification decrease with a reduction of the data set size. However, it is important to point out that even in the worst possible case (3 samples per class) early AUC for PLS-DA and the *t*-test are significantly greater than that obtained for completely random selection. This indicates that both methods can be used effectively in the biomarker selection phase, even with a low number of samples. In other words, features related to spiked compounds are consistently present in the top positions of the ordered list of experimental variables, which implies that also models constructed with very few samples can be relied upon to recognize these features.

The performance of PC-LDA is very much dependent on the number of components taken into account. This behavior can be explained by considering that in PC-LDA the variable reduction step is performed without any knowledge of class labels, only selecting the directions of greater variance. If these directions show little discriminating power, their supervised linear combination leads to poor modeling. However, performance improves with

Samples per Class

Fig. 7. pAUC.2 for PLS-DA, PC-LDA and *t*-test as a function of the number of samples per class and the number of LVs. Simulated data set. Gray dashed line indicates the pAUC.2 of

pAUC.2 on the number of replicates and of components is presented in Figure 7, comparing

This analysis shows that PC-LDA only becomes effective if a large number of LVs is considered: the true biomarkers should have appreciable weight in the latent variables and it is by no means certain that this is the case for the first couple of LVs. Is it worth noting that for negative ion mode, the model with 2 LVs is comparable to random selection. In the case of PLS-DA, this dependence on the number of LVs is less evident and shows an opposite trend: the best performance is obtained with the smallest number of LVs. This is in agreement with the explanation given earlier: the relevant variables are captured in the very first PLS components, and the effect of overtraining leads to deterioration if more components are added. If anything, it is surprising that the overtraining effect is relatively small for these

The results on the simulated data sets are in agreement with the conclusions from the apple data. Differences between the methods decrease with increasing sample sizes, but even with the largest number of objects (200 in each group) the *t*-test still performs as well as PLS-DA. Multivariate testing is slightly more effective for the positive ion mode, while the *t*-test shows a slight advantage for the negative ion mode. This behaviour is probably due to the different characteristics of both ionization modes, leading to different levels of correlation

20 40 60 80 100 PLSDA neg

Metabolic Biomarker Identification with Few Samples 153

4 LV 6 LV

20 40 60 80 100

PCLDA pos

PCLDA neg

8 LV

0.2 0.4 0.6 0.8

pAUC(.2)

random selection

data.

0.2 0.4 0.6 0.8

20 40 60 80 100

the multivariate methods to the *t*-test and to "random" selection.

● ● ● ●

● ● ● ● títest neg

títest 2 LV

títest pos PLSDA pos

Fig. 6. pAUC.2 for PLS-DA, PC-LDA and *t*-test as a function of the number of samples per class and the number of LVs. The gray dashed line indicates the pAUC.2 of random selection.

the number of components, as an increase of the number of LVs leads to a better "coverage" of the data space. These limitations do not affect PLS-DA, as the variable reduction step is already performed in a supervised framework, where discriminating power is the main request. This means that the first PLS components are by definition more relevant than the first PCA components in biomarker identification. The other side of the coin is the danger of overfitting, very real in the application of PLS-DA (Westerhuis et al., 2008) – we will come back to this point later.

In this small-sample set, the *t*-test does as well as the best multivariate methods. This shows that modeling the correlation structure is not necessarily an advantage if the number of samples is low, or, alternatively, that the true correlation structure has not been captured well enough from the few samples that are available to allow meaningful inference. A definite advantage of the *t*-test is that it has no tunable parameters and can be applied without further optimization. It should be noted that we do not need to apply multiple-testing corrections in this context since we only use the order of the absolute size of the *t*-statistics to construct the ROC curves, and not a specific cut-off level *α*. In other applications, however, this aspect should be taken into account.

To extend the comparison between the different models beyond the limits imposed by the apple experiment, ROC curves and early AUC values were calculated for the simulated data using larger sample sizes (10, 25, 50 and 200 samples per class), for both positive and negative ion modes. The dependence of pAUC.2 on the number of replicates and on the number of components is presented in Figure 7, comparing the multivariate methods to the *t*-test and to "random" selection.

Fig. 7. pAUC.2 for PLS-DA, PC-LDA and *t*-test as a function of the number of samples per class and the number of LVs. Simulated data set. The gray dashed line indicates the pAUC.2 of random selection.
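How pAUC.2 is obtained from a variable ranking can be sketched as follows, given that the true biomarkers are known, as in a spike-in experiment. This is an illustrative Python sketch with a toy ranking; conventions for normalizing partial AUC vary, so the chapter's exact scaling may differ. Each ranked variable that is a true biomarker raises the true positive rate, every other variable raises the false positive rate, and the area is accumulated up to FPR = 0.2.

```python
def pauc(ranking, true_pos, fpr_max=0.2):
    """Partial area under the ROC curve of a ranked variable list.

    ranking: variable indices ordered from most to least promising.
    true_pos: set of indices of the known (spiked-in) biomarkers.
    Returns the unnormalized area for FPR in [0, fpr_max].
    """
    P = len(true_pos)
    N = len(ranking) - P
    tpr = fpr = area = 0.0
    for idx in ranking:
        if idx in true_pos:
            tpr += 1.0 / P            # vertical step: no area added
        else:
            step = 1.0 / N            # horizontal step: accumulate area
            if fpr + step > fpr_max:  # clip the final step at fpr_max
                area += tpr * (fpr_max - fpr)
                return area
            area += tpr * step
            fpr += step
    return area

# Perfect ranking: both biomarkers precede all noise variables,
# so the TPR is already 1 over the whole [0, 0.2] FPR interval.
ranking = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(pauc(ranking, {0, 1}))  # → 0.2
```

A ranking that buries the true biomarkers below all noise variables gives a pAUC of 0, which is the floor against which "random" selection should be judged.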

This analysis shows that PC-LDA only becomes effective when a large number of LVs is considered: the true biomarkers should have appreciable weight in the latent variables, and it is by no means certain that this is the case for the first couple of LVs. It is worth noting that for the negative ion mode, the model with 2 LVs is comparable to random selection. In the case of PLS-DA, the dependence on the number of LVs is less evident and shows the opposite trend: the best performance is obtained with the smallest number of LVs. This is in agreement with the explanation given earlier: the relevant variables are captured in the very first PLS components, and the effect of overtraining leads to deterioration when more components are added. If anything, it is surprising that the overtraining effect is relatively small for these data.
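The different behaviour of the first components can be made concrete with a small numerical sketch (toy data of our own construction, not the chapter's apple data). For a single centered response y, the first PLS weight vector is proportional to X^T y, whereas the first PCA loading is the dominant eigenvector of X^T X and is therefore attracted to high-variance variables regardless of class.

```python
def centered(M):
    """Column-center a data matrix given as a list of rows."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in M]

def first_pls_weight(X, y):
    """First PLS weight vector: proportional to X^T y (X, y centered)."""
    return [sum(X[i][j] * y[i] for i in range(len(X)))
            for j in range(len(X[0]))]

def first_pc_loading(X, n_iter=200):
    """Dominant eigenvector of X^T X via power iteration."""
    p = len(X[0])
    v = [1.0] * p
    for _ in range(n_iter):
        Xv = [sum(X[i][j] * v[j] for j in range(p)) for i in range(len(X))]
        w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy data: variable 0 has a large variance unrelated to class,
# variable 1 has a small variance but separates the classes.
X = [[10.0, 1.0], [-9.0, 1.1], [-1.0, 0.9],
     [11.0, 2.0], [-11.0, 2.1], [0.0, 1.9]]
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]  # class labels, already centered
Xc = centered(X)
w = first_pls_weight(Xc, y)
v = first_pc_loading(Xc)
print(max(range(2), key=lambda j: abs(w[j])))  # → 1: PLS picks the discriminating variable
print(max(range(2), key=lambda j: abs(v[j])))  # → 0: PCA picks the high-variance variable
```

This is the mechanism behind the observations above: a true biomarker only enters PC-LDA once enough components are used to "cover" its direction, while PLS puts it in the first component by construction, at the price of overfitting risk.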

The results on the simulated data sets are in agreement with the conclusions from the apple data. Differences between the methods decrease with increasing sample size, but even with the largest number of objects (200 in each group) the *t*-test still performs as well as PLS-DA. Multivariate testing is slightly more effective for the positive ion mode, while the *t*-test shows a slight advantage for the negative ion mode. This behaviour is probably due to the different characteristics of the two ionization modes, leading to different levels of correlation between biomarkers. Indeed, in positive ion mode the ionization shows a more pronounced fragmentation (phloridzine, for example, gives rise to four different biomarkers).

#### **4. Conclusions**

In this paper we have investigated the effects of sample set size on the performance of some popular strategies for biomarker identification (PLS-DA, PC-LDA and the *t*-test). The experiments were performed on a spiked metabolomics data set measured in apple extracts by UPLC-QTOF. The efficiency of the different statistical approaches was compared in terms of ROC curves, and, in order to assess general trends, simulated data were used to extend the data set. The experimental results clearly show that Linear Discriminant Analysis carried out on the Principal Components (PC-LDA) is the least efficient strategy for biomarker identification among the ones we considered. PLS-DA and the *t*-test show comparable performance under all the conditions considered. These results, and the observation that PLS-DA-based selection is relatively consistent for different numbers of components, indicate that multivariate and univariate approaches are equally efficient for the apple data set. It is perhaps surprising that relatively good results in terms of biomarker selection are obtained even for models that have very poor predictive performance. One should realise, however, that this is not a paradox at all: it is merely the result of the low sample-to-variable ratio, which leads to chance correlations of metabolite signal intensities with class. The true biomarkers are often present among the most significant variables in, e.g., a PLS-DA model, but so are many false positives, which destroy the predictive power. One recently published approach actually exploits this variability by focusing only on those variables that are *consistently* among the most important variables upon disturbance of the data by jackknifing or bootstrapping (Wehrens et al., 2011).

The main point of this contribution, however, is the relation between data set size and the reliability of biomarker identification. As expected, all methods become less efficient as the number of biological replicates decreases, but even under these conditions PLS-DA and the *t*-test offer effective biomarker identification strategies. This observation is of fundamental importance in all studies where it is impossible to acquire more samples, and suggests that small sample sizes can still allow reliable selection of biomarkers.

#### **5. References**

Araki, Y., Yoshikawa, K., Okamoto, S., Sumitomo, M., Maruwaka, M. & Wakabayashi, T. (2010). Identification of novel biomarker candidates by proteomic analysis of cerebrospinal fluid from patients with moyamoya disease using SELDI-TOF-MS, *BMC Neurology* 10: 112.

Barker, M. & Rayens, W. (2003). Partial least squares for discrimination, *J. Chemom.* 17: 166–173.

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, *J. Royal. Stat. Soc. B* 57: 289–300.

Brown, C. D. & Davis, H. T. (2005). Receiver operating characteristics curves and related decision measures: A tutorial, *Chemom. Intell. Lab. Syst.* 80: 24–38.

Chadeau-Hyam, M., Ebbels, T., Brown, I., Chan, Q., Stamler, J., Huang, C., Daviglus, M., Ueshima, H., Zhao, L., Holmes, E., Nicholson, J., Elliott, P. & Iorio, M. D. (2010). Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification, *J. Proteome Res.* 9(9): 4620–4627.

Guo, Y., Hastie, T. & Tibshirani, R. (2007). Regularized discriminant analysis and its application in microarrays, *Biostatistics* 8: 86–100.

Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M. & Dougherty, E. (2010). Small-sample precision of ROC-related estimates, *Bioinformatics* 26: 822–830.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). *The Elements of Statistical Learning*, Springer Series in Statistics, Springer, New York.

Kemsley, E. K. (1996). Discriminant analysis of high-dimensional data: a comparison of principal components analysis and partial least squares data reduction methods, *Chemom. Intell. Lab. Syst.* 33: 47–61.

McLachlan, G. (2004). *Discriminant Analysis and Statistical Pattern Recognition*, Wiley-Interscience.

Noble, W. S. (2009). How does multiple testing correction work?, *Nat. Biotechnol.* 27: 1135–1137.

Oh, J., Craft, J., Townsend, R., Deasy, J., Bradley, J. & Naqa, I. E. (2011). A bioinformatics approach for biomarker identification in radiation-induced lung inflammation from limited proteomics data, *J. Proteome Res.* 10(3): 1406–1415.

R Development Core Team (2011). *R: A Language and Environment for Statistical Computing*, R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL: *http://www.R-project.org*

Reiner, A., Yekutieli, D. & Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures, *Bioinformatics* 19(3): 368–375.

Saeys, Y., Inza, I. & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics, *Bioinformatics* 23: 2507–2517.

Smit, S., Breemen, M. J. v., Hoefsloot, H. C. J., Aerts, J. M. F. G., Koster, C. G. d. & Smilde, A. K. (2007). Assessing the statistical validity of proteomics based biomarkers, *Anal. Chim. Acta* 592: 210–217.

Smith, C. A., Want, E. J., Tong, G. C., Abagyan, R. & Siuzdak, G. (2006). XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification, *Anal. Chem.* 78: 779–787.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions, *J. R. Statist. Soc. B* 36: 111–147. Including discussion.

Szymanska, E., Saccenti, E., Smilde, A. & Westerhuis, J. (2011). Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies, *Metabolomics*.

Tautenhahn, R., Bottcher, C. & Neumann, S. (2008). Highly sensitive feature detection for high resolution LC/MS, *BMC Bioinformatics* 9: 504.

Tusher, V., Tibshirani, R. & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response, *PNAS* 98: 5116–5121.

Vrhovsek, U., Rigo, A., Tonon, D. & Mattivi, F. (2004). Quantitation of polyphenols in different apple varieties, *J. Agr. Food. Chem.* 52(21): 6532–6538.

Wehrens, R. & Franceschi, P. (2011). *BioMark: finding biomarkers in two-class discrimination problems*. R package version 0.3.0.

Wehrens, R., Franceschi, P., Vrhovsek, U. & Mattivi, F. (2011). Stability-based biomarker selection, *Anal. Chim. Acta* 705: 15–23.

Werf, M. J. v. d., Pieterse, B., Luijk, N. v., Schuren, F., Vat, B. v. d. W.-v. d., Overkamp, K. & Jellema, R. H. (2006). Multivariate analysis of microarray data by principal component discriminant analysis: prioritizing relevant transcripts linked to the degradation of different carbohydrates in Pseudomonas putida S12, *Microbiology* 152: 257–272.

Westerhuis, J., Hoefsloot, H., Smit, S., Vis, D. J., Smilde, A. K., van Velzen, E., van Duijnhoven, J. & van Dorsten, F. A. (2008). Assessment of PLSDA cross validation, *Metabolomics* 4: 81–89.

Yousef, M., Ketany, M., Manevitz, L., Showe, L. & Showe, M. (2009). Classification and biomarker identification using gene network modules and support vector machines, *BMC Bioinformatics* 10: 337.

Zuber, V. & Strimmer, K. (2009). Gene ranking and biomarker discovery under correlation, *Bioinformatics* 25: 2700–2707.

### **Kinetic Analyses of Enzyme Reaction Curves with New Integrated Rate Equations and Applications**

Xiaolan Yang, Gaobo Long, Hua Zhao and Fei Liao\* *College of Laboratory Medicine, Chongqing Medical University, Chongqing, China*

#### **1. Introduction**

A reaction system of a Michaelis-Menten enzyme acting on a single substrate can be characterized by the initial substrate concentration before enzyme action (*S*0), the maximal reaction rate (*V*m) and the Michaelis-Menten constant (*K*m), besides some other required parameters. The estimates of *S*0, *V*m and *K*m can be used to measure enzyme substrates, enzyme activities, epitopes or haptens (enzyme immunoassay), irreversible inhibitors and so on. During an enzyme reaction, the changes in substrate or product concentrations can be monitored; continuous monitoring of such changes provides a reaction curve, while discontinuous monitoring provides signals just for the starting point and the terminating point of the enzyme reaction. It is an end-point method when only the signals for the starting point and the terminating point are analyzed. It is a kinetic method when a range of data from a reaction curve is analyzed; kinetic methods can be classified into the initial rate method and kinetic analysis of reaction curve. The initial rate method only analyzes data from the initial phase of the reaction, whose instantaneous rates are constant; kinetic analysis of reaction curve analyzes data whose instantaneous rates show obvious deviations from the initial rate (Bergmeyer, 1983; Guilbault, 1976; Marangoni, 2003). To estimate the parameters of an enzyme reaction system, kinetic analysis of reaction curve is favoured because the analysis of one reaction curve can concomitantly provide *V*m, *S*0 and *K*m. Hence, methods for kinetic analysis of reaction curve to estimate parameters of enzyme reaction systems are widely studied.
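The distinction between the end-point and initial rate methods can be sketched numerically. The example below uses hypothetical parameter values and assumes the first-order limit *S*0 << *K*m, so that the product curve has a closed form; it reads *S*0 from the final signal and estimates the initial rate from a linear fit of the earliest points.

```python
import math

# First-order limit of Michaelis-Menten kinetics (S0 << Km), so that
# dS/dt = -(Vm/Km)*S and product P(t) = S0*(1 - exp(-k*t)) with k = Vm/Km.
Vm, Km, S0 = 2.0, 100.0, 5.0                 # hypothetical values
k = Vm / Km
times = [2.0 * i for i in range(151)]        # monitor for 300 s
P = [S0 * (1.0 - math.exp(-k * t)) for t in times]

# End-point method: only the first and last signals are analyzed; after
# a long enough reaction the final signal approaches S0 directly.
S0_endpoint = P[-1] - P[0]

# Initial rate method: least-squares slope over the earliest, quasi-linear
# points; the slope approximates the initial rate v0 = k * S0.
early_t, early_P = times[:4], P[:4]
n = len(early_t)
mt, mp = sum(early_t) / n, sum(early_P) / n
slope = (sum((t - mt) * (p - mp) for t, p in zip(early_t, early_P))
         / sum((t - mt) ** 2 for t in early_t))

print(S0_endpoint)  # slightly below the true S0 = 5.0 (reaction not complete)
print(slope)        # slightly below the true v0 = k*S0 = 0.1 (curve already bends)
```

Both estimates are systematically biased: the end-point signal undershoots *S*0 unless the reaction runs to completion, and the fitted "initial" rate undershoots v0 because the curve is already bending. This is part of the motivation for fitting the whole reaction curve instead.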

An enzyme reaction curve is a function of dependent variables, which are proportional to the concentrations of a substrate or product, with respect to reaction time as the predictor variable. In general, there are two types of enzyme reaction curves. The first type involves the action of just one enzyme, and employs either a selective substrate to detect the activity of one enzyme of interest or a specific enzyme to act on a unique substrate of interest. The second type involves the actions of at least two enzymes, and requires at least one auxiliary enzyme as a tool to continuously monitor the reaction curve; this second type is an enzyme-coupled reaction system. For kinetic analysis of reaction curve, there are many reports on one-enzyme reaction systems, but just a few reports on enzyme-coupled reaction systems (Atkins & Nimmo, 1973; Liao, et al., 2005; Duggleby, 1983, 1985, 1994; Walsh, 2010).

<sup>\*</sup> Corresponding Author

Kinetic Analyses of Enzyme Reaction Curves

Orsi & Tipton, 1979).

practiced.

with New Integrated Rate Equations and Applications 159

and substrates has more absorbing advantages (Liu, et al., 2009; Yang, et al., 2010); such integration strategies can be applied to enzyme-coupled reaction systems and enzymes sufferring inhibition by substrates/products. Herein, we discuss chemometrics for both kinetic analysis of reaction curve and its integration with other methods, and demonstrate their applications to quantify enzyme initial rates and substrates with some typical enzymes.

**2. Kinetic analysis of enzyme reaction curve: chemometrics and application**  To estimate parameters by kinetic analysis of reaction curve, the desired parameters are included in a set of parameters for the best fitting. Regardless of the number of enzymes involved in a reaction curve, there are the following two approaches for kinetic analysis of reaction curve based on different ways to realize NLSF and their data transformation.

In the first approach, with a differential or integrated rate equation, a series of dependent variables are derived from data in a reaction curve with each set of preset parameters. Such dependent variables should follow a predetermined response to predictor variables that are either reaction time or data transformed from those in the reaction curve. The goodness of the predetermined response is the criterion for the best fitting. In this approach, NLSF is realized with a model for data transformed from a reaction curve (Burguillo, 1983; Cornish-Bowden, 1995; Liao, 2005; Liao, et al., 2003a, 2003b, 2005a, 2005b;

In the second approach, reaction curves are calculated with sets of preset parameters by iterative numerical integration from a preset staring point. Such calculated reaction curves are fit to a reaction curve of interest; the least sum of residual squares indicates the best fitting (Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998; Yang, et al., 2010). In this approach, calculated reaction curves still utilize reaction time as the predictor variable and become discrete at the same intervals as the reaction curve of interest. Clearly,

With any enzyme, iterative numerical integration of the differential rate equation(s) from a starting point with sets of preset parameters can be universally applicable regardless of the complexity of the kinetics. Thus, the second approach exhibits better universality and there are few technical challenges to kinetic analysis of reaction curve *via* NLSF. In fact, however, the second approach is occasionally utilized while the first approach is widely

In the following subsections, the differential rate equation of simple Michaelis-Menten kinetics on single substrate is integrated; the prerequisites for kinetic analysis of reaction curve with integrated rate equations, kinetic analysis of enzyme-coupled reaction curve, the integrations of kinetic analysis of reaction curve with other methods, and the applications of

Assigning instantaneous substrate concentration to *S*, instantaneous reaction time to *t*, steady-state kinetics of Michaelis-Menten enzyme on single substrate follows Equ.(1).

*dS dt V S K S* ( )( ) m m (1)

there is no transformation of data from a reaction curve in this approach.

such integration strategies to some typical enzymes are discussed.

**2.1 Integrated rate equation for one enzyme on single substrate** 

one enzyme reaction system, but are just a few reports on enzyme-coupled reaction system (Atkins & Nimmo, 1973; Liao, et al., 2005; Duggleby, 1983, 1985, 1994; Walsh, 2010).

In theory, enzyme reactions may tolerate reversibility, the activation/inhibition by substrates/products, and even thermo-inactivation of enzyme. From a mathematic view, it is still feasible to estimate parameters of an enzyme reaction system by kinetic analysis of reaction curve if the roles of all those factors mentioned above are included in a kinetic model (Baywenton, 1986; Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998). However, enzyme kinetics is usually so complex due to the effects of those mentioned factors that there are always some technical challenges for kinetic analysis of reaction curve. Hence, most methods for kinetic analysis of reaction curve are reported for enzymes whose actions suffer alterations by those mentioned factors as few as possible.

In practice, kinetic analysis of reaction curve usually employs nonlinear-least-squarefitting (NLSF) of the differential or integrated rate equation(s) to either the reaction curve *per se* or data set(s) transformed from the reaction curve (Cornish-Bowden, 1995; Duggleby, 1983, 1994; Orsi & Tipton, 1979). The use of NLSF rather than matrix inversion is due to the existence of multiple minima of the sum of residual squares with respect to some nonlinear parameters (Liao, et al., 2003a, 2007a). When a differential rate equation is used, numerical differentiation of data from the reaction curve has to be employed to derive instantsneous reaction rates. In this case, there must be intervals as short as possible to monitor reaction curves (Burden & Faires, 2001; Dagys, 1990; Hasinoff, 1985; Koerber & Fink, 1987). However, the instantaneous reaction rates from reaction curves inherenetly exhibit narrow distribution ranges and large errors; the strategy by numerical differentiation of data in a reaction curve is unfavourable for estimating *V*m and *S*<sup>0</sup> because of their low reliaiblity and unsatisfactory working ranges. On the other hand, when an integrated rate equation of an enzyme reaction is used for kinetic analysis of reaction curve, there is no prerequisites of short intervals to record reaction curves so that automated analyses in parallel can be realized for enhanced performance with a large number of samples. As a result, integrated rate equations of enzymes are widely studied for kinetic analysis of reaction curve to estimate parameters of enzyme reaction systems (Duggleby, 1994;Liao, et al, 2003a, 2005a; Orsi & Tipton, 1979).

Due possibly to the limitation on computation resources, integrated rate equations of enzymes in such methods are usually rearranged into special forms to facilitate NLSF after data transformation (Atkins & Nimmo, 1973; Orsi & Tipton, 1979). In appearance, the uses of different forms of the same integrated rate equation for NLSF to data sets transformed from the same reaction curve can give the same parameters. However, kinetic analysis of reaction curve with rearranged forms of an integrated rate equation always gives parameters with uncertainty too large to have practical roles (Newman, et al, 1974). Therefore, proper forms of an integrated rate equation should be selected carefully for estimating parameters by kinetic analysis of reaction curve.

In the past ten years, our group studied chemometrics for kinetic analysis of reaction curve to estimate parameters of enzyme reaction systems; the following results were found. (a) In terms of reliability and performance for estimating parameters, the use of the integrated rate equations with the predictor variable of reaction time is superior to the use of the integrated rate equations with predictor variables other than reaction time (Liao, et al., 2005a); (b) the integration of kinetic analysis of reaction curve with other methods to quantify initial rates

one enzyme reaction system, but are just a few reports on enzyme-coupled reaction system

In theory, enzyme reactions may tolerate reversibility, the activation/inhibition by substrates/products, and even thermo-inactivation of enzyme. From a mathematic view, it is still feasible to estimate parameters of an enzyme reaction system by kinetic analysis of reaction curve if the roles of all those factors mentioned above are included in a kinetic model (Baywenton, 1986; Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998). However, enzyme kinetics is usually so complex due to the effects of those mentioned factors that there are always some technical challenges for kinetic analysis of reaction curve. Hence, most methods for kinetic analysis of reaction curve are reported for enzymes whose

In practice, kinetic analysis of reaction curve usually employs nonlinear-least-squarefitting (NLSF) of the differential or integrated rate equation(s) to either the reaction curve *per se* or data set(s) transformed from the reaction curve (Cornish-Bowden, 1995; Duggleby, 1983, 1994; Orsi & Tipton, 1979). The use of NLSF rather than matrix inversion is due to the existence of multiple minima of the sum of residual squares with respect to some nonlinear parameters (Liao, et al., 2003a, 2007a). When a differential rate equation is used, numerical differentiation of data from the reaction curve has to be employed to derive instantsneous reaction rates. In this case, there must be intervals as short as possible to monitor reaction curves (Burden & Faires, 2001; Dagys, 1990; Hasinoff, 1985; Koerber & Fink, 1987). However, the instantaneous reaction rates from reaction curves inherenetly exhibit narrow distribution ranges and large errors; the strategy by numerical differentiation of data in a reaction curve is unfavourable for estimating *V*m and *S*<sup>0</sup> because of their low reliaiblity and unsatisfactory working ranges. On the other hand, when an integrated rate equation of an enzyme reaction is used for kinetic analysis of reaction curve, there is no prerequisites of short intervals to record reaction curves so that automated analyses in parallel can be realized for enhanced performance with a large number of samples. As a result, integrated rate equations of enzymes are widely studied for kinetic analysis of reaction curve to estimate parameters of enzyme reaction systems

Due possibly to the limitation on computation resources, integrated rate equations of enzymes in such methods are usually rearranged into special forms to facilitate NLSF after data transformation (Atkins & Nimmo, 1973; Orsi & Tipton, 1979). In appearance, the uses of different forms of the same integrated rate equation for NLSF to data sets transformed from the same reaction curve can give the same parameters. However, kinetic analysis of reaction curve with rearranged forms of an integrated rate equation always gives parameters with uncertainty too large to have practical roles (Newman, et al, 1974). Therefore, proper forms of an integrated rate equation should be selected carefully for

In the past ten years, our group studied chemometrics for kinetic analysis of reaction curve to estimate parameters of enzyme reaction systems; the following results were found. (a) In terms of reliability and performance for estimating parameters, the use of the integrated rate equations with the predictor variable of reaction time is superior to the use of the integrated rate equations with predictor variables other than reaction time (Liao, et al., 2005a); (b) the integration of kinetic analysis of reaction curve with other methods to quantify initial rates

(Atkins & Nimmo, 1973; Liao, et al., 2005; Duggleby, 1983, 1985, 1994; Walsh, 2010).

actions suffer alterations by those mentioned factors as few as possible.

(Duggleby, 1994;Liao, et al, 2003a, 2005a; Orsi & Tipton, 1979).

estimating parameters by kinetic analysis of reaction curve.

and substrates has more absorbing advantages (Liu, et al., 2009; Yang, et al., 2010); such integration strategies can be applied to enzyme-coupled reaction systems and enzymes sufferring inhibition by substrates/products. Herein, we discuss chemometrics for both kinetic analysis of reaction curve and its integration with other methods, and demonstrate their applications to quantify enzyme initial rates and substrates with some typical enzymes.

### **2. Kinetic analysis of enzyme reaction curve: chemometrics and application**

To estimate parameters by kinetic analysis of reaction curve, the desired parameters are included in the set of parameters adjusted for the best fitting. Regardless of the number of enzymes involved in a reaction curve, there are two approaches to kinetic analysis of reaction curve, distinguished by how NLSF is realized and how the data are transformed.

In the first approach, with a differential or integrated rate equation, a series of dependent variables are derived from data in a reaction curve with each set of preset parameters. Such dependent variables should follow a predetermined response to predictor variables that are either reaction time or data transformed from those in the reaction curve. The goodness of the predetermined response is the criterion for the best fitting. In this approach, NLSF is realized with a model for data transformed from a reaction curve (Burguillo, 1983; Cornish-Bowden, 1995; Liao, 2005; Liao, et al., 2003a, 2003b, 2005a, 2005b; Orsi & Tipton, 1979).

In the second approach, reaction curves are calculated with sets of preset parameters by iterative numerical integration from a preset starting point. Such calculated reaction curves are fitted to a reaction curve of interest; the least sum of residual squares indicates the best fitting (Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998; Yang, et al., 2010). In this approach, calculated reaction curves still utilize reaction time as the predictor variable and are discretized at the same intervals as the reaction curve of interest. Clearly, there is no transformation of data from a reaction curve in this approach.

With any enzyme, iterative numerical integration of the differential rate equation(s) from a starting point with sets of preset parameters can be universally applicable regardless of the complexity of the kinetics. Thus, the second approach exhibits better universality and there are few technical challenges to kinetic analysis of reaction curve *via* NLSF. In fact, however, the second approach is occasionally utilized while the first approach is widely practiced.
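The second approach can be sketched in a few lines. The following is a minimal illustration (not the authors' code), assuming simple Michaelis-Menten kinetics and arbitrary parameter values: a reaction curve is generated by numerical integration of Equ.(1) and fitted to a noisy synthetic "observed" curve by least squares.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def rhs(t, S, Vm, Km):
    # Equ.(1): -dS/dt = Vm*S/(Km + S)
    return [-Vm * S[0] / (Km + S[0])]

def simulate(Vm, Km, S0, t_obs):
    # numerically integrate the rate equation from the preset starting point S0
    sol = solve_ivp(rhs, (t_obs[0], t_obs[-1]), [S0], args=(Vm, Km),
                    t_eval=t_obs, rtol=1e-8, atol=1e-10)
    return sol.y[0]

# synthetic "observed" curve (assumed values: Vm = 1.0, Km = 50, S0 = 80,
# arbitrary units) plus random noise
rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 120.0, 61)
S_obs = simulate(1.0, 50.0, 80.0, t_obs) + rng.normal(0.0, 0.3, t_obs.size)

# fit Vm, Km and S0 by minimizing the sum of residual squares between the
# calculated and observed curves
fit = least_squares(lambda p: simulate(p[0], p[1], p[2], t_obs) - S_obs,
                    x0=[0.5, 30.0, 70.0], bounds=([1e-6] * 3, np.inf))
Vm_hat, Km_hat, S0_hat = fit.x
```

As the text notes, this route needs no data transformation and extends to kinetics whose integrated rate equations are inaccessible, at the cost of repeated numerical integration inside the fit.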

In the following subsections, the differential rate equation of simple Michaelis-Menten kinetics on a single substrate is integrated; then the prerequisites for kinetic analysis of reaction curve with integrated rate equations, kinetic analysis of enzyme-coupled reaction curve, the integration of kinetic analysis of reaction curve with other methods, and the applications of such integration strategies to some typical enzymes are discussed.

#### **2.1 Integrated rate equation for one enzyme on single substrate**

Assigning the instantaneous substrate concentration to *S* and the instantaneous reaction time to *t*, the steady-state kinetics of a Michaelis-Menten enzyme on a single substrate follows Equ.(1).

$$-dS/dt = (V_{\rm m} \times S)/(K_{\rm m} + S) \tag{1}$$

Kinetic Analyses of Enzyme Reaction Curves with New Integrated Rate Equations and Applications


Assigning the substrate concentration at the first point for analysis to *S*1, Equ.(1) is integrated into Equ.(2) when the enzyme is stable, the substrate and product do not alter the intrinsic activity of the enzyme, and the reaction is irreversible (Atkins & Nimmo, 1973; Marangoni, 2003; Orsi & Tipton, 1979; Zou & Zhu, 1997). In Equ.(2), *t*lag accounts for the lag time of the steady-state reaction. After transformation of the data in a reaction curve according to Equ.(3), the left part of Equ.(2) should respond linearly to reaction time, as in Equ.(4); the goodness of this linear response is judged by regression analysis. However, to estimate parameters by kinetic analysis of reaction curve *via* NLSF, the following general prerequisites apply to Equ.(2) or any of its equivalents.

$$(S_1 - S)/K_{\rm m} + \ln(S_1/S) = (V_{\rm m}/K_{\rm m}) \times (t - t_{\rm lag}) \tag{2}$$

$$y = (S_1 - S)/K_{\rm m} + \ln(S_1/S) \tag{3}$$

$$y = a + b \times t \tag{4}$$
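As a minimal numerical illustration of this transformation (not from the source, with arbitrary parameter values): substrate data generated exactly from Equ.(2) with *t*lag = 0 are transformed by Equ.(3), and an ordinary linear fit of Equ.(4) recovers the slope *b* = *V*m/*K*m.

```python
import numpy as np

Vm, Km, S1 = 1.0, 50.0, 80.0          # assumed values, arbitrary units
t = np.linspace(0.0, 100.0, 51)

def S_of_t(ti):
    # solve the implicit Equ.(2) for S by bisection; the left part of
    # Equ.(2) increases as S decreases, so the invariant below holds
    target = (Vm / Km) * ti
    lo, hi = 1e-9, S1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (S1 - mid) / Km + np.log(S1 / mid) > target:
            lo = mid                  # value too large: true S is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

S = np.array([S_of_t(ti) for ti in t])
y = (S1 - S) / Km + np.log(S1 / S)    # Equ.(3) transformation
b, a = np.polyfit(t, y, 1)            # Equ.(4): y = a + b*t
Vm_over_Km = b                        # slope recovers Vm/Km = 0.02 here
```

With noise-free data the fit is essentially exact; with real data the quality of this linear response is what the regression analysis judges.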

The first prerequisite is that the enzyme reaction should apparently follow kinetics on a single substrate. For enzyme reactions with multiple substrates whose concentrations all change during reaction, kinetic analysis of reaction curve always gives parameters of too low reliability to have practical roles, no matter what methods are used for NLSF (data unpublished). In our experience estimating parameters by kinetic analyses of reaction curves, any substrate at levels below 10% of its *K*m can be considered negligible; the use of one substrate at levels below 10% of those of the other substrates can make enzyme reactions follow single-substrate kinetics (Liao, et al., 2001, 2003a, 2003b; Li et al., 2011; Zhao et al., 2006). For any enzyme on multiple substrates, therefore, there are two approaches to make it apparently follow kinetics on a single substrate. The first is the use of the substrate of interest at levels below 10% of those of the other substrates; this approach has universal applicability to common enzymes such as hydrolases in aqueous buffers and oxidases in air-saturated buffers. The second is the utilization of special reaction systems to regenerate the substrate of the enzyme of interest by the actions of auxiliary enzymes; this approach usually yields enzyme-coupled reaction curves of complicated kinetics.

The second prerequisite is that the enzyme reaction should be irreversible. In theory, the estimation of parameters by kinetic analysis of reaction curve is still feasible when reaction reversibility is considered, but the estimated parameters possess too low reliability to have practical roles (data unpublished). Generally, a preparation of a substance with contaminants below 1% in mass content can be taken as a pure substance; accordingly, a reagent leftover after reaction accounting for less than 1% of its level before reaction can be neglected. For convenience, therefore, an enzyme reaction is considered irreversible when the leftover level of the substrate of interest at equilibrium is much less than 1% of its initial one. To promote the consumption of the substrate of interest, the concentrations of the other substrates should be preset at levels well over 10 times the initial level of the substrate of interest; in this case, the enzyme reaction is apparently irreversible and follows kinetics on a single substrate. Alternatively, scavenging reactions that remove products can drive the reaction forward. The concurrent use of both approaches is usually better.

The third prerequisite is that there should be steady-state data for analysis (Atkins & Nimmo, 1973; Dixon & Webb, 1979; Liao, et al, 2005a; Marangoni, 2003; Orsi & Tipton, 1979).


For this prerequisite, the first and the last points of data in a reaction curve for analysis should be carefully selected. The first point should exclude data within the lag time of the steady-state reaction. The last point should ensure that the data for analysis have substrate concentrations high enough for steady-state reaction; namely, substrate concentrations should be much higher than the concentration of the active sites of the enzyme (Dixon & Webb, 1979). The use of special weighting functions for NLSF can mitigate the contributions of residual squares at low substrate levels that may preclude steady-state reaction.

The fourth prerequisite is that the enzyme should be stable for Equ.(2) to be valid, or else the inactivation kinetics of the enzyme should be included in the kinetic model. Enzyme stability should be checked before kinetic analysis of reaction curve. When the inactivation kinetics of an enzyme is included in a kinetic model for kinetic analysis of reaction curve, the integrated rate equation is usually quite complex, or even inaccessible if the inactivation kinetics is too complex. For kinetic analysis of reaction curve of complicated kinetics, numerical integration to produce calculated reaction curves for NLSF to a reaction curve of interest, instead of NLSF with Equ.(4), can be used to estimate parameters (Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998; Yang, et al., 2010).

The fifth prerequisite is that there should be negligible inhibition/activation of activity of an enzyme by products/substrates, or else such inhibition/activation on the activity of the enzyme by its substrate/product should be included in an integrated rate equation for kinetic analysis of reaction curve (Zhao, L.N., et al., 2006). For validating Equ.(2), any substrate that alters enzyme activity should be preset at levels low enough to cause negligible alterations; any product that alters enzyme activity can be scavenged by proper reactions. When such alterations are complex, numerical integration of differential rate equations for NLSF to a reaction curve of interest can be used (Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998).

Obviously, the first three prerequisites are mandatory for the inherent reliability of parameters estimated by kinetic analysis of reaction curve; the latter two prerequisites are required for the validity of Equ.(2) or its equivalents for kinetic analysis of reaction curve.

#### **2.2 Realization of NLSF and limitation on parameter estimation**

To estimate parameters by kinetic analysis of reaction curve based on NLSF, the main concerns are satisfying the prerequisites for the quality of the data under analysis, the procedure to realize NLSF, and the reliability of the parameters estimated thereby.

For the estimation of parameters by kinetic analysis of reaction curve, there are two general prerequisites for the quality of the data under analysis: (a) there should be a minimum number of effective data, whose changes in signals are over three times the random error; (b) there should be a minimum consumption percentage of the substrate within such effective data. In general, at least two parameters such as *V*m and *S*0 are estimated, and the minimum number of effective data should be no less than 7 (Atkins & Nimmo, 1973; Baywenton, 1986; Miller, J. C. & Miller, J. N., 1993). The minimum consumption percentage of the substrate can be about 40% if only *V*m and *S*0 are estimated while the other parameters are fixed as constants. In general, the estimation of more parameters requires higher consumption percentages of the substrate in the effective data for analysis.
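These two data-quality checks are easy to automate. The sketch below is a hypothetical helper (the name and defaults are ours, following the figures above), not a published procedure:

```python
import numpy as np

def data_quality_ok(S, noise_sd, min_points=7, min_consumption=0.40):
    # (a) count effective data: signal change from the start over 3x the
    #     random error of the measurements
    S = np.asarray(S, dtype=float)
    effective = np.abs(S - S[0]) > 3.0 * noise_sd
    # (b) consumption percentage of the substrate over the analyzed window
    consumed = (S[0] - S.min()) / S[0]
    return bool(effective.sum() >= min_points) and consumed >= min_consumption
```

For example, a curve falling from 80 to 40 units in 20 points with a noise level of 0.3 passes both checks, while a flat curve fails the effective-data count.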



With a valid Equ.(2), data in a reaction curve can be transformed according to Equ.(3) to realize NLSF with Equ.(4). The use of Equ.(4) for NLSF needs no special treatment of the unknown *t*lag. With any method to continuously monitor a reaction curve, there may be an unknown but constant background in the signals (Newman, et al., 1974; Liao, et al., 2003a, 2005a; Yang, et al., 2010). The background in the signal for *S*1 in Equ.(2) is therefore better treated as a nonlinear parameter to realize NLSF; this is what makes the fitting truly nonlinear, but it increases the computation burden, and for this reason a rearranged form of Equ.(2) has been suggested for kinetic analysis of reaction curve (Atkins & Nimmo, 1973; Liao, et al., 2005a).

In theory, Equ.(2) can be rearranged into Equ.(5) as a linear function of *V*m and *K*m. In Equ.(5), the instantaneous reaction time at the moment for *S*1 is preset to zero so that no treatment of *t*lag is needed. When the signal for *S*1 is not treated as a nonlinear parameter, kinetic analysis of reaction curve by fitting with Equ.(5) can be finished within 1 s on a pocket calculator. However, parameters estimated with Equ.(5) always have such large errors that Equ.(5) is scarcely practiced in biomedical analyses. Hence, the proper form of an integrated rate equation should be selected carefully after validation.

$$(S_1 - S)/(t - t_{\rm lag}) = V_{\rm m} - K_{\rm m} \times \ln(S_1/S)/(t - t_{\rm lag}) \tag{5}$$

In principle, to reliably estimate parameters by NLSF, the distribution ranges of both the dependent variables and the predictor variables in any kinetic model should be as wide as possible while their random errors should be as small as possible (Baywenton, 1986; del Rio, et al., 2001; Draper & Smith, 1998; Miller, J. C. & Miller, J. N., 1993). In serial studies with common enzymes, we found that the use of Equ.(4), or similar forms of integrated rate equations with reaction time as the predictor variable, for kinetic analysis of reaction curve could give reliable *V*m and *S*0 when *K*m was fixed at a constant after optimization (Liao, 2005; Liao, et al., 2001, 2003a, 2003b, 2005a, 2005b, 2006, 2007b; Zhao, Y.S., et al., 2006, 2009). Reaction time as the predictor variable has the widest distribution and the smallest random errors in comparison to the predictor variable in Equ.(5), and the left part of Equ.(4) also possesses a wider distribution range. Such differences in predictor and dependent variables should account for the different reliability of parameters estimated with Equ.(2) and Equ.(5); thus an integrated rate equation with reaction time as the predictor variable may be the proper form for kinetic analysis of reaction curve. When NLSF with Equ.(4) was realized with *S*1 as a nonlinear parameter, computation took nearly 10 s on a personal computer with a Celeron 300A CPU. Currently, computation resources are no longer a problem, and Equ.(4) or its equivalent equations should always be adopted.

The selection of a weighting factor for kinetic analysis of reaction curve is also a concern. Based on error propagation and the principle of weighted NLSF with *y* defined in Equ.(3), the squares of instantaneous rates can serve as the weighting factors (*W*f) with Equ.(4) for NLSF to obtain the weighted sum of residual squares (*Q*), as described in Equ.(6), Equ.(7) and Equ.(8) (Baywenton, 1986; Draper & Smith, 1998; Gutierrez & Danielson, 2006; Miller, J. C. & Miller, J. N., 1993). The use of a weighting function like Equ.(7) can mitigate the effects of errors in substrate or product concentrations near the completion of the reaction. The resistance of an estimated parameter (variation within 3% in our studies) to reasonable changes in the data ranges for analysis can be a criterion to judge the reliability of the estimated parameter.

$$\partial y/\partial S = -(K_{\rm m} + S)/(K_{\rm m} \times S) \tag{6}$$


$$W_{\rm f} = \partial S/\partial y = -K_{\rm m} \times S/(K_{\rm m} + S) \tag{7}$$

$$Q = \sum W_{\rm f}^2 \times (y_{\rm predicted} - y_{\rm calculated})^2 \tag{8}$$
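A brief sketch of this weighting scheme (our illustration, with assumed values and a crude Euler-generated curve): since `numpy.polyfit` applies its `w` argument to the unsquared residuals, passing the magnitude of *W*f from Equ.(7) minimizes exactly the weighted sum *Q* of Equ.(8).

```python
import numpy as np

def progress(Vm, Km, S0, t):
    # crude explicit-Euler integration of Equ.(1), adequate for illustration
    S = np.empty_like(t)
    S[0] = S0
    for i in range(1, t.size):
        S[i] = S[i - 1] - Vm * S[i - 1] / (Km + S[i - 1]) * (t[i] - t[i - 1])
    return S

Vm_true, Km, S0 = 1.0, 50.0, 80.0        # assumed values, arbitrary units
t = np.linspace(0.0, 100.0, 51)
rng = np.random.default_rng(1)
S_obs = np.clip(progress(Vm_true, Km, S0, t) + rng.normal(0.0, 0.3, t.size),
                1e-6, None)              # keep the logarithm defined

S1 = S_obs[0]
y = (S1 - S_obs) / Km + np.log(S1 / S_obs)   # Equ.(3), with Km fixed
Wf = Km * S_obs / (Km + S_obs)               # magnitude of Wf from Equ.(7)
b, a = np.polyfit(t, y, 1, w=Wf)             # minimizes Q of Equ.(8)
Vm_hat = b * Km                              # slope of Equ.(4) = Vm/Km
```

The down-weighting of points near reaction completion mirrors the text: errors in *S* are amplified by the transformation exactly where *S* is small.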

Another concern is which parameters are suitable for estimation by kinetic analysis of reaction curve. In theory, all parameters of an enzyme reaction system can be estimated simultaneously by kinetic analysis of reaction curve. However, unknown covariance among some parameters devalues their reliability; moreover, the accuracy of the original data for analysis is limited, and the estimation of parameters with narrow working ranges has negligible practical roles. *V*m is independent of all other parameters, and so is *S*0; the assays of *V*m and *S*0 are already routinely practiced in biomedical analyses. Therefore, *V*m and *S*0 may be the parameters suitable for estimation by kinetic analysis of reaction curve. Additionally, *K*m is used for screening enzyme mutants and enzyme inhibitors, but *K*m estimated by kinetic analysis of reaction curve usually exhibits lower reliability and is preferably fixed for estimating *V*m and *S*0. If *K*m is estimated as well, *S*1 should be at least 1.5-fold *K*m and there should be more than 85% consumption of the substrate in the data selected for analysis (Atkins & Nimmo, 1973; Liao, et al., 2005a; Newman, et al., 1974; Orsi & Tipton, 1979). To estimate *K*m, the initial datum (*S*1) and its corresponding ending datum from a reaction curve should be tried sequentially until the requirements for the data range are met concurrently; in this case, the estimation of *S*1 has no practical roles. In general, the resistance of *V*m and *S*0 to reasonable changes in the ranges of data for analysis can be a criterion to select the optimized set of parameters that are fixed as constants.

In comparison to the low reliability of estimating *K*m independently for screening enzyme inhibitors and enzyme mutants, the ratio of *V*m to *K*m, as an index of enzyme activity, can be estimated robustly by kinetic analysis of reaction curve. Reversible inhibitors of a Michaelis-Menten enzyme include competitive, noncompetitive, uncompetitive and mixed ones (Bergmeyer, 1983; Dixon & Webb, 1979; Marangoni, 2003). The ratio of *V*m to *K*m responds to the concentrations of common inhibitors except uncompetitive ones, which are very rare in nature; thus, the ratio of *V*m to *K*m can be used for screening common inhibitors. More importantly, the ratio of *V*m to *K*m is an index of the intrinsic activity of an enzyme, and its estimation can also be a promising strategy to screen enzyme mutants of powerful catalytic capacity (Fersht, 1985; Liao, et al., 2001; Northrop, 1983).

For robust estimation of the ratio of *V*m to *K*m of an enzyme, *S*0 can be preset at a value below 10% of *K*m to simplify Equ.(2) into Equ.(9). Steady-state data from a reaction curve can be analyzed after data transformation according to the left part in Equ.(9). For validating Equ.(9), it is proposed that *S*0 should be below 1% of *K*m (Mey1er-Almes & Auer, 2000). The use of extremely low *S*0 requires special methods to monitor enzyme reaction curves and steady-state reaction can not always be achieved with enzymes of low intrinsic catalytic activities. On the other hand, the use of *S*0 below 10% of *K*m is reasonable to estimate the ratio of *V*m to *K*m (Liao, et al., 2001). To estimate the ratio of *V*m to *K*m, the use of Equ.(9) to analyze data is robust and resistant to variations of *S*0 if Equ.(9) is valid; this property makes the estimation of the ratio of *V*m to *K*m for screening reversible inhibitors superior to the estimation of the half-inhibition concentrations (Cheng & Prusoff, 1973).

$$\ln(S_1/S) = a + (V_\text{m}/K_\text{m}) \times t \tag{9}$$
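Equ.(9) reduces the estimation of *V*m/*K*m to a linear fit of transformed steady-state data. A minimal Python sketch of this fit; the monitoring duration, intervals, *S*0 and the true *V*m/*K*m below are made-up illustrative values, not values from the cited work:

```python
import numpy as np

def fit_vm_over_km(t, s):
    """Fit ln(S1/S) = a + (Vm/Km)*t by linear regression; return (a, Vm/Km)."""
    y = np.log(s[0] / s)              # data transformation, left side of Equ.(9)
    slope, intercept = np.polyfit(t, y, 1)
    return intercept, slope

# Simulated noiseless curve with S0 well below Km, where decay is first-order:
# 5 min monitored at 10-s intervals, true Vm/Km = 0.02 s^-1 (assumed values).
t = np.arange(0.0, 300.0, 10.0)
s = 5e-6 * np.exp(-0.02 * t)          # substrate concentration, mol/L

a, ratio = fit_vm_over_km(t, s)       # ratio recovers ~0.02, a ~0
```

With noisy real data the same transformation and fit apply; the robustness to variations of *S*0 noted in the text comes from *S*0 entering only through the intercept.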

Kinetic Analyses of Enzyme Reaction Curves with New Integrated Rate Equations and Applications


Kinetic analysis of reaction curve requires more consideration when the activities of enzymes are altered by their substrates/products. In this case, more parameters can be included in kinetic models similar to Equ.(2), but the optimization of reaction conditions and preset parameters becomes complicated. Based on the principle of kinetic analysis of reaction curve described above, we developed some new integration strategies to successfully quantify enzyme initial rates and substrate with satisfactory performance even when the activities of the enzymes of interest are altered significantly by substrates/products (Li, et al., 2011; Liao, 2007a; Zhao, L.N., et al., 2006).

#### **2.3 Kinetic analysis of enzyme-coupled reaction curve**

When neither substrate nor product is suitable for continuous monitoring of the reaction curve, a tool enzyme can be used to regenerate a substrate or consume a product of the enzyme of interest; the action of the tool enzyme should consume/generate a substrate/product that serves as an indicator suitable for continuous monitoring. Namely, the reaction of the tool enzyme is coupled to the reaction of the enzyme of interest for continuous monitoring of the reaction curve (Bergmeyer, 1983; Guilbault, 1976; Dixon & Webb, 1979). When such enzyme-coupled assays are used to measure initial rates of an enzyme, the linear range is always unsatisfactory because both the activity of the tool enzyme and the concentration of its substrate are limited (Bergmeyer, 1983; Dixon & Webb, 1979). Kinetic analysis of enzyme-coupled reaction curve is expected to effectively enhance the upper limit of linear response. However, the kinetics of an enzyme-coupled reaction system is described by a set of differential rate equations, which makes it difficult to access an integrated rate equation with the predictor variable of reaction time.

In this case, iterative numerical integration to obtain calculated reaction curves for NLSF to a reaction curve of interest can be used (Duggleby, 1983, 1994; Moruno-Davila, et al., 2001; Varon, et al., 1998; Yang, et al., 2010). Lactic dehydrogenase (LDH) is widely used as a tool enzyme for enzyme-coupled assays. The assay of the activity of alanine aminotransferase (ALT) in sera has important biomedical roles and usually employs an LDH-coupled assay. For the LDH-coupled ALT assay, iterative numerical integration of the set of differential rate equations, with each set of preset parameters from a preset starting point, produces a calculated reaction curve; such a calculated reaction curve can be made discrete at the same intervals as the reaction curve of interest and then used for NLSF to the reaction curve of interest.

The process of iterative numerical integration for the LDH-coupled ALT assay is given below (Yang, et al., 2010). In an LDH-coupled ALT reaction system, assigning the instantaneous concentration of NADH to *C*n,i, the instantaneous concentration of pyruvate to *C*p,i, the instantaneous absorbance at 340 nm for NADH to *A*i, the molar absorptivity of NADH to ε, the initial rate of ALT under steady-state reaction to *V*1k, the maximal activity of LDH to *V*m, and the integration step to *Δt*, Equ.(10), Equ.(11) and Equ.(12) describe the iterative integration of the set of differential rate equations. Calculated reaction curves according to Equ.(12) using different sets of preset parameters are made discrete and fit to the reaction curve of interest; the background absorbance at 340 nm (*A*b) is treated as a parameter as well.

$$C_\text{n,i} = (A_\text{i} - A_\text{b})/\varepsilon \tag{10}$$


$$C_\text{p,i+1} = C_\text{p,i} + V_\text{1k} \times \Delta t - \frac{V_\text{m} \times \Delta t}{1 + K_\text{a}/C_\text{n,i} + K_\text{b}/C_\text{p,i} + K_\text{ab}/(C_\text{n,i} \times C_\text{p,i})} \tag{11}$$

$$A_\text{i+1} = A_\text{i} - \frac{\varepsilon \times V_\text{m} \times \Delta t}{1 + K_\text{a}/C_\text{n,i} + K_\text{b}/C_\text{p,i} + K_\text{ab}/(C_\text{n,i} \times C_\text{p,i})} \tag{12}$$
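A minimal Python sketch of this explicit iteration of Equ.(10)-(12). All numerical values below (ε of NADH at 340 nm, *V*1k, *V*m, the *K* constants and the absorbances) are illustrative assumptions, not fitted parameters from the cited work, and the rate law is rewritten algebraically so that it is well defined when *C*p,i = 0:

```python
def simulate_coupled_curve(v1k, vm, ka, kb, kab, a0, ab, eps, dt, n_steps):
    """Iterate Equ.(10)-(12) to compute one candidate LDH-coupled curve.

    v1k: ALT initial rate; vm: maximal LDH activity (both mol/L/s);
    ka, kb, kab: constants of the LDH rate law; a0, ab: initial and
    background absorbance at 340 nm; eps: molar absorptivity of NADH.
    """
    a, cp = a0, 0.0
    curve = [a]
    for _ in range(n_steps):
        cn = (a - ab) / eps                       # Equ.(10): C_n,i from A_i
        # LDH rate Vm / (1 + Ka/Cn + Kb/Cp + Kab/(Cn*Cp)), multiplied through
        # by Cn*Cp so the expression stays finite at Cp = 0:
        v_ldh = vm * cn * cp / (cn * cp + ka * cp + kb * cn + kab)
        cp += (v1k - v_ldh) * dt                  # Equ.(11): pyruvate balance
        a -= eps * v_ldh * dt                     # Equ.(12): NADH consumption
        curve.append(a)
    return curve

# Assumed parameter values for one illustrative 300-s curve (dt = 0.2 s);
# once pyruvate reaches steady state, the LDH rate tracks V1k and the
# absorbance falls almost linearly.
curve = simulate_coupled_curve(v1k=2e-7, vm=1e-5, ka=1e-5, kb=1e-4, kab=1e-9,
                               a0=0.80, ab=0.05, eps=6220.0, dt=0.2,
                               n_steps=1500)
```

In the cited NLSF procedure, many such calculated curves (one per preset parameter set) are made discrete at the recording intervals and fit to the measured curve; the parameter set minimizing the residuals gives the estimated ALT initial rate.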

By simulation, with this new approach for kinetic analysis of enzyme-coupled reaction curve recorded at 1-s intervals, the upper limit of linear response for measuring ALT initial rates is increased to about five times that of the classical initial rate method. The new approach is resistant to reasonable variations in the data range for analysis. By experimentation using sampling intervals of 10 s, the upper limit is about three times that of the classical initial rate method. Therefore, this new approach for kinetic analysis of enzyme-coupled reaction curve is advantageous, and can potentially be a universal approach for kinetic analysis of the reaction curve of any system with much more complicated kinetics.

The computation time for numerical integration is inversely proportional to the integration step *Δt*; a shorter *Δt* is always better for accuracy, but a *Δt* of 0.20 s is sufficient, at low computational cost, for a desirable upper limit of linear response. With a Celeron 300A CPU in a personal computer, this new approach needs about 10 min for just 30 data points in an LDH-coupled reaction curve, but it takes only about 5 s on a Lenovo Notebook S10e. Advances in personal computers can surely promote the practice of this approach.

#### **2.4 Integration of kinetic analysis of reaction curve with other methods**

Any analytical method should have favourable analysis efficiency, a wide linear range, low cost and strong robustness. Kinetic analysis of reaction curve for *V*m and *S*0 assay can have a much higher upper limit of linear response, but inevitably suffers from low analysis efficiency when a wide linear range is required. Based on kinetic analysis of reaction curve, however, our group developed two integration strategies, for enzyme initial rate assay and for substrate assay, respectively, with both favourable analysis efficiency and ideal linear ranges.

#### **2.4.1 New integration strategy for enzyme initial rate assay**

The classical initial rate method to measure enzyme initial rates requires *S*0 much higher than *K*m to have desirable linear ranges (Bergmeyer, 1983; Dixon & Webb, 1979; Guilbault, 1976; Marangoni, 2003). Due to substrate inhibition, limited solubility and other causes, practical substrate levels are always relatively low, and thus the linear ranges of the classical initial rate method are always unsatisfactory (Li, et al., 2011; Morishita, et al., 2000; Stromme & Theodorsen, 1976). As described above, kinetic analysis of reaction curve can measure enzyme *V*m, and many approaches based on kinetic analysis of reaction curve have already been proposed (Cheng, et al., 2008; Claro, 2000; Cornish-Bowden, 1975, 1995; Dagys, et al., 1986, 1990; Duggleby, 1983, 1985, 1994; Hasinoff, 1985; Koerber & Fink, 1987; Liao, et al., 2001; Lu & Fei, 2003; Marangoni, 2003; Walsh, et al., 2010). Such approaches all require a substrate consumption percentage over 40% with *K*m preset as a constant. As a result, an intolerably long reaction duration is needed to monitor reaction curves for samples of low enzyme activities, or else the lower limits of linear response are unfavourable.

The integration of kinetic analysis of reaction curve using proper integrated rate equations with the classical initial rate method gives an integration strategy to measure enzyme initial
rates with expanded linear ranges and practical analysis efficiency. This integration strategy is effective at substrate concentrations from one-eighth of *K*m to three-fold *K*m (Li, et al., 2011; Liao, et al., 2009; Liu, et al., 2009; Yang, et al., 2011). The integration strategy for enzyme initial rate assay uses a special method to convert *V*m into initial rates so that the indexes of enzyme activity by both methods become the same; it is applicable to enzymes suffering strong inhibition by substrates/products (Li, et al., 2011). Walsh et al. proposed an integration strategy to measure enzyme initial rates, but they employed Equ.(9), which requires substrate levels below 10% of *K*m (Walsh, et al., 2010). Our integration strategy is valid at any substrate level satisfying Equ.(2) and hence can be a universal approach for common enzymes of different *K*m. The principles and applications of the integration strategy to one-enzyme reaction systems and enzyme-coupled reaction systems are discussed below.

As for one-enzyme reaction systems, kinetic analysis of reaction curve can be realized with an integrated rate equation after data transformation; the integration strategy for enzyme initial rate assay requires enzyme kinetics on a single substrate and an integrated rate equation with the predictor variable of reaction time (Liao, et al., 2003a, 2005a; Zhao, L.N., et al., 2006). Moreover, the integration strategy should solve the following challenges: (a) there should be an overlapped range of enzyme activities measurable by both methods with consistent results; (b) there should be consistent slopes of the linear response of enzyme activities to enzyme quantities by both methods (Figure 1). After these two challenges are solved, the linear segment of response by the classical initial rate method is an extension of the linear segment of response by kinetic analysis of reaction curve (Liu, et al., 2009).

To solve the first challenge, a practical *S*0 and a reasonable duration to monitor reaction curves, for favourable analysis efficiency, are required as optimized experimental conditions. Mathematical derivation and simulation analyses demonstrate that a ratio of *S*0 to *K*m from 0.5 to 2.5 and a duration of 5.0 min to monitor reaction curves, at intervals no longer than 10 s, can solve the first challenge for most enzymes; any ratio of *S*0 to *K*m smaller than 0.5 or larger than 2.5 requires a longer duration to monitor reaction curves. The use of *S*0 of about one-eighth of *K*m requires no less than 8.0 min of monitoring reaction curves at 10-s intervals to solve the first challenge (Li, et al., 2011; Liu, et al., 2009). When *S*0 is much larger than three times *K*m, the reaction time to record reaction curves for analysis should be much longer than 5 min. Clearly, the first challenge can be solved with a practical *S*0 at favourable analysis efficiency.

To solve the second challenge, *K*m and other parameters should be optimized and fixed as constants to estimate *V*m by kinetic analysis of reaction curve, and a preset substrate concentration (PSC) should be optimized to convert *V*m into initial rates according to the differential rate equation. In theory, a reliable *V*m should be independent of the range of data when it is reasonably restricted, and CVs for estimating parameters by enzymatic analysis are usually about 5%. Hence, the estimation of *V*m with variations below 3% for changes of substrate consumption percentages from 60% to 90% can be a criterion to select the optimized set of preset parameters. For converting *V*m into initial rates, the optimized PSC is usually about 93% of *S*0 and can be refined for different enzymes (Li, et al., 2011; Liao, et al., 2009; Liu, et al., 2009; Yang, et al., 2011). The optimized *K*m and PSC that solve the second challenge are parameters for data processing, while the optimized *S*0 and reaction duration that solve the first challenge are experimental conditions. The concomitant solution of the two challenges provides feasibility and potential reliability to the integration strategy.
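The two numerical rules above, the 3% robustness criterion for *V*m and the conversion of *V*m into an initial-rate index at the PSC, can be sketched as follows. This is an illustrative reading of the text, assuming the simple Michaelis-Menten differential rate equation for the conversion; function names are ours, and the actual conversion in the cited papers may differ in detail:

```python
def vm_is_robust(vm_estimates, tol=0.03):
    """Criterion from the text: Vm estimated over different data ranges
    (substrate consumption from 60% to 90%) should vary by less than ~3%."""
    mean = sum(vm_estimates) / len(vm_estimates)
    return all(abs(v - mean) / mean < tol for v in vm_estimates)

def initial_rate_from_vm(vm, km, s0, psc_fraction=0.93):
    """Convert an estimated Vm into an initial-rate index by evaluating the
    Michaelis-Menten rate v = Vm*S/(Km + S) at the PSC (about 93% of S0)."""
    psc = psc_fraction * s0
    return vm * psc / (km + psc)
```

Evaluating the rate law at a common PSC for every sample is what makes the index from kinetic analysis of reaction curve directly comparable with the classical initial rate.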

rates with expanded linear ranges and practical analysis efficiency. This integration strategy is effective at substrate concentrations from one-eighth of *K*m to three-fold of *K*m (Li, et al., 2011; Liao, et al., 2009; Liu, et al., 2009; Yang, et al., 2011). The integration strategy for enzyme initial rate assay uses a special method to convert *V*m into initial rates so that the indexes of enzyme activities by both methods become the same; it is applicable to enzymes suffering strong inhibition by substrates/products (Li, et al., 2011). Walsh et al. proposed an integration strategy to measure enzyme initial rate but they employed Equ.(9) that requires substrate levels below 10% of *K*m (Walsh, et al. 2010). Our integration strategy is valid at any substrate level to satisfy Equ.(2) and hence can be a universal approach to common enzymes of different *K*m. The principles and applications of the integration strategy to one enzyme

As for one enzyme reaction systems, kinetic analysis of reaction curve can be realized with an integrated rate equation after data transformation; the integration strategy for enzyme initial rate assay requires enzyme kinetics on single substrate and an integrated rate equation with the predictor variable of reaction time (Liao, et al., 2003a, 2005a, Zhao, L.N., et al., 2006). Moreover, the integration strategy should solve the following challenges: (a) there should be an overlapped range of enzyme activities measurable by both methods with consistent results; (b) there should be consistent slopes of linear response for enzyme activities to enzyme quantities by both methods (Figure 1). After these two challenges are solved, the linear segment of response by the classical initial rate method is an extended line of the linear segment of response by kinetic analysis of reaction curve (Liu, et al., 2009).

To solve the first challenge, a practical *S*0 and reasonable duration to monitor reaction curve for favourable analysis efficiency are required as optimized experimental conditions. By mathematic derivation and simulation analyses to solve the first challenge, it is demonstrated that a ratio of *S*0 to *K*m from 0.5 to 2.5, the duration of 5.0 min to monitor reaction curves at intervals no longer than 10 s can solve the first challenge for most enzymes, any ratio of *S*0 to *K*m smaller than 0.5 or larger than 2.5 requires longer duration to monitor reaction curves. The use of *S*0 about one-eighth of *K*m requires no shorter than 8.0 min to monitor reaction curves at 10-s intervals to solve the first challenge (Li, et al., 2011; Liu, et al., 2009). When *S*0 is too much larger than three times *K*m, reaction time to record reaction curves for analysis to solve the first challenge should be much longer than 5 min. Clearly, the first challenge can be solved with practical *S*0 for favourable analysis efficiency. To solve the second challenge, *K*m and other parameters should be optimized and fixed as constants to estimate *V*m by kinetic analysis of reaction curve, and a preset substrate concentration (PSC) should be optimized to covert *V*m into initial rates according to the differential rate equation. In theory, a reliable *V*m should be independent of ranges of data when they are reasonably restricted, and CVs for estimating parameters by enzymatic analysis are usually about 5%. Hence, the estimation of *V*m with variations below 3% for the changes of substrate consumption percentages from 60% to 90% can be a criterion to select the optimized set of preset parameters. For converting *V*m into initial rates, the optimized PSC is usually about 93% of *S*0 and can be refined for different enzymes (Li, et al., 2011; Liao, et al., 2009; Liu, et al., 2009; Yang, et al., 2011). 
Optimized *K*m and PSC to solve the second challenge are parameters for data processing while optimized *S*0 and reaction duration to solve the first challenge are experimental conditions. The concomitant solution of the two

challenges provides feasibility and potential reliability to the integration strategy.

reaction systems and enzyme-coupled reaction systems are discussed below.

Fig. 1. The integration strategy for enzyme initial rate assay (Modified from Liu et al. (2009)).

After the integration strategy for enzyme initial rate assay is validated, a switch point should be determined for changing from the classical initial rate method to kinetic analysis of reaction curve. The estimation of *V*m by kinetic analysis of reaction curve usually prefers reasonably high substrate consumption percentages. Therefore, the substrate consumption percentage that gives an enzyme activity from 90% to 100% of the upper limit of linear response by the classical initial rate method can be used as the switch point.

It should be noted that the lower limit of linear response is difficult to define for enzyme initial rate assay by an integration strategy. For most methods, the lower limit of linear response is usually defined as three times the standard error of estimate (Miller, J. C. & Miller, J. N., 1993). Usually, an enzyme initial rate assay uses just one method for data processing, and the difference between the lower and upper limits of linear response is seldom over 30-fold. With the integration strategy, the measurable range of enzyme quantities covers two orders of magnitude and the detection limit is reduced to that of the classical initial rate method. In manual operation, different dilution ratios of a stock solution of the enzyme have to be used, and any dilution error increases the standard error of estimate for regression analysis. The measurement of higher enzyme activities inevitably has a larger standard deviation. Thus, regression analysis of the response of all measurable enzyme initial rates by the integration strategy to quantities of the enzyme gives a higher standard error of estimate and thus an unfavourable lower limit of linear response. For this new integration strategy, we arbitrarily use twice the lower limit of linear response of the classical initial rate method as the lower limit if the overall standard error of estimate is more than twice that of the classical initial rate method alone; otherwise, the lower limit of linear response is still three times the overall standard error of estimate.
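The decision rule at the end of this paragraph amounts to a small branch; a sketch with our own function and variable names (SEE stands for standard error of estimate):

```python
def lower_limit_of_linear_response(see_overall, see_classical, lower_classical):
    """Lower limit for the integration strategy: fall back to twice the
    classical lower limit when the overall SEE exceeds twice the classical
    SEE; otherwise keep the usual 3x the overall SEE."""
    if see_overall > 2.0 * see_classical:
        return 2.0 * lower_classical
    return 3.0 * see_overall
```

All three inputs must be expressed in consistent units (e.g., the same enzyme-activity units used for the regression).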

Taken together, for measuring initial rates of enzyme acting on single substrate by the integration strategy based on NLSF and data transformation, there are the following basic steps different from those by the classical initial rate method. The first is to work out the

Kinetic Analyses of Enzyme Reaction Curves

all advantages of common kinetic methods.

with New Integrated Rate Equations and Applications 169


integrated rate equation with the predictor variable of reaction time. The second is to optimize individually the parameters that are fixed as constants for kinetic analysis of the reaction curve. The third is to optimize the ratio of *S*0 to *K*m and the duration for monitoring reaction curves; usually a ratio of *S*0 to *K*m from 0.5 to 2.5, a duration of 5.0 min and sampling intervals of 10 s are effective. The fourth is to refine the PSC at around 93% of *S*0 for converting *V*m into initial rates.

As for enzyme-coupled reaction systems, the initial rate itself is estimated by kinetic analysis of the reaction curve, based on numerical integration and NLSF of calculated reaction curves to the reaction curve of interest. Consequently, neither the conversion of indexes nor the optimization of parameters for such conversion is required, and the integration strategy can be realized easily. For kinetic analysis of an enzyme-coupled reaction curve, there should still be a minimum number of effective data and a minimum substrate consumption percentage within the effective data for analysis; these prerequisites lead to unsatisfactory lower limits of linear response when favourable analysis efficiency (reaction duration within 5.0 min) is required. The classical initial rate method is effective for enzyme-coupled reaction systems when activities of the enzyme of interest are not too high. Therefore, this new approach for kinetic analysis of enzyme-coupled reaction curves can be integrated with the classical initial rate method to quantify enzyme initial rates, potentially over wider linear ranges.
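The numerical-integration-plus-NLSF idea can be sketched as follows. This is an illustrative sketch only, not the authors' code: the indicator-enzyme parameters, the absorptivity, and the simplifying assumption of a constant feed rate *v*0 from the enzyme of interest are all invented for the example.

```python
# Illustrative sketch (not the authors' code) of estimating an initial rate
# by numerically integrating a model of a coupled reaction curve and fitting
# it to an observed curve by nonlinear least squares (NLSF).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

KM_IND = 50.0    # assumed Km of the indicator enzyme, umol/L
VM_IND = 400.0   # assumed Vm of the indicator enzyme, umol/L/min
EPS = 0.00622    # assumed absorptivity, AU per umol/L of product

def model_curve(v0, t):
    """Absorbance curve when the enzyme of interest feeds its product, at a
    near-constant rate v0, to the indicator enzyme as substrate."""
    def rhs(_, y):
        s, p = y
        v_ind = VM_IND * s / (KM_IND + s)   # Michaelis-Menten indicator step
        return [v0 - v_ind, v_ind]
    sol = solve_ivp(rhs, (t[0], t[-1]), [0.0, 0.0], t_eval=t, rtol=1e-8)
    return EPS * sol.y[1]

t = np.linspace(0.0, 5.0, 31)              # 5.0 min curve, 10 s intervals
observed = model_curve(12.0, t)            # synthetic "observed" curve
fit = least_squares(lambda p: model_curve(p[0], t) - observed, x0=[5.0])
v0_hat = fit.x[0]                          # recovered initial rate, ~12 umol/L/min
```

Because the fitted quantity is the initial rate itself, no index conversion is needed afterwards, which is the point made in the paragraph above.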

With enzyme-coupled reaction systems, only the first challenge has to be solved to practice the integration strategy. Namely, the reaction duration and sampling intervals for recording reaction curves should be optimized so that there is an overlapped region of enzyme initial rates measurable by both methods with consistent results. The upper limit of the classical initial rate method should be high enough that data recorded after about 5.0 min of reaction, at enzyme activity around this upper limit, are suitable for kinetic analysis of the reaction curve. The integration strategy gives an approximate linear range extending from the lower limit of linear response of the classical initial rate method to the upper limit of linear response of kinetic analysis of the LDH-coupled ALT reaction curve (Yang, et al., 2010).
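Locating such an overlapped region can be sketched as below with synthetic numbers: compare the two methods' estimates over a range of preset activities and keep the activities where they agree within a tolerance. All values and the 10% consistency criterion are assumptions made for illustration.

```python
# Synthetic sketch of locating the overlapped region where the classical
# initial rate method and kinetic analysis of the reaction curve agree;
# all numbers and the 10% consistency criterion are assumptions.
import numpy as np

activities = np.array([5, 10, 20, 40, 80, 160])               # preset, U/L
rate_initial = np.array([5.1, 10.2, 19.8, 36.0, 60.0, 80.0])  # classical method
rate_kinetic = np.array([2.0, 8.5, 19.5, 39.8, 79.0, 158.0])  # curve analysis

rel_diff = np.abs(rate_initial - rate_kinetic) / activities
overlap = activities[rel_diff <= 0.10]     # region measurable consistently
```

In this toy data set the classical method saturates at high activities while curve analysis degrades at low ones, so only the middle activities qualify; a switch point would then be placed inside `overlap`.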

#### **2.4.2 New integration strategy for enzyme substrate assay**

Analysis of a biochemical as the substrate of a typical tool enzyme, *i.e.*, enzymatic analysis of substrate in biological samples, is important in biomedicine (Bergmeyer, 1983; Dilena, 1986; Guilbault, 1976; Moss, 1980). In general, there are kinetic methods and the end-point method for enzyme substrate assay; the end-point method, also called the equilibrium method, determines the difference between the initial signal of a reaction system before enzyme action and the last signal after the completion of the enzyme reaction; such a difference, proportional to *S*0, serves as the index of substrate concentration (Dilena, et al., 1986; Guilbault, 1976; Moss, 1980; Zhao, et al., 2009). For better analysis efficiency and lower cost in tool enzymes, kinetic methods for enzyme substrate assay are preferred. Among available kinetic methods, the initial rate method, based on the response of the initial rates of an enzyme at a fixed quantity to substrate concentrations, is conventional; however, it is sensitive to any factor affecting enzyme activities, requires tool enzymes of high *K*m, and has narrow linear ranges. Kinetic analysis of the reaction curve with a differential rate equation to estimate *S*0 has been proposed; it shows favourable resistance to variations in enzyme activities and an upper limit of linear response over *K*m, but it suffers from high sensitivity to background and has an unfavourable lower limit of linear response (Dilena, et al., 1986; Hamilton & Pardue, 1982; Moss, 1980). Hence, new kinetic methods for enzyme substrate assay are still desired.


For enzymatic analysis of substrate, the equilibrium method can still be preferable as long as it offers desirable analysis efficiency and a favourable cost in tool enzyme. In theory, the last signal for the stable product, or the background, in the equilibrium method can be estimated by kinetic analysis of the reaction curve using data recorded far before the completion of the reaction. This process constitutes a new kinetic method for enzyme substrate assay, distinguished from the equilibrium method and other kinetic methods by its prediction of the last signal after the completion of the enzyme reaction (Liao, 2005; Liao, et al., 2003, 2005a, 2006; Zhao, L.N., et al., 2006; Zhao, Y.S., et al., 2006, 2009). This new kinetic method should have resistance to factors affecting enzyme activities and an upper limit of linear response higher than *K*m, besides all the advantages of common kinetic methods.

An enzyme reaction curve can be monitored via the absorbance of a stable product or of the substrate itself (Figure 2). The initial absorbance before enzyme action (*A*0) is thus the background (*A*b) when absorbance of a stable product is quantified, or the absorbance of the substrate plus background when absorbance of the substrate is quantified. The last absorbance after the completion of the enzyme reaction, predicted by kinetic analysis of the reaction curve, is the maximum absorbance of the stable product plus the background (*A*m), or *A*b itself. There is strong covariance between the initial signal and the last signal of the same reaction system, and this covariance enhances the precision of this kinetic method for substrate assay (Baywenton, 1986; Liao, et al, 2005b; Zhao, Y.S., et al., 2009).
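The idea of predicting the last signal from data recorded far before completion can be sketched as follows. For simplicity the sketch treats the curve as first-order, which is only roughly valid when *S*0 is well below *K*m; the chapter's methods fit the full integrated rate equations instead, but the extrapolation principle is the same.

```python
# Sketch of predicting the last absorbance from data recorded far before
# completion. For simplicity the curve is treated as first-order (roughly
# valid when S0 << Km); the chapter fits integrated rate equations instead.
import numpy as np
from scipy.optimize import curve_fit

def curve(t, a_m, a_0, k):
    # monoexponential approach of absorbance A(t) to its final value a_m
    return a_m - (a_m - a_0) * np.exp(-k * t)

t_full = np.linspace(0.0, 300.0, 61)          # 5.0 min, 5 s intervals
a_full = curve(t_full, 0.450, 0.050, 0.008)   # noise-free synthetic curve

early = t_full <= 120.0                        # use only the first 2 min
popt, _ = curve_fit(curve, t_full[early], a_full[early], p0=[0.3, 0.0, 0.01])
a_m_pred = popt[0]                             # predicted last absorbance
```

Note that `a_m` and `a_0` are fitted jointly from the same curve, which is the covariance between initial and last signal that the paragraph above credits for the method's precision.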

Fig. 2. Demonstration of reaction curves of uricase (293 nm) and glutathione-S-transferase (340 nm), and the prediction of the last absorbance after infinite reaction time

However, this new kinetic method for substrate assay can by no means concomitantly achieve wider linear ranges and desirable analysis efficiency. Due to the prerequisites on the quality of data for kinetic analysis of a reaction curve, the activity of a tool enzyme for enzymatic analysis of substrate should be reasonably high for a higher upper limit of linear response, but reasonably low for a favourable lower limit of linear response. On the other hand, the duration for monitoring reaction curves should be long enough to give a higher upper limit at reasonable cost in tool enzyme, but as short as possible for favourable analysis efficiency. Thus, this new kinetic method alone requires demanding optimization of conditions. Moreover, there is inevitable random noise from any instrument used to record an enzyme reaction curve; when there is only a small difference between the initial signal before enzyme action and the last signal recorded after the preset reaction duration, this kinetic method for substrate assay always has unsatisfactory precision. Therefore, this new kinetic method by itself remains far from satisfactory for substrate assay.

Kinetic Analyses of Enzyme Reaction Curves with New Integrated Rate Equations and Applications





To concomitantly achieve wider linear ranges, desirable analysis efficiency and favourable precision for enzyme substrate assay, kinetic analysis of the reaction curve can be integrated with the equilibrium method. The indexes of substrate quantity given by the two methods have exactly the same physical meaning, and thus the integration strategy is easily realized for enzyme substrate assay. For this integration strategy, there should still be an overlapping range of substrate concentrations measurable consistently by both methods, together with a switch threshold within that overlap for changing from the equilibrium method to kinetic analysis of the reaction curve. Additionally, this overlapping region should lie at substrate concentrations high enough for reasonable precision of substrate assay based on kinetic analysis of the enzyme reaction curve. These requirements can be met as follows. (a) The upper limit of linear response of the equilibrium method should be optimized to be high enough that the difference between the initial signal before enzyme action and the last recorded signal, at about 80% of this upper limit, is 50 times higher than the random noise of the instrument recording the reaction curves; this difference can serve as the switch threshold. (b) The activity of the tool enzyme and the duration for monitoring the reaction curve should be optimized as experimental conditions, and the kinetic parameters except *V*m preset for kinetic analysis of the reaction curve should be optimized as well. The resistance of the predicted last signal to reasonable variations in the data range for analysis can serve as a criterion for judging the optimized set of preset parameters. For favourable analysis efficiency in clinical laboratories, the reaction duration can be about 5.0 min. 
This reaction duration dictates a minimum activity of the tool enzyme for the integration strategy, so that the upper limit of linear response of the equilibrium method is high enough for switching to kinetic analysis of the reaction curve. After these optimizations, the integration strategy can simultaneously offer wider linear ranges, higher analysis efficiency at lower cost, better precision, and stronger resistance to factors affecting enzyme activities.
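The switch rule itself is simple enough to state as code. This is a minimal sketch: the threshold of 0.050 AU is the value adopted later for uricase, and the function and variable names are ours, not the authors'.

```python
# Minimal sketch of the switch rule between the equilibrium method and
# kinetic analysis of the reaction curve; the threshold of 0.050 AU is the
# value used later for uricase, and the function name is ours.
SWITCH_THRESHOLD = 0.050   # absorbance change after ~5.0 min of reaction

def choose_method(a0, a_last_recorded):
    """Pick the assay branch from the recorded signal change."""
    if abs(a_last_recorded - a0) > SWITCH_THRESHOLD:
        # large change: reaction incomplete, predict the last signal instead
        return "kinetic_analysis"
    # small change: reaction essentially complete, use the recorded difference
    return "equilibrium"
```

Placing the threshold inside the overlapping region guarantees that whichever branch is chosen, the sample concentration lies within that branch's linear range.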

Similarly, with the integration strategy for enzyme substrate assay, we also take twice the lower limit of the equilibrium method as the lower limit of the integration strategy when the standard error of estimate is much larger; otherwise, three times the standard error of estimate of the integration strategy is taken as the lower limit of linear response.
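The three-times-the-standard-error-of-estimate rule can be illustrated with a synthetic calibration line; the standards, responses, and units below are invented for the example.

```python
# Sketch of the lower-limit rule with a synthetic calibration line; the
# standards and responses below are invented for illustration.
import numpy as np

conc = np.array([2.0, 5.0, 10.0, 20.0, 40.0])         # standards, umol/L
resp = np.array([0.021, 0.049, 0.102, 0.199, 0.401])  # measured responses, AU

slope, intercept = np.polyfit(conc, resp, 1)
resid = resp - (slope * conc + intercept)
see = np.sqrt(np.sum(resid ** 2) / (len(conc) - 2))   # standard error of estimate
lower_limit = 3.0 * see / slope                        # 3x SEE, in umol/L
```

Dividing by the slope expresses the limit in concentration units, so it can be compared directly with twice the equilibrium method's lower limit, as in the rule above.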

In general, the following steps are required to realize this integration strategy for enzyme substrate assay: (a) to work out the integrated rate equation with the predictor variable of reaction time; (b) to optimize individually the (kinetic) parameters preset as constants for kinetic analysis of reaction curve; (c) to optimize the activity of the tool enzyme so that data for the upper limit of linear response by the equilibrium method within about 5.0-min reaction are suitable for kinetic analysis of reaction curve. As demonstrated later, this integration strategy is applicable to enzymes suffering from strong product inhibition.

#### **2.5 Applications of new methods to some typical enzymes**

We investigated kinetic analysis of reaction curves with arylesterase (Liao, et al., 2001, 2003a, 2007b), alcohol dehydrogenase (ADH) (Liao, et al., 2007a), gamma-glutamyltransferase (Li, et al., 2011), uricase (Liao, 2005; Liao, et al., 2005a, 2005b, 2006; Liu, et al., 2009; Zhao, Y.S., et al., 2006, 2009), glutathione-S-transferase (GST) (Liao, et al., 2003b; Zhao, L.N., et al., 2006), butyrylcholinesterase (Liao, et al., 2009; Yang, et al., 2011), LDH (Cheng, et al., 2008) and LDH-coupled ALT reaction systems (Yang, et al., 2010). Uricase, with its simple kinetics, is a good example for studying new methods of kinetic analysis of reaction curves; the GST and ADH reactions suffer product inhibition, and kinetic analyses of their reaction curves are complicated because they require unreported parameters. Hence, our new methods for kinetic analysis of reaction curves and the integration strategies for quantifying enzyme substrates and initial rates are demonstrated with uricase, GST and ADH as examples.

#### **2.5.1 Uricase reaction**



Uricase follows simple Michaelis-Menten kinetics on its single substrate in air-saturated buffers, and suffers neither reversible reaction nor product inhibition (Liao, 2005; Liao, et al., 2005a, 2005b; Zhao, Y.S., et al., 2006). The uricase reaction curve can be monitored by absorbance at 293 nm. The potential interference of the intermediate 5-hydroxyisourate with uric acid absorbance at 293 nm can be alleviated by analyzing data of the steady-state reaction in borate buffer at high reaction pH (Kahn & Tipton, 1998; Priest & Pitts, 1972). The integrated rate equation for the uricase reaction with the predictor variable of reaction time is Equ.(4). Uricases from different sources have different *K*m (Liao, et al., 2005a, 2006; Zhang, et al., 2010; Zhao, Y.S., et al., 2006). Using Equ.(4), *K*m of *Candida utilis* uricase is estimated with reasonable reliability (Liao, et al., 2005a). Using Equ.(9) to estimate the ratio of *V*m to *K*m, uricase mutants of better catalytic capacity and their sensitivity to xanthine are routinely characterized (data unpublished). Thus, we used uricases of different *K*m as models to test the two integration strategies for enzyme substrate assay and initial rate assay, respectively.

Uricase from *Bacillus fastidiosus* A.T.C.C. 29604 has a high *K*m that facilitates predicting *A*b (Zhang, et al., 2010; Zhao, Y.S., et al., 2006, 2009). Reaction curves at low levels of uric acid with this uricase at 40 U/L are demonstrated in Fig. 3. Steady-state reaction is not reached within 30 s of reaction initiation, and it is difficult to obtain more than five data points with absorbance changes over 0.003 for kinetic analysis of the reaction curve at uric acid levels below 3.0 µmol/L. At 40 U/L of this uricase, the absorbance after reaction for 5.0 min differs negligibly from that after reaction for 30 min for uric acid below 5.0 µmol/L. Quantifying the difference between *A*0 and *A*b after reaction for 5.0 min, the equilibrium method has an upper limit of about 5.0 µmol/L, while kinetic analysis of the reaction curve with *K*m as a constant is feasible for *S*0 of about 5.0 µmol/L. Thus, an absorbance change over 0.050 between *A*0 and the absorbance after reaction for 5.0 min can serve as the switch threshold to change from the equilibrium method to kinetic analysis of the reaction curve.

This integration strategy for enzyme substrate assay gives linear response from about 1.5 µmol/L up to 60 µmol/L uric acid at 40 U/L uricase (Fig. 4, unpublished), and shows resistance to the action of xanthine at 30 µmol/L in reaction solutions (this level of xanthine always caused negative interference with all available commercial kits for serum uric acid assay). Therefore, the integration strategy for uric acid assay is clearly superior to any other uricase method reported.

Uricases from *Candida* sp. with *K*m of 6.6 µmol/L (Sigma U0880) and *Bacillus fastidiosus* uricase from A.T.C.C. 29604 with *K*m of 0.22 mmol/L were used to test the integration strategy for initial rate assay. Using uric acid at *S*0 of 25 µmol/L to monitor reaction curves within 8.0 min, or at *S*0 of 75 µmol/L to monitor reaction curves within 5.0 min, the integration strategy to measure initial rates of both uricases is feasible; the use of a PSC of 93% of *S*0 to convert *V*m into initial rates gives a linear range of about two orders of magnitude (Liu, et al., 2009). Therefore, the integration strategy for enzyme initial rate assay is also advantageous.


Fig. 3. Reaction curves (absorbance at 293 nm) at low levels of uric acid and 40 U/L uricase (recombinant uricase expressed in *E. coli* BL21 as reported before (Zhang, et al., 2010)).


Fig. 4. Response of absorbance change at 293 nm to preset uric acid levels at 40 U/L uricase.

#### **2.5.2 Glutathione-S-transferase reaction**

Using purified alkaline GST isozyme from porcine liver as a model, with glutathione (GSH) and 2,4-dinitrochlorobenzene (CDNB) as substrates, GST reaction curves are monitored by absorbance at 340 nm (Kunze, 1997; Pabst, et al, 1974; Zhao, L.N., et al., 2006). To promote reaction on a single substrate, CDNB is fixed at 1.0 mmol/L while GSH concentrations are kept below 0.10 mmol/L (Zhao, L.N., et al., 2006). Because the concentration of product is calculated from absorbance at 340 nm, the background absorbance before GST reaction is adjusted to zero so that there is no need to treat *A*b as a parameter. This treatment of background absorbance eliminates the estimation of *A*b and thus avoids the problem of covariance between *A*b and *A*m in NLSF. However, the GST reaction is more complicated than the uricase reaction because it suffers strong product inhibition with an unreported inhibition constant (Kunze, 1997; Pabst, et al, 1974). Thus, the effectiveness of the two integration strategies for measuring initial rates and GSH levels is tested after the inhibition constant of the product is optimized for kinetic analysis of the GST reaction curve.


The following symbols are assigned: *C* to the instantaneous concentration of CDNB, *B* to the instantaneous concentration of GSH, *Q* to the instantaneous concentration of the product, *K*ma to *K*m of GST for CDNB, *K*mb to *K*m of GST for GSH, *K*ib to the dissociation constant of GSH, *K*iq to the dissociation constant of the product, *A* to the instantaneous absorbance, *A*m to the maximal absorbance of the product, *ε* to the difference in absorptivity between the product and CDNB, and *V*m to the maximal reaction rate of GST. The differential rate equation for the GST reaction is Equ.(13). After the definition of *M*1, *M*2 and *M*3, the integrated rate equation with the predictor variable of reaction time is Equ.(19), provided the GST reaction is irreversible and a process similar to that for Equ.(4) is employed (Zhao, L.N., et al., 2006).

$$\frac{1}{v} = \frac{K_{\rm mb}}{V_{\rm m}} \times \left(1 + \frac{K_{\rm ia} \times K_{\rm ma} \times Q}{K_{\rm iq} \times K_{\rm mb} \times C}\right) \times \frac{1}{B} + \frac{1}{V_{\rm m}} \times \left[1 + \frac{K_{\rm ma}}{C} \times \left(1 + \frac{Q}{K_{\rm iq}}\right)\right] \tag{13}$$

$$M1 = K_{\rm ma}\big/\left(\varepsilon \times K_{\rm iq}\right) \tag{14}$$

$$M2 = K_{\rm ma} - K_{\rm ia} \times K_{\rm ma}\big/K_{\rm iq} - A_{\rm m} \times K_{\rm ma}\big/\left(\varepsilon \times K_{\rm iq}\right) + C - A_0 \times K_{\rm ma}\big/\left(\varepsilon \times K_{\rm iq}\right) \tag{15}$$

$$M3 = K_{\rm ma} \times A_{\rm m} + \varepsilon \times K_{\rm mb} \times C + C \times A_{\rm m} - K_{\rm ia} \times K_{\rm ma} \times A_0\big/K_{\rm iq} - K_{\rm ma} \times A_{\rm m} \times A_0\big/\left(K_{\rm iq} \times \varepsilon\right) \tag{16}$$

$$\frac{M1 \times A^2 + M2 \times A - M3}{A - A_{\rm m}} \times {\rm d}A = C \times \varepsilon \times V_{\rm m} \times {\rm d}t \tag{17}$$

$$Y = M1 \times \left(A - A_{\rm m}\right)^2\big/2 + \left(2 \times M1 \times A_{\rm m} + M2\right) \times \left(A - A_{\rm m}\right) + \left(M1 \times A_{\rm m}^2 + M2 \times A_{\rm m} - M3\right) \times \ln\left|A - A_{\rm m}\right| \tag{18}$$

$$Y = C \times \varepsilon \times V_{\rm m} \times \left(t - t_{\rm lag}\right) = a + b \times t \tag{19}$$
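To make Equ.(14) to Equ.(19) concrete, the following Python sketch (with illustrative parameter values that are assumptions of this example, not the chapter's fitted constants) generates a GST-type reaction curve by numerically integrating the differential form of Equ.(17) and then recovers *V*m from the slope of *Y* against reaction time according to Equ.(19):

```python
import math

# Illustrative constants (assumed for this sketch, not the chapter's values);
# concentrations in mmol/L, time in s, absorbance dimensionless.
K_ma, K_mb = 0.10, 0.05
K_ia, K_iq = 0.20, 0.004     # K_iq ~ 4.0 umol/L, as in the text
eps = 9.6                    # absorptivity difference of product vs. CDNB, per mmol/L
C = 1.0                      # CDNB, fixed in excess
V_m = 0.002                  # "true" maximal rate used to generate the curve
A0, B0 = 0.0, 0.05           # initial absorbance; initial GSH (50 umol/L)
A_m = A0 + eps * B0          # maximal absorbance when GSH is exhausted

# M1, M2, M3 as defined in Eqs (14)-(16)
M1 = K_ma / (eps * K_iq)
M2 = (K_ma - K_ia * K_ma / K_iq - A_m * K_ma / (eps * K_iq)
      + C - A0 * K_ma / (eps * K_iq))
M3 = (K_ma * A_m + eps * K_mb * C + C * A_m
      - K_ia * K_ma * A0 / K_iq - K_ma * A_m * A0 / (K_iq * eps))

def dA_dt(A):
    # Eq (17) rearranged: dA/dt = C*eps*V_m*(A - A_m) / (M1*A^2 + M2*A - M3)
    return C * eps * V_m * (A - A_m) / (M1 * A * A + M2 * A - M3)

# Simulate the reaction curve by Euler integration, sampling every 5 s
dt, n_steps = 0.01, 30000
ts, As, A = [], [], A0
for k in range(n_steps + 1):
    if k % 500 == 0:
        ts.append(k * dt)
        As.append(A)
    A += dA_dt(A) * dt

def Y(A):
    # Eq (18): transform that is linear in t for data obeying Eq (17)
    d = A - A_m
    return (M1 * d * d / 2 + (2 * M1 * A_m + M2) * d
            + (M1 * A_m ** 2 + M2 * A_m - M3) * math.log(abs(d)))

# Least-squares slope of Y versus t; by Eq (19) the slope equals C*eps*V_m
ys = [Y(a) for a in As]
tbar, ybar = sum(ts) / len(ts), sum(ys) / len(ys)
slope = (sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys))
         / sum((t - tbar) ** 2 for t in ts))
V_m_est = slope / (C * eps)
print(f"recovered V_m = {V_m_est:.6f} (true {V_m})")
```

Because the transform in Equ.(18) is exact for data obeying Equ.(17), the plot of *Y* against *t* is a straight line and the slope returns *V*m regardless of how far the reaction has proceeded.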

Fig. 5. Response of the estimated *V*m to changes in data ranges for analyses with 60 μmol/L GSH.

Kinetic Analyses of Enzyme Reaction Curves with New Integrated Rate Equations and Applications

As demonstrated in the definition of *M*1, *M*2 and *M*3, kinetic parameters preset as constants for kinetic analysis of GST reaction curve should have strong covariance. Except for *K*iq as an unknown kinetic parameter for optimization, other kinetic parameters are those reported (Kunze, 1997; Pabst, et al, 1974). To optimize *K*iq, two criteria are used. The first is the consistency of predicted *A*m at a series of GSH concentrations using data of 6.0-min reaction with that by the equilibrium method after 40 min reaction (GST activity is optimized to complete the reaction within 40 min). The second is the resistance of *V*m to reasonable changes in data ranges for analyses. After stepwise optimization, *K*iq is fixed at 4.0 μmol/L; *A*m predicted for GSH from 5.0 μmol/L to 50 μmol/L is consistent with that by the equilibrium method (Zhao, L.N., et al. 2006); the estimation of *V*m is resistant to changes of data ranges (Fig. 5). Therefore, *K*iq is optimized and fixed as a constant at 4.0 μmol/L.

Fig. 6. Response of GSH concentration determined to preset GSH concentrations (the equilibrium method uses data with 6.0 min reaction).

Fig. 7. Response of initial rates to quantities of purified porcine alkaline GST.

Kinetic analysis of GST reaction curve can predict *A*m for GSH over 4.0 μmol/L, but there are no sufficient data for analyses at GSH below 3.0 μmol/L; after optimization of GST activity for complete conversion of GSH at 5.0 μmol/L within 6.0 min, the reaction curve within 5.0 min for GSH at 5.0 μmol/L can be used for kinetic analysis of reaction curve to predict *A*m. With the optimized GST activity for reaction within 5.0 min, the linear range for GSH assay is from 1.5 μmol/L to over 90.0 μmol/L by the integration strategy while it is from 4.0 μmol/L to over 90.0 μmol/L by kinetic analysis of reaction curve alone (Fig. 6, unpublished). By the equilibrium method alone for reaction within 5.0 min, the assay of 80.0 μmol/L GSH requires GST activity that is 50-fold higher due to the inhibition of GST by the accumulated product. Therefore, the integration strategy for GSH assay is obviously advantageous.

The integration strategy for measuring GST initial rates is tested. For convenience, *S*0 of the final GSH is fixed at 50 μmol/L and the duration to monitor the reaction curve is optimized. After the analyses of reaction curves recorded within 10 min, it is found that reaction for 6.0 min is sufficient to provide the required overlapped region of GST activities measurable by both methods. By using *K*iq fixed at 4.0 μmol/L as a constant, a reaction duration of 6.0 min and the PSC at 48 μmol/L to convert *V*m to initial rates, the integration strategy gives a linear range from 2.0 U/L to 60 U/L; kinetic analysis of reaction curve alone gives a linear range from 5.0 U/L to 60 U/L while the classical initial rate method alone gives a linear range from 1.0 U/L to 5.0 U/L (Fig. 7, unpublished). Clearly, for an enzyme suffering strong product inhibition, the integration strategy for enzyme initial rate assay is advantageous.
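The conversion of a fitted *V*m to an initial rate at the preset substrate concentration (PSC) follows from Equ.(13) with *Q* = 0, i.e. before any product has accumulated. A minimal sketch, with kinetic constants that are illustrative assumptions rather than the chapter's values:

```python
# Converting a fitted V_m to an initial rate at a preset substrate
# concentration (PSC), using Eq (13) with Q = 0 (no product yet).
# Kinetic constants here are illustrative assumptions, in mmol/L.
K_ma, K_mb = 0.10, 0.05
C = 1.0          # CDNB, fixed in excess
PSC = 0.048      # 48 umol/L, the conversion point used in the text

def initial_rate(v_m, b=PSC):
    """v0 from the reciprocal form of Eq (13) at Q = 0:
    1/v0 = (K_mb/V_m)/B + (1 + K_ma/C)/V_m."""
    return v_m / (K_mb / b + 1.0 + K_ma / C)

v_m = 0.002                       # fitted maximal rate (arbitrary units)
v0 = initial_rate(v_m)
print(f"v0 = {v0:.6f} ({100 * v0 / v_m:.1f}% of V_m)")
```

Because the PSC is comparable to *K*mb, the initial rate is an appreciable but sub-maximal fraction of *V*m, which is why a fixed PSC is needed to put *V*m estimates and classical initial rates on the same scale.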

#### **2.5.3 Alcohol dehydrogenase reaction**

ADH is widely used for serum ethanol assay. ADH kinetics is complicated due to the reversibility of the reaction and the inhibition by both acetaldehyde and NADH as products. To simplify ADH kinetics, some special approaches are employed to make the ADH reaction apparently irreversible on a single substrate (ethanol). Thus, reaction pH is optimized to 9.2 to scavenge hydrogen ion; semicarbazide at a final 75 mmol/L is used to remove acetaldehyde as completely as possible; final nicotinamide adenine dinucleotide (NAD+) is 3.0 mmol/L; final ADH is about 50 U/L (Liao, et al., 2007a). By assigning the maximal absorbance at 340 nm for reduced nicotinamide adenine dinucleotide (NADH) by the equilibrium method to *A*me and that by kinetic analysis of reaction curve to *A*mk, kinetic analysis of ADH reaction curve should predict *A*mk consistent with *A*me, but this requires some special efforts.

Fig. 8. Response of *F* values to preset *C*ald for kinetic analysis of reaction curve for 0.31 mmol/L ethanol (reproduced with permission from Liao, et al, 2007a).

The use of semicarbazide reduces concentrations of acetaldehyde (*C*ald) to unknown levels, and thus complicates the treatment of acetaldehyde inhibition on ADH. The integrated rate equation with the predictor variable of reaction time can be worked out for ADH (Liao, et al., 2007a). All kinetic parameters and NAD+ concentrations are preset as those used or reported (Ganzhorn, et al. 1987). However, there are multiple maxima of the goodness of fit with the continuous increase in steady-state *C*ald for kinetic analysis of reaction curve (Fig. 8). Thus, *C*ald cannot be concomitantly estimated by kinetic analysis of reaction curve, and a special approach is used to approximate steady-state *C*ald for predicting *A*mk.

Fig. 9. Correlation function of the best steady-state *C*ald with *A*me (reproduced with permission from Liao, et al, 2007a).

Under the same reaction conditions, the equilibrium method can determine *A*me for ethanol below 0.20 mmol/L after reaction for 50 min. For kinetic analyses of such reaction curves, the lag time for steady-state reaction is estimated to be over 40 s and is used to select data of steady-state reaction for analysis. Using the equilibrium method as the reference method, the best steady-state *C*ald for data of 6.0-min reaction is obtained for consistency of *A*mk with *A*me at each tested ethanol level from 10 μmol/L to 0.17 mmol/L. After dilution and determination by the equilibrium method, *A*me for each tested ethanol level from 0.17 mmol/L to 0.30 mmol/L is also available. Consequently, an exponential additive function is obtained to approximate the correlation of the best *C*ald with *A*me for predicting *A*mk consistent with *A*me (Fig. 9). This special correlation function of *C*ald and *A*mk is used as a restriction function to iteratively adjust *C*ald for predicting *A*mk; namely, iterative kinetic analysis of reaction curve with *C*ald predicted from the restriction function using the previous *A*mk finally gives the desired *A*mk. Such an artificial intelligence approach to the steady-state *C*ald for kinetic analysis of reaction curve can hardly be found in publications.

To start kinetic analysis of an ADH reaction curve, the highest absorbance under analysis is taken as *A*mk to predict the best *C*ald for the current run of kinetic analysis of reaction curve. The estimated *A*mk is then used to predict the second *C*ald for the second run of kinetic analysis of reaction curve (Fig. 10). Such an iterative kinetic analysis of reaction curve can predict *A*mk consistent with *A*me for 0.31 mmol/L ethanol when the reaction duration is just 6.0 min and the convergence criterion is set to an absorbance change below 0.0015 in *A*mk. Usually convergence is achieved within 7 runs of the iterative kinetic analysis of reaction curve. Moreover, it is resistant to changes of ADH activities by 50%, and coefficients of variation (CV) are below 5% for final ethanol levels from 20 μmol/L to 310 μmol/L in reaction solutions.
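The iterative adjustment above can be sketched as a fixed-point loop. Both functions below are hypothetical stand-ins: `restriction` only mimics the shape of the chapter's exponential additive correlation of the best steady-state *C*ald with *A*m, and `fit_Amk` replaces a full nonlinear-least-square run of kinetic analysis with a smooth surrogate, so only the iteration scheme and the 0.0015 convergence criterion are illustrated.

```python
import math

def restriction(a_mk):
    # Hypothetical exponential additive form: C_ald(A_mk) = p + q*(1 - exp(-r*A_mk));
    # the coefficients are assumptions, not the published function.
    p, q, r = 0.002, 0.010, 3.0
    return p + q * (1.0 - math.exp(-r * a_mk))

def fit_Amk(c_ald, a_highest=0.95):
    # Stand-in for one run of kinetic analysis of the curve with C_ald held
    # fixed: a smooth map that lowers A_mk slightly as the assumed C_ald grows.
    return a_highest * (1.0 - 2.0 * c_ald)

# Start from the highest absorbance under analysis, as in the text
a_mk = 0.95
for run in range(1, 50):
    c_ald = restriction(a_mk)          # predict C_ald from the previous A_mk
    a_new = fit_Amk(c_ald)             # re-analyze the curve with that C_ald
    if abs(a_new - a_mk) < 0.0015:     # convergence criterion from the text
        a_mk = a_new
        break
    a_mk = a_new
print(f"converged after {run} runs, A_mk = {a_mk:.4f}")
```

Because the composed update is a strong contraction here, the loop settles in a couple of runs; with real curve fitting the text reports convergence within about 7 runs.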


Fig. 10. Iterative adjustment of *C*ald to predict *A*mk for 0.31 mmol/L ethanol at 50 U/L ADH (reproduced with permission from Liao, et al, 2007a).

Obviously, by this special approach for kinetic analysis of ADH reaction curve, the upper limit of linear response is excellent, but the lower limit of linear response is over 5.0 μmol/L ethanol. Under the stated reaction conditions, the equilibrium method after reaction for 8.0 min is effective to quantify ethanol up to a final 6.0 μmol/L. Thus, the equilibrium method with a reaction duration of 8.0 min can be integrated with iterative kinetic analysis of reaction curve for quantifying ethanol; this integration strategy gives a linear range from about a final 2.0 μmol/L to about 0.30 mmol/L ethanol in reaction solutions; it has CVs below 8% for ethanol below 10 μmol/L, and CVs below 5% for ethanol over 20 μmol/L (Liao, et al., 2007a). These results clearly support the advantage of the new integration strategy for substrate assay and the importance of chemometrics in kinetic enzymatic analysis of substrate.

#### **2.6 Programming for kinetic analysis of enzyme reaction curve**

Most software packages such as Origin, SAS and MATLAB can perform kinetic analysis of reaction curve, but they are usually ineffective with the implicit functions required for kinetic analysis of reaction curve. For convenience, and for the use of some complicated methods for kinetic analysis of reaction curve in window-aided mode, self-programming is still favourable.

For simplicity in programming, we used Visual Basic 6.0 to write the source code and working windows (Liu, et al., 2011). The executable program has a main window to perform kinetic analysis of reaction curve (Fig. 11). Original data for each reaction curve are stored as a text file, and keywords are used to indicate specific information related to the reaction curve, including sample numbering, the enzyme used, the quantification method, some necessary kinetic parameters, and usually the initial signal before enzyme action. Such information is read into memory by the software for kinetic analysis of reaction curve.
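A minimal reader for this kind of keyword-tagged text file might look as follows; the keyword names and layout are assumptions for illustration, since the actual PCFenzyme file format is not fully specified here.

```python
# Hypothetical keyword-tagged file layout: metadata lines of the form
# "KEYWORD value", then a DATA marker followed by time/absorbance pairs.
SAMPLE_FILE = """\
SAMPLE 017
ENZYME GST
METHOD integration
KM 0.05
A0 0.012
DATA
0.0 0.012
5.0 0.058
10.0 0.101
"""

def read_curve(text):
    """Return (metadata dict, list of (time, absorbance) points)."""
    meta, points, in_data = {}, [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "DATA":
            in_data = True
        elif in_data:
            t, a = line.split()
            points.append((float(t), float(a)))
        else:
            key, value = line.split(maxsplit=1)
            meta[key] = value
    return meta, points

meta, points = read_curve(SAMPLE_FILE)
print(meta["ENZYME"], len(points), "points")
```

Keeping the metadata as free-form keyword lines lets each reaction curve carry its own kinetic parameters, so the analysis subprogram only needs a few common parameters typed into the main window.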

On the main window for kinetic analysis of reaction curve, original data are listed and plotted for visual checking of data for steady-state reaction. Text boxes are used to input some common parameters like *K*m, and most parameters are read from the text file for the reaction curve. The subprogram for an enzyme reaction system is then called; results are displayed on the main window and may be saved as a text file for further analysis.

Fig. 11. Main window for the executable PCFenzyme

We called the software PCFenzyme. An old version of the executable PCFenzyme can be downloaded at http://dx.doi.org/10.1016/j.clinbiochem.2008.11.016. The latest version of the executable PCFenzyme with new methods included is available upon request by e-mail.

#### **3. Conclusion**

The following conclusions can be drawn. (a) Kinetic analysis of reaction curve can give the initial substrate concentration before enzyme action, the maximal reaction rate, the Michaelis-Menten constant and other related parameters of an enzyme reaction system; for reliability, however, it is better to estimate just the initial substrate concentration before enzyme action and the maximal reaction rate, with the Michaelis-Menten constant and other parameters fixed as constants after optimization. (b) For an enzyme whose integrated rate equation with the predictor variable of reaction time is accessible, kinetic analysis of reaction curve can estimate parameters *via* nonlinear-least-square-fitting after transformation of data from the reaction curve under analysis. (c) For an enzyme reaction system whose kinetics is described by a set of differential rate equations or is difficult to integrate with the predictor of reaction time, iterative numerical integration of the differential rate equation(s) with a series of preset parameters can produce serial calculated reaction curves; such calculated reaction curves can be fitted to the reaction curve under analysis for estimating parameters based on nonlinear-least-square-fitting. This approach is applicable to enzyme-coupled reaction systems of sophisticated kinetics. (d) The integration of kinetic analysis of reaction curve with the equilibrium method can quantify enzyme substrates with expanded linear ranges, favourable analysis efficiency, low cost of tool enzyme, desirable resistance to factors affecting enzyme activities and enhanced precision; it can be applied to enzyme reactions suffering strong product inhibition. (e) The integration of kinetic analysis of reaction curve with the classical initial rate method can measure enzyme initial rates with wide linear ranges, favourable analysis efficiency and practical levels of substrates; it is applicable to enzyme-coupled reaction curves or enzyme reactions suffering product inhibition.

Taken together, kinetic analysis of enzyme reaction curves under optimized conditions can screen common reversible inhibitors and enzyme mutants; the integration strategy for measuring enzyme activities can quantify serum enzymes and enzyme labels in enzyme immunoassays with expanded quantifiable ranges, and can be applied to quantify irreversible inhibitors as environmental pollutants; the integration strategy to quantify enzyme substrates can serve as second-generation approaches and potentially find wide applications in clinical laboratory medicine. Therefore, these new methodologies for enzymatic analyses based on chemometrics can potentially find important applications in biomedical sciences.

#### **4. Acknowledgment**

This work is supported by the program for New Century Excellent Talent in University (NCET-09), the high-technology program "863" of China (2011AA02A108), the National Natural Science Foundation of China (nos. 30200266, 30672009, 81071427), the Chongqing Municipal Commission of Sciences and Technology (CQ CSTC2011BA5039), and the Chongqing Education Commission (KJ100313).

#### **5. References**

Atkins, G.L. & Nimmo, I.A. (1973). The reliability of Michaelis–Menten constants and maximum velocities estimated by using the integrated Michaelis–Menten equation. *The Biochemical Journal*, vol. 135, no. 4, (December 1973), pp. 779–784, ISSN 0264-6021

Baywenton, P. R. (1986). *Data process and error analysis* (Translated into Chinese by Weili Qiu, Genxin Xu, Enguang Zhao, and Shengzhong Chen), ISBN 13214.84, Knowledge Press, Beijing, China

Bergmeyer, H.U. (1983). *Methods of Enzymatic Analysis, Vol. I. Fundamentals (3rd Ed.)*, ISBN 978-3527260416, Wiley VCH, Weinheim, Germany

Burden, R.L. & Faires, J.D. (2001). *Numerical Analysis* (7th ed.), ISBN 978-0534382162, Academic Internet Publishers, Ventura, California, USA

Burguillo, J., Wright, A.J. & Bardsley, W. G. (1983). Use of the F test for determining the degree of enzyme-kinetic and ligand-binding data. *The Biochemical Journal*, vol. 211, no. 1, (April 1983), pp. 23–34, ISSN 0264-6021

Cheng, Y.C. & Prusoff, W.H. (1973). Relationship between the inhibition constant (KI) and the concentration of inhibitor which causes 50 per cent inhibition (I50) of an enzymatic reaction. *Biochemical Pharmacology*, vol. 22, no. 23, (December 1973), pp. 3099-3108, ISSN 0006-2952

Cheng, Z.L., Chen, H., Zhao, Y.S., Yang, X.L., Lu, W., Liao, H., Yu, M.A., & Liao, F. (2008). The measurement of the activity of rabbit muscle lactic dehydrogenase by integrating the classical initial rate method with an integrated method. *2nd International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2008*, pp. 1209-1212, ISBN 978-1-4244-1748-3, Shanghai, China, May 26-28, 2008


Gutierrez, O.A. & Danielson, U. H. (2006). Sensitivity analysis and error structure of progress curves. *Analytical Biochemistry*, vol. 358, no. 1, (August 2006), pp. 1-10, ISSN 0003-2697

Hamilton, S. D. & Pardue, H. L. (1982). Kinetic method having a linear range for substrate concentrations that exceed Michaelis–Menten constants. *Clinical Chemistry*, vol. 28, no. 12, (December 1982), pp. 2359–2365, ISSN 0009-9147

Hasinoff, B. B. (1985). A convenient analysis of Michaelis enzyme kinetic progress curves based on second derivatives. *Biochimica et Biophysica Acta (BBA) - General Subjects*, vol. 838, no. 2, (February 1985), pp. 290-292, ISSN 0304-4165

Kahn, K. & Tipton, P.A. (1998). Spectroscopic characterization of intermediates in the urate oxidase reaction. *Biochemistry*, vol. 37, (August 1998), pp. 11651-11659, ISSN 0006-2960

Koerber, S. C. & Fink, A. L. (1987). The analysis of enzyme progress curves by numerical differentiation, including competitive product inhibition and enzyme reactivation. *Analytical Biochemistry*, vol. 165, no. 1, pp. 75-87, ISSN 0003-2697

Li, Z.R., Liu, Y., Yang, X.Y., Pu, J., Liu, B.Z., Yuan, Y.H., Xie, Y.L. & Liao, F. (2011). Kinetic analysis of gamma-glutamyltransferase reaction process for measuring activity via an integration strategy at low concentrations of gamma-glutamyl p-nitroaniline. *Journal of Zhejiang University Science B*, vol. 12, no. 3, (March 2011), pp. 180-188, ISSN 1673-1581

Liao, F. (2005). *The method for quantitative enzymatic analysis of uric acid in body fluids by predicting the background absorbance*. China patent: ZL O3135649.4, 2005-08-31

Liao, F., Li, J.C., Kang, G.F., Zeng, Z.C., Zuo, Y.P. (2003a). Measurement of mouse liver glutathione-S-transferase activity by the integrated method. *Journal of Medical Colleges of PLA*, vol. 18, no. 5, (October 2003), pp. 295-300, ISSN 1000-1948

Liao, F., Liu, W.L., Zhou, Q.X., Zeng, Z.C., Zuo, Y.P. (2001). Assay of serum arylesterase activity by fitting to the reaction curve with an integrated rate equation. *Clinica Chimica Acta*, vol. 314, no. 1-2, (December 2001), pp. 67-76, ISSN 0009-8981

Liao, F., Tian, K.C., Yang, X., Zhou, Q.X., Zeng, Z.C., Zuo, Y.P. (2003b). Kinetic substrate quantification by fitting to the integrated Michaelis-Menten equation. *Analytical and Bioanalytical Chemistry*, vol. 375, no. 6, (February 2003), pp. 756-762, ISSN 1618-2642

Liao, F., Yang, D.Y., Tang, J.Q., Yang, X.L., Liu, B.Z., Zhao, Y.S., Zhao, L.N., Liao, H. & Yu, M.A. (2009). The measurement of serum cholinesterase activities by an integration strategy with expanded linear ranges and negligible substrate-activation. *Clinical Biochemistry*, vol. 42, no. 6, (December 2008), pp. 926-928, ISSN 0009-9120

Liao, F., Zhao, L.N., Zhao, Y.S., Tao, J., Zuo, Y.P. (2007a). Integrated rate equation considering product inhibition and its application to kinetic assay of serum ethanol. *Analytical Sciences*, vol. 23, no. 4, (April 2007), pp. 439-444, ISSN 0910-6340

Liao, F., Zhao, Y.S., Zhao, L.N., Tao, J., Zhu, X.Y., Liu, L. (2006). The evaluation of a direct kinetic method for serum uric acid assay by predicting the background absorbance of uricase reaction solution with an integrated method. *Journal of Zhejiang University Science B*, vol. 7, no. 6, pp. 497-502, ISSN 1673-1581

Claro, E. (2000). Understanding initial velocity after the derivatives of progress curves.

Cornish-Bowden, A. (1975). The use of the direct linear plot for determining initial

Cornish-Bowden, A. (1995). *Analysis of enzyme kinetic data*, ISBN 978-0198548775, Oxford

Dagys, R., Tumas, S., Zvirblis, S. & Pauliukonis, A. (1990) Determination of first and second

Dagys, R., Pauliukonis, A., Kazlauskas, D., Mankevicius, M. & Simutis, R. (1986).

Dilena, B.A., Peake, M.J., Pardue, H.L., Skoug, J.W. (1986). Direct ultraviolet method for

Draper, N.R. & Smith, H. (1998). *Applied regression analysis* (3rd ed.), ISBN 978-0471170822,

Duggleby, R. G. (1983). Determination of the kinetic properties of enzymes catalysing

Duggleby, R. G. (1994). Analysis of progress curves for enzyme-catalyzed reactions:

Fresht, A. (1985). *Enzyme structure and Mechanism* (2nd Ed.), ISBN 978-0716716143, Freeman

Ganzhorn, A.J., Green, D.W., Hershey, A.D., Gould, R.M., Plapp, B.V. (1987). Kinetic

Guilbault, G. G. (1976). *Handbook of enzymatic methods of analysis*, ISBN 978-0824764258,

306, ISSN 1470-8175

University Press, London, UK

*0264-6021*

0010-4809

York, USA

1117, ISSN 0003-2654

Wiley-Interscience; New York, USA

no.1, (May 1985), pp. 55-60, ISSN 0264-6021

pp. 268-274, ISSN 0304-4165

1987), pp. 3754-3761, ISSN 0021-9258

Marcel Dekker, New York, USA

WH, New York, USA

*Biochemistry and Molecular Biology Education*, Vol.28, no.6, (November 2000), pp. 304-

velocities. *The Biochemical Journal*, vol. 149, no.2, (August 1975), pp. 305-312, ISSN

derivatives of progress curves in the case of unknown experimental error. *Computers and Biomedical Research*, Vol.23, no. 5, (October 1990), pp. 490-498, ISSN

Determination of initial velocities of enzymic reactions from progress curves. *The Biochemical Journal*, vol.237, no.3, (August 1986), pp. 821-825, ISSN 0264-6021 del Rio F.J., Riu, J. & Rius, F. X. (2001). Robust linear regression taking into account errors in

the predictor and response variables. *Analyst*, vol. 126, no. , (July 2001), pp. 1113–

enzymatic determination of uric acid, with equilibrium and kinetic data-processing options. *Clinical Chemistry*, vol. 32, no.3, (May 1986), pp. 486-491, ISSN 0009-9147 Dixon, M.C. & Webb, EC. (1979). *Enzymes* (3rd ed.), ISBN 0122183584, Academic Press, New

coupled reaction sequences. *Biochimica et Biophysica Acta (BBA) - Protein Structure and Molecular Enzymology*, Vol.744, no. 3, (May 1983), pp. 249-259, ISSN 0167-4838 Duggleby, R.G. (1985). Estimation of the initial velocity of enzyme-catalysed reactions by

non-linear regression analysis of progress curves. *The Biochemical Journal*, vol. 228,

application to unstable enzymes, coupled reactions, and transient state kinetics. *Biochimica et Biophysica Acta (BBA) – General subjects*, vol. 1205, no.2, (April 1994),

characterization of yeast alcohol dehydrogenases. Amino acid residue 294 and substrate specificity. The Journal of Biological chemistry,vol.262, no.8, (March


Kinetic Analyses of Enzyme Reaction Curves

181963-7, New York, USA

1976), pp. 417-421, ISSN 0009-9147

(March 2011), pp.431-439, ISSN 0306-7319

(June 2006), pp.300-303, ISSN 1671-8259

pp.1-6, ISSN 0026-3672

ISSN 0003-2697

2647

4165.

ISSN0910-6340

(December 1974), pp. 779–781. ISSN 0264-6021

No. 2, (July 1983), pp. 457–61, ISSN 0003-2697

with New Integrated Rate Equations and Applications 183

Northrop, D. B. (1983). Fitting enzyme-kinetic data to V/K. *Analytical Biochemistry*, vol. 132,

Orsi, B.A. & Tipton, K. F. (1979). Kinetic analysis of progress curves. In: *Methods in* 

Priest, D.G. & Pitts, O.M. (1972). Reaction intermediate effects on the spectrophotometric

Stromme,J.H. & Theodorsen, L. (1976). Gamma-glutamyltransferase: Substrate inhibition,

Varon, R., Garrido-del Solo, C., Garcia-Moreno, M., Garcoa-Canovas, F., Moya-Garcia, G.,

Walsh, R., Martin, E., Darvesh, S. (2010). A method to describe enzyme-catalyzed reactions

Yang, D., Tang, J., Yang, X., Deng, P., Zhao, Y., Zhu, S., Xie, Y., Dai, X., Liao, H., Yu, M.,

Yang, X. L., Liu, B.Z., Sang,Y., Yuan, Y.H., Pu, J., Liu,Y., Li, Z.R., Feng,J., Xie, Y.L, Tang, R.

Zhang, C., Yang, X.L., Feng,J., Yuan, Y.H., Li,X., Bu, Y.Q., Xie, Y.L., Yuan, H.D. & Liao, F.

2010; vol.74, no.6, (June 2010), pp. 1298-1301, 0916-8451. ISSN 0916-8451. Zhao, L.N., Tao, J., Zhao, Y.S., Liao, F. (2006). Quantification of reduced glutathione by

Zhao,Y.S., Yang, X.Y., Lu,W., Liao,H. & Liao, F. (2009). Uricase based method for

Zhao,Y.S., Zhao,L.N., Yang,G.Q., Tao, J., Bu, Y.Q. & Liao, F. (2006). Characterization of an

integrated Michaelis–Menten equation. *The Biochemical Journal*, vol. 143, no. 3,

*Enzymology*, vol. 63, D. L. Purich, (Ed.), 159-183, Academic Press, ISBN 978-0-12-

uricase assay. *Analytical Biochemistry*, vol.50, no.1, (November 1972), pp. 195-205,

kinetic mechanism, and assay conditions. *Clinical Chemistry*, vol. 22, no.4, (April

Vidal de Labra, J., Havsteen BH. (1998). Kinetics of enzyme systems with unstable suicide substrates. *Biosystems*, vol. 47, no.3, (August 1998), pp.177-192, ISSN 0303-

by combining steady state and time course enzyme kinetic parameters. *Biochimica et Biophysica Acta-General Subjects*, vol.1800, no.1, (October 2009), pp1-5, ISSN 0304-

Liao, J. & Liao, F. (2011). An integration strategy to measure enzyme activities for detecting irreversible inhibitors with dimethoate on butyrylcholinesterase as model. *International Journal of Environmental Analytical Chemistry*, vol.91, no.5,

K., Yuan, H.D. & Liao, F. (2010). Kinetic analysis of lactate-dehydrogenase-coupled reaction process and measurement of alanine transaminase by an integration strategy. *Analytical Sciences*, vol.26, no. 11, (November 2010), pp. 1193-1198,

(2010). Effects of modification of amino groups with poly(ethylene glycol) on a recombinant uricase from Bacillus fastidiosus. *Bioscience Biotechnology Biochemistry*,

analyzing glutathine-S-transferase reaction process taking into account of product inhibition. Journal of Xi'an Jiaotong University (Medical Sciences), vol. 27, no.3,

determination of uric acid in serum. *Microchimica Acta*, vol. 164, no.1, (May 2008),

intracellular uricase from Bacillus fastidious ATCC 26904 and its application to


Liao, F., Zhao, Y.S., Zhao, L.N., Tao, J., Zhu, X.Y., Wang, Y.M., Zuo, Y.P. (2005b). Kinetic

Liao, F., Zhu, X.Y., Wang, Y.M., Zhao, Y.S, Zhu, L.P., Zuo, Y.P. (2007b). Correlation of serum

*University Science B*, vol. 8, no.4, (April 2007), pp.237-241, ISSN 1673-1581 Liao, F., Zhu,X.Y., Wang, Y.M., Zuo, Y.P. (2005a). The comparison on the estimation of

Liu, B.Z., Zhao, Y.S., Zhao, L.N., Xie, Y.L., Zhu,S., Li,Z.R., Liu,Y., Lu,W., Yang,X.L., Xie,

Liu, M., Yang, X.L., Yuan, Y.H., Tao, J. & Liao, F. (2011). PCFenzyme for kinetic analyses of

Lu, W.P. & Fei, L. (2003). A logarithmic approximation to initial rates of enzyme reactions. *Analytical Biochemistry*, vol. 316, no. 1, (May 2003), pp.58-65, ISSN 0003-2697 Marangoni, A. G. (2003). *Enzyme kinetics:a modern approach*, ISBN 978-0471159858, Wiley-

Mey1er-Almes, F.J. & Auer, M. (2000). Enzyme inhibition assay using fluorescence

*Biochemistry*, vol. 39, no.43 (October 2000), pp. 13261–13268, ISSN 0006-2960 Miller, J. C. & Miller, J. N. (1993). *Statistics for analytical chemistry* (3rd), ISBN 978-0130309907,

Morishita, Y., Iinuma, Y., Nakashima, N., Majima, K., Mizuguchi, K. & Kawamura, Y. (2000).

Moruno-Davila, M.A., Solo, C.G., GarcIa-Moreno, M., GarcIa-CAnovas, F. & Varon, R.

Moss, D.W. (1980). Methodological principles in the enzymatic determination of substrates

Newman, P.F.J., Atkins, G.L. & Nimmo, I. A. (1974). The effects of systematic error on the

no.1, (January 2005), pp. 13-24, ISSN 0165-022X

ISSN 1000-1948

22-28. ISSN 0003-2670

pp.582-587, ISSN 1878-0296

Interscience, New York, USA

ISSN 0009-9147

ISSN 0303-2647

Ellis Horwood, Chichester, New York, USA

(August 1980), pp. 351-360, ISSN 0009-8981

method for enzymatic analysis by predicting background with uricase reaction as model. *Journal of Medical Colleges of PLA*, vol.20, no.6, (Deember 2005), pp. 338-344,

arylesterase activity on phenylacetate estimated by the integrated method to common classical biochemical indexes of liver damage. *Journal of Zhejiang* 

kinetic parameters by fitting enzyme reaction curve to the integrated rate equation of different predictor variables. *Journal of Biochemical Biophysical Methods*, vol. 62,

G.M., Zhong, H.S., Yu, M.A., Liao,H. & Liao, F. (2009). An integration strategy to estimate the initial rates of enzyme reactions with much expanded linear ranges using uricases as models. *Analytica Chimica Acta*, vol.631, no.1, (October 2008), pp.

enzyme reaction processes. *Procedia Environmental Sciences*, vol. 8, (December 2011),

correlation spectroscopy: a new algorithm for the derivation of Kcat/KM.and Ki values at substrate concentration much lower than the Michaelis constant.

Total and pancreatic amylase measured with 2-chloro-4-nitrophenyl-4-*O*-ß-Dgalactopyranosylmaltoside. *Clinical Chemistry*, vol. 46, no.7, (July 2000), pp. 928-933,

(2001). Kinetic analysis of enzyme systems with suicide substrate in the presence of a reversible, uncompetitive inhibitor. *Biosystems*, vol. 61, no.1, (June 2001), pp.5-14,

illustrated by the measurement of uric acid. *Clinica Chimica Acta*, Vol. 105, no. 3,

accuracy of Michaelis constant and maximum velocities estimated by using the

integrated Michaelis–Menten equation. *The Biochemical Journal*, vol. 143, no. 3, (December 1974), pp. 779–781. ISSN 0264-6021



### **Chemometric Study on Molecules with Anticancer Properties**

João Elias Vidueira Ferreira<sup>1</sup>, Antonio Florêncio de Figueiredo<sup>2</sup>, Jardel Pinto Barbosa<sup>3</sup> and José Ciríaco Pinheiro<sup>3</sup>

<sup>1</sup>*Universidade do Estado do Pará*
<sup>2</sup>*Instituto Federal de Educação, Ciência e Tecnologia do Pará*
<sup>3</sup>*Laboratório de Química Teórica e Computacional, Universidade Federal do Pará*
*Brasil*

#### **1. Introduction**


Cancer is a class of diseases characterized by the uncontrolled growth of abnormal cells in an organism. All over the world, millions of people die every year from one of the different types of cancer. Unfortunately, cancer chemotherapy faces a serious limitation, since treatment with drugs is followed by drug resistance in the tumor cells and by side effects (Efferth, 2005). Research has therefore been directed at making chemotherapy treatment more efficient.

In recent years the literature has reported research on natural products as a good strategy for discovering new chemotherapy agents. One plant that has shown anticancer properties is *Artemisia annua L.* (*qinghao*). It contains the active ingredient artemisinin, which is used as an antimalarial. Artemisinin and its derivatives have excellent efficacy against multidrug-resistant strains of *P. falciparum* and are very well tolerated (Price et al., 1998). Recently, the sensitivity to artemisinin has been evaluated in some tumor cells. Studies suggest that artemisinin is more toxic to cancerous cells than to normal cells, giving a new perspective in cancer therapy (Lai et al., 2009).

However ... this book is about chemometrics, so what does chemometrics have to do with cancer chemotherapy? Well ... understanding how these two different areas can be related to one another is the purpose of this chapter. Just keep on reading and you will see the many ways chemometrics can be employed to investigate the "behavior" molecules exhibit with respect to anticancer activity, and to make predictions about drugs that have not yet been tested. The potential application of chemometrics to analytical data arising from problems in biology and medicine is enormous and, in fact, the applications of chemometrics have diversified substantially over the last few years (Brereton, 2007; 2009). At the end of the chapter you will note that, as in many areas of research, chemometrics fortunately plays an important role in medicinal chemistry.

Firstly, it is necessary to remember that producing a drug is something that takes time and money, so the process must be rationalized! However, in the past, drugs were discovered by synthesizing many molecules, often without rigorous criteria, and testing all of them experimentally to evaluate their capacity to cure the disease, or at least to control it. But over time this methodology became more and more inadequate, for the more new compounds are studied, the less likely it is that a new compound will be discovered to be potent against a disease. It has long been desired to design active structures on the basis of logic and calculations, not relying on chance or trial-and-error (Fujita, 1995).

Nowadays, in science, there is a basic assumption that molecular properties and structural characteristics are closely connected to the biological functions of compounds. It is often assumed that compounds with similar properties and structures also display similar biological responses. Chemical structure encodes a large amount of information explaining why a certain molecule is active, toxic or insoluble (Rajarshi, 2008). Thus, to understand the mechanism of action of a drug it is necessary to interpret the role played by its molecular and structural properties.

In recent decades, much scientific research has focused on how to capture and convert the information encoded in a molecular structure into one or more numbers used to establish quantitative relationships between structures and properties, biological activities or other experimental properties (Puzyn et al., 2010). Quantitative structure-activity relationship (QSAR) studies have been of great value in medicinal chemistry. Statistical tools can be used to predict the biological activities of new compounds based only on knowledge of their chemical structures, i.e., without depending on experimental data, which are unknown. Such a strategy gives very useful information for understanding the mechanisms of action of drugs and for proposing syntheses, in this way rationalizing drug discovery. QSAR is alive and well (Doweyko, 2008); that is, QSAR has been used with success and is still relevant today.
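In its simplest form, a QSAR model is just a regression of an activity measure on a few molecular descriptors. The sketch below fits a multiple linear regression by ordinary least squares on synthetic data; it is an illustration of the idea only (the descriptor values and coefficients are made up), not a model from this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n_compounds, n_descriptors = 25, 3
X = rng.normal(size=(n_compounds, n_descriptors))        # descriptor matrix
true_coef = np.array([1.5, -2.0, 0.5])                   # hypothetical true effects
y = X @ true_coef + 0.01 * rng.normal(size=n_compounds)  # activity, e.g. log(1/IC50)

# Fit y ~ intercept + descriptors by ordinary least squares.
A = np.column_stack([np.ones(n_compounds), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the activity of a new, untested compound from its descriptors alone.
x_new = np.array([1.0, 0.2, -0.3])
y_pred = coef[0] + x_new @ coef[1:]
```

This is the core promise of QSAR stated in the paragraph above: once the coefficients are estimated from compounds with measured activities, a prediction for an untested compound needs only its computed descriptors.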

Moreover, advances in computation have brought software that makes it possible to obtain many different types of information (descriptors) about molecules. Consequently, data gathered through experiments and computations can produce a huge matrix whose elements are information related to the molecules. But it seems that analyzing all of them would require infinite patience!

What to do?

Chemometrics has the solution!

That is true because chemometrics is the art of extracting chemically relevant information from data produced in chemical experiments (Wold, 1995). Most people only think of statistics when faced with a lot of quantitative information to process (Bruns et al., 2006). In this text we show a common and efficient methodology used in medicinal chemistry to rationalize the process of producing a new drug by employing chemometric methods. We present a molecular modeling and chemometric study of 25 artemisinins, comprising artemisinin and derivatives (training set, Fig. 1) with different degrees of cytotoxicity against human hepatocellular carcinoma HepG2 (Liu et al., 2005), since among the malignant tumors of the liver, hepatocellular carcinoma is very common. The literature has shown the application of the methodology described here to investigate biological properties (antimalarial and anticancer) of artemisinin and derivatives (Barbosa et al., 2011; Cardoso et al., 2008; Pinheiro et al., 2003).

#### **2. Methodology**

Any chemometric study requires data. In this study, data are obtained from molecular descriptors calculated computationally. The starting point is the molecular modeling step, which consists of the construction of the structures and the complete optimization of their geometries through a quantum-chemistry approach implemented on a computer. This is necessary to represent the molecules as realistically as possible and thus to compute their molecular descriptors. The B3LYP/6-31G** method (Levine, 1991) as implemented in the Gaussian 98 program was employed (Frisch et al., 1998), considering that this strategy is suitable for optimizing all the structures well, since a good description of the geometrical parameters of artemisinin is achieved.
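For orientation, a B3LYP/6-31G** geometry optimization corresponds to a Gaussian input file whose route section requests `Opt` at that level of theory. The fragment below is only an illustrative sketch of the input-file format; the checkpoint name, title and coordinates are placeholders, not an actual artemisinin geometry.

```
%chk=artemisinin_derivative.chk
# B3LYP/6-31G** Opt

Geometry optimization (illustrative placeholder input)

0 1
C    0.000000    0.000000    0.000000
O    0.000000    0.000000    1.430000
...
```

The blank-line-separated sections (route, title, charge/multiplicity plus coordinates) are the standard Gaussian input layout; in practice the starting geometry would come from a molecular editor or a lower-level pre-optimization.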

The 25 compounds investigated include artemisinin, amides, esters, alcohols, ketones, and five-membered ring derivatives. All compounds have been associated with their in vitro bioactivity against a human hepatocellular carcinoma cell line, HepG2, and were labeled beforehand into two classes according to their activities: (-) less active (those with IC50 ≥ 97 *μ*M) and (+) more active (those with IC50 < 97 *μ*M) derivatives. The criterion for choosing this value of IC50 is rather subjective. Nevertheless, it is convenient to note that 97 *μ*M is the IC50 for artemisinin, and the higher the IC50, the less active the compound.
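The class assignment just described amounts to a simple threshold rule on IC50. A minimal sketch, with ">100" (an IC50 beyond the measured range) treated as a large value for comparison; the example activities are taken from Table 1:

```python
# Label compounds as more active (+) or less active (-) relative to the
# IC50 of artemisinin itself (97 uM).
def label_activity(ic50, threshold=97.0):
    """Return '+' for IC50 < threshold (more active), '-' otherwise."""
    value = float(ic50.lstrip(">")) if isinstance(ic50, str) else float(ic50)
    return "+" if value < threshold else "-"

# A few training-set activities (uM), as listed in Table 1:
ic50_values = {1: "97", 4: "9.5", 7: "0.46", 10: ">100", 12: ">100"}
labels = {k: label_activity(v) for k, v in ic50_values.items()}
```

Note that artemisinin itself (compound 1, IC50 = 97 *μ*M) falls on the boundary and is labeled less active, consistent with the "(-) for IC50 ≥ 97 *μ*M" convention above.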

After molecular modeling, 1700 descriptors (independent variables) were computed for each molecule in the training set. They represent different sources of chemical information (features) regarding the molecules and include geometric, electronic, quantum-chemical, physical-chemical, topological and other descriptors. They are assumed to be important for understanding molecular characteristics such as bioactivity against cancer. In fact, one of the purposes of a study like this is to find which descriptors of the molecules are best related to the disease under study, in this example cancer. The software used to compute these descriptors were e-Dragon (Virtual Computational Laboratory, 2010), a product of the Virtual Computational Laboratory, and Gaussian 98 (Frisch et al., 1998).

Fig. 1. Artemisinin and derivatives (training set) with different degrees of cytotoxicities against human hepatocellular carcinoma HepG2

However, a crucial point to be considered in any data analysis is preprocessing. The original data matrix usually does not have optimal value distribution for the analysis (for example

Furthermore, given the large quantity of multivariate data available, it was necessary to reduce the number of variables. Thus, if two any descriptors had a high Pearson correlation coefficient (r > 0.8), one of the two was randomly excluded from the matrix, since theoretically they describe the same property to be modeled (biological response). Therefore it is sufficient to use only one of them as an independent variable in a predictive model (Ferreira, 2002). Moreover those descriptors that showed the same values for most of the samples were

Chemometric Study on Molecules with Anticancer Properties 189

**Compound** *IC5 Mor29m O1 MlogP* **Activity** 1 4.862 -0.305 -0.246 2.845 97 2 5.253 -0.308 -0.200 2.630 >100 3 5.389 -0.372 -0.202 3.080 >100 4 5.628 -0.445 -0.194 4.845 9.5 5 5.684 -0.474 -0.205 5.250 2.8 6 5.624 -0.525 -0.214 5.644 1.2 7 5.501 -0.514 -0.211 6.027 0.46 8 5.364 -0.518 -0.191 6.400 0.79 9 5.225 -0.501 -0.210 6.765 4.2 10 5.217 -0.236 -0.205 3.036 >100 11 5.597 -0.526 -0.218 6.050 0.72 12 5.197 -0.179 -0.225 7.171 >100 13 5.253 -0.364 -0.246 3.141 >100 14 5.253 -0.322 -0.237 3.141 >100 15 5.159 -0.294 -0.259 7.095 >100 16 5.159 -0.232 -0.258 7.095 >100 17 5.180 -0.443 -0.219 2.996 >100 18 5.168 -0.307 -0.209 7.131 >100 19 5.624 -0.485 -0.186 5.644 1.8 20 5.856 -0.518 -0.218 3.941 3.5 21 5.543 -0.562 -0.344 5.449 1.3 22 5.419 -0.560 -0.320 5.837 0.77 23 5.280 -0.591 -0.281 6.215 0.74 24 5.516 -0.498 -0.269 5.855 3.7 25 5.488 -0.545 -0.273 5.815 0.47

Mean 5.378 -0.425 -0.234 5.164 Stardard Deviation 0.225 0.121 0.040 1.570 Table 1. Values of the four descriptors selected through PCA for compounds from the

After this step, PCA was performed in order to continue reducing the dimensionality of the data, find descriptors that could be useful in characterizing the behavior of the compounds acting against cancer and look for natural clustering in the data and outlier samples. While processing PCA, several attempts to obtain a good classification of the compounds are made. At each attempt, one or more variables are removed, PCA is run and the score and loading

The score plot gives information about the compounds (similarities and differences). The loading plot gives information about the variables (how they are connected to each other and


Because the data matrix has different units and variances across its variables, some pretreatment is required prior to multivariate analysis. In general, autoscale preprocessing, which results in scaled variables with zero mean and unit variance, is used (Ferreira, 2002). All variables were therefore autoscaled, so that they were standardized and carried the same importance regardless of scale.
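Autoscaling is straightforward to express; the sketch below (our own helper, in Python/NumPy, not the authors' software) also returns the training mean and standard deviation, so that a test set can later be scaled with the same statistics, as the classification of new compounds requires:

```python
import numpy as np

def autoscale(X, mean=None, std=None):
    """Column-wise autoscaling: zero mean and unit variance per variable.
    Pass a previously computed mean/std to scale new data consistently."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0, ddof=1)        # sample standard deviation
    return (X - mean) / std, mean, std
```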

The next step consists of the application of multivariate statistical methods to find key features involving the molecules, the descriptors and the anticancer activity. The methods include principal component analysis (PCA), hierarchical cluster analysis (HCA), the K-nearest neighbor method (KNN), the soft independent modeling of class analogy method (SIMCA) and stepwise discriminant analysis (SDA). The analyses were performed on a data matrix with dimensions of 25 rows (molecules) x 1700 columns (descriptors), not shown for convenience. For further study of the methodology applied, standard books are available, such as (Varmuza & Filzmoser, 2009) and (Manly, 2004).

#### **2.1 PCA**

Suppose that in your study, as in the example presented in this chapter, you have a large set of data: it will not be a simple task to analyze so many variables and extract useful information from them. It would be a "revolution" in your research if you could confidently interpret all the data in a simpler way. Fortunately, with the aid of PCA, this "revolution" can happen: PCA reduces the total number of variables to a smaller set while retaining as much of the original information as possible. Whatever your area of research, this is a great advantage.

Fig. 2. Plot of PC1-PC2 scores for artemisinin and derivatives (training set) with activity against human hepatocellular carcinoma HepG2. More active compounds appear on the left side (plus signs), less active ones on the right side (minus signs)

Turning now to our data matrix, PCA was employed to look for a small group of descriptors that, by themselves, could classify all 25 samples into two distinct classes: more active and less active. In addition, it is desirable to choose uncorrelated descriptors, which are easier to interpret and analyze, and to try to associate them with the cytotoxicities against human hepatocellular carcinoma HepG2.
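A minimal PCA can be written directly with the singular value decomposition. This is a generic sketch (not the chemometrics package used in the chapter) returning scores, loadings and the fraction of variance explained by each component:

```python
import numpy as np

def pca(Z, n_components=2):
    """PCA of an autoscaled matrix Z via SVD: scores = U * s,
    loadings = columns of V, explained variance proportional to s**2."""
    Z = np.asarray(Z, dtype=float)
    U, s, Vt = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives a PC1-PC2 score plot like Fig. 2; the corresponding loadings give a plot like Fig. 3.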





Depending on the results displayed by the plots, variables remain removed from, or are returned to, the data matrix. If the removal of a variable helps to separate the compounds shown in the score plot into the two classes (more and less active), then in the next attempt PCA is run without this variable. If no improvement is achieved, the removed variable is reinserted into the data matrix, another variable is selected for removal, and PCA is run again. The loadings plot gives good clues about which variables should be excluded: variables that lie very close to one another are correlated and, as stated before, only one of them needs to remain.
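This trial-and-error loop can be automated once a numeric criterion replaces visual inspection of the score plot. The sketch below is entirely our own construction: it measures two-class separation along PC1 and greedily removes whichever variable improves it most:

```python
import numpy as np

def pc1_separation(Z, classes):
    """Gap between the two class means along PC1, over the pooled spread."""
    U, s, _ = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    t1 = U[:, 0] * s[0]                    # PC1 scores
    a, b = t1[classes == 0], t1[classes == 1]
    return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-12)

def greedy_removal(Z, classes, names, n_min=2):
    """Repeatedly drop the variable whose removal most improves the
    PC1 class separation; stop when no removal helps."""
    cols = list(range(Z.shape[1]))
    best = pc1_separation(Z[:, cols], classes)
    while len(cols) > n_min:
        trials = [(pc1_separation(Z[:, [c for c in cols if c != j]], classes), j)
                  for j in cols]
        sep, j = max(trials)
        if sep <= best:
            break
        best, cols = sep, [c for c in cols if c != j]
    return [names[c] for c in cols], best
```

Such a criterion only automates the mechanics; domain knowledge of the kind discussed below remains essential.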

This methodology comprises part of the art of variable selection: patience and intuition are the fundamental tools here. It goes without saying that the more you know about the system you are investigating (the samples, the variables, and how they are connected), the more success you can have in finding the variables that really matter to your investigation. Variable selection does not happen like magic; at least, not always!

The descriptors selected by PCA were *IC5*, *Mor29m*, *O1* and *MlogP*, which represent four distinct types of interactions related to the molecules, especially between the molecules and the biological receptor. These descriptors are classified as steric (*IC5*), 3D-MoRSE (*Mor29m*), electronic (*O1*) and molecular (*MlogP*). The main properties of a drug that appear to influence its activity are its lipophilicity, the electronic effects within the molecule, and the size and shape of the molecule (steric effects) (Gareth, 2003).

Fig. 3. Plot of the PC1-PC2 loadings for the four descriptors selected through PCA

The PCA results show the score plot (Fig. 2) for the first and second principal components. Along PC1 there is a distinct separation of the compounds into two classes: more active compounds are on the left side, less active ones on the right. The four descriptors were chosen from the whole data set (1700 descriptors) and are assumed to be very important for investigating the anticancer mechanism of the artemisinins. Table 1 displays the values computed for these four descriptors. This step was crucial, since a matrix with 1700 columns was reduced to only 4 columns; no doubt it is more convenient to deal with a smaller matrix. The first three principal components, PC1, PC2 and PC3, explained 43.6%, 28.7% and 20.9% of the total variance, respectively. The Pearson correlation coefficients between the variables are in general low (less than 0.25 in absolute value); the exception is the correlation between *Mor29m* and *IC5*, which is -0.65.

Fig. 4. HCA dendrogram for artemisinin and derivatives (training set) with biological activity against human hepatocellular carcinoma HepG2. Plus signs mark more active compounds, minus signs less active ones

Fig. 5. Cluster **A**




Fig. 6. Cluster **B**

The loading plot for the first and second principal components can be seen in Fig. 3. PC1 and PC2 are expressed in Equations 1 and 2, respectively, as functions of the four selected descriptors. They represent quantitative variables that provide the overall predictive ability of the different sets of molecular descriptors selected.

| Compound | K1 | K2 | K3 | K4 | K5 | K6 |
|---|---|---|---|---|---|---|
| 1 | - | - | - | - | - | - |
| 2 | - | - | - | - | - | - |
| 3 | - | - | - | - | - | - |
| 4 | + | + | + | + | + | + |
| 5 | + | + | + | + | + | + |
| 6 | + | + | + | + | + | + |
| 7 | + | + | + | + | + | + |
| 8 | + | + | + | + | + | + |
| 9 | + | + | + | + | + | + |
| 10 | - | - | - | - | - | - |
| 11 | + | + | + | + | + | + |
| 12 | - | - | - | - | - | - |
| 13 | - | - | - | - | - | - |
| 14 | - | - | - | - | - | - |
| 15 | - | - | - | - | - | - |
| 16 | - | - | - | - | - | - |
| 17 | - | - | - | - | - | - |
| 18 | - | - | - | - | - | - |
| 19 | + | + | + | + | + | + |
| 20 | + | + | + | + | + | + |
| 21 | + | + | + | + | + | + |
| 22 | + | + | + | + | + | + |
| 23 | + | + | + | + | + | + |
| 24 | + | + | + | + | + | + |
| 25 | + | + | + | + | + | + |

Table 2. Classification of compounds from the training set according to KNN method

Fig. 9. Variations in descriptors: a) Variations in *IC5* for each cluster; b) Variations in *Mor29m* for each cluster; c) Variations in *O1* for each cluster; d) Variations in *MlogP* for each cluster

In this work, classification through HCA was based on the Euclidean distance and the average group method, which establishes the links between samples and clusters: the distance between two clusters is computed as the distance between the average values (the mean vectors, or centroids) of the two clusters. The descriptors employed in HCA were the same as those selected in PCA, that is, *IC5*, *Mor29m*, *O1* and *MlogP*.

Fig. 8. Cluster **D**

In Equation 1 the loadings of *IC5* and *MlogP* are negative, whereas those of *Mor29m* and *O1* are positive. Among them, *IC5* and *Mor29m* are the most important to PC1, owing to the magnitude of their coefficients (-0.613 and 0.687, respectively) in comparison with *O1* and *MlogP* (0.234 and -0.313, respectively). For a compound to be more active against cancer, it must generally have a negative value of PC1; that is, it must present high values for *IC5* and *MlogP*, but more negative values for *Mor29m* and *O1*.

$$PC1 = -0.613 \text{IC5} + 0.687 \text{Mor} \\ 29m + 0.234 \text{O1} - 0.313 \text{MlogP} \tag{1}$$

$$PC2 = -0.445IC5 + 0.081Mor29m - 0.743O1 + 0.493MlogP \tag{2}$$
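Given autoscaled descriptor values, Equations 1 and 2 are evaluated directly. The helper below simply transcribes the published loadings (the argument values used in testing are illustrative, not taken from Table 1):

```python
def pc_scores(ic5, mor29m, o1, mlogp):
    """PC1 and PC2 of one compound from autoscaled descriptor values,
    using the loadings of Equations 1 and 2."""
    pc1 = -0.613 * ic5 + 0.687 * mor29m + 0.234 * o1 - 0.313 * mlogp
    pc2 = -0.445 * ic5 + 0.081 * mor29m - 0.743 * o1 + 0.493 * mlogp
    return pc1, pc2
```

A compound with high *IC5* and *MlogP* but low *Mor29m* and *O1* gets a negative PC1 and therefore falls on the more-active side of Fig. 2.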

#### **2.2 HCA**

Considering the need to group molecules of a similar kind into their respective categories (more and less active), HCA is suitable for this purpose, since it makes it possible to visualize the disposition of the molecules with respect to their similarities and thus to form hypotheses about how they may act against the disease. Many approaches are available when performing HCA; they differ basically in the way samples are grouped.
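The grouping procedure used here (Euclidean distance, clusters linked through their centroids) can be sketched as a toy agglomerative routine; production work would use a full linkage implementation such as SciPy's:

```python
import numpy as np

def hca_centroid(X, n_clusters=2):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    centroids are closest in Euclidean distance."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(axis=0)
                                   - X[clusters[b]].mean(axis=0))
                if best is None or d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]         # merge b into a
        del clusters[b]
    return clusters
```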

Fig. 7. Cluster **C**



| Compound | Class | Compound | Class |
|---|---|---|---|
| 1 | - | 14 | - |
| 2 | - | 15 | - |
| 3 | - | 16 | - |
| 4 | + | 17 | - |
| 5 | + | 18 | - |
| 6 | + | 19 | + |
| 7 | + | 20 | + |
| 8 | + | 21 | + |
| 9 | - | 22 | + |
| 10 | - | 23 | + |
| 11 | + | 24 | + |
| 12 | - | 25 | + |
| 13 | - | | |

Table 4. Classification of compounds from the training set according to SIMCA method

| Group or class | Number of compounds | True group: more active | True group: less active |
|---|---|---|---|
| Less active | 11 | 0 | 11 |
| More active | 14 | 14 | 0 |
| Total | 25 | | |
| % correct information | | 100 | 100 |

Table 5. Classification matrix obtained by using SDA

The clustering results are represented by the dendrogram in Fig. 4, which depicts the similarity of the samples. The branches at the bottom of the dendrogram represent single samples, and the length of the branches linking two clusters is related to their similarity: long branches indicate low similarity, short branches high similarity. On the similarity scale, a value of 100 is assigned to identical samples and a value of 0 to the most dissimilar ones. For a better interpretation of the dendrogram, the clusters are also analyzed individually (Figs. 5, 6, 7 and 8), and the variations of the descriptors within each cluster are presented in Fig. 9. The scale above each panel is associated with the property considered, and the letters indicate the cluster in the dendrogram. It is easily recognized that the descriptors generally show different patterns of variation in the different clusters, consistent with the fact that the clusters contain different groups of molecules.


| Group or class | Number of compounds | K1 | K2 | K3 | K4 | K5 | K6 |
|---|---|---|---|---|---|---|---|
| Less active | 11 | 0 | 0 | 0 | 0 | 0 | 0 |
| More active | 14 | 0 | 0 | 0 | 0 | 0 | 0 |
| % correct information | 25 | 100 | 100 | 100 | 100 | 100 | 100 |

Table 3. Classification matrix obtained by using KNN. The K1-K6 columns give the number of compounds wrongly classified at each value of k

The dendrogram shows the compounds classified into two different classes according to their activities, with no sample incorrectly classified. Less active compounds are on the left side and are divided into clusters **A** (Fig. 5) and **B** (Fig. 6). In cluster **A** the substituents have either C2H5 (**2**, **10**, **13**, **14** and **17**) or C4H9 (**3**); here the lowest values for *IC5* (Fig. 9a) and *MlogP* (Fig. 9d) are found. In cluster **B** (**12**, **15**, **16** and **18**) all substituents have C18H37, and the highest values for *MlogP* are present (Fig. 9d). Considering the more active samples, on the right side of the figure, the compounds in cluster **C** (Fig. 7) have an amide group (the exceptions are **11**, an ester, and **19**, a ketone) with an alkyl chain of 8 to 18 carbon atoms attached to it; here the descriptor *IC5* displays its highest values (Fig. 9a). In cluster **D** (Fig. 8) the substituents have an alkyl chain of 12 to 16 carbon atoms, and the six-membered ring molecules with oxygen O11 are replaced by five-membered ring molecules; these compounds display the lowest values for *Mor29m* (Fig. 9b) and *O1* (Fig. 9c).

Besides these two methods of classification (PCA and HCA), others (KNN, SIMCA and SDA) were applied to the data. They are important for constructing reliable models with which to classify new compounds (the test set) with respect to their activity against cancer, which is certainly the ultimate purpose of much research aimed at planning a new drug.

#### **2.3 KNN**

This method categorizes an unknown object based on its proximity to samples already placed in categories. After the model is built, compounds from the test set are classified and their classes predicted, taking into account the multivariate distance of each compound with respect to the K nearest samples in the training set. The KNN model built in this example employs the leave-one-out method, a maximum k value of 6, and autoscaled data. Table 2 shows the classification of each sample at each value of k. The column number corresponds to the k setting, so the first column of this matrix holds the class of each training-set sample when only one neighbor (the nearest) is polled, whereas the last column holds the class of each sample when the kmax nearest neighbors are polled. Tables 2 and 3 summarize the results of the KNN analysis: at all six values of k, every sample was classified correctly.
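A leave-one-out KNN run of this kind is easy to reproduce in outline (our own sketch, with Euclidean distance and majority vote as described above):

```python
import numpy as np

def knn_loo(X, y, k=1):
    """Predict every sample from its k nearest neighbours among the
    remaining samples (leave-one-out), by majority vote."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    preds = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # leave sample i out
        votes = list(y[np.argsort(d)[:k]])
        preds.append(max(set(votes), key=votes.count))
    return np.array(preds)
```

Running it for k = 1 through 6 and counting disagreements with the known classes produces a classification matrix like Table 3.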

#### **2.4 SIMCA**


The SIMCA method develops principal component models for each training set category. The main goal is the reliable classification of new samples. When a prediction is made in SIMCA, new samples insufficiently close to the PC space of a class are considered non-members. Table 4 shows classification for compounds from the training set. Here sample **9** was classified incorrectly since its activity is 4.2 (more active) but it is classified by SIMCA as less active.
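In outline, SIMCA fits one PCA model per class and assigns a new sample by its distance to each class subspace. The bare-bones sketch below keeps one principal component per class and omits the critical-distance statistics of a full SIMCA implementation:

```python
import numpy as np

def simca_fit(X, n_pc=1):
    """Per-class PC model: the class mean and the first n_pc loadings."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pc]

def simca_residual(x, model):
    """Orthogonal distance from sample x to the class PC subspace."""
    mean, P = model
    r = np.asarray(x, dtype=float) - mean
    return np.linalg.norm(r - P.T @ (P @ r))

def simca_classify(x, models):
    """Assign x to the class whose PC model it lies closest to."""
    return min(models, key=lambda c: simca_residual(x, models[c]))
```

A full implementation would also flag samples too far from every class model as non-members, as described above.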



Probably the reason for this misclassification lies in the fact that compound **9** may not be "well grouped" into either of the two classes. In fact, when you analyze Fig. 2 you note that **9** is the compound classified as more active that lies closest to the compounds classified as less active.



#### **2.5 SDA**

SDA is also a multivariate method that attempts to maximize the probability of correct allocation. The main objectives of SDA are to separate objects from distinct populations and to allocate new objects into populations previously defined.

The discrimination functions for less active and more active classes are, respectively, Equations 3 and 4, given below:

$$Y_{LESS} = -5.728 - 2.825\,MlogP - 0.682\,O1 - 3.243\,IC5 + 7.745\,Mor29m \tag{3}$$

$$Y_{MORE} = -3.536 + 2.220\,MlogP + 0.536\,O1 + 2.548\,IC5 - 6.086\,Mor29m \tag{4}$$
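Applying the two discrimination functions is a direct transcription of Equations 3 and 4. In this hypothetical helper, the input values must already be autoscaled with the training-set mean and standard deviation, and the values used in testing are illustrative, not from Table 7:

```python
def sda_classify(ic5, mor29m, o1, mlogp):
    """Evaluate Y_LESS (Eq. 3) and Y_MORE (Eq. 4) for one compound with
    autoscaled descriptor values and return the winning class."""
    y_less = -5.728 - 2.825 * mlogp - 0.682 * o1 - 3.243 * ic5 + 7.745 * mor29m
    y_more = -3.536 + 2.220 * mlogp + 0.536 * o1 + 2.548 * ic5 - 6.086 * mor29m
    return "less active" if y_less > y_more else "more active"
```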

The way the method is used is based on the following steps:

(a) Initially, for each molecule, the values for descriptors (*IC5*, *Mor29m*, *O1* and *MlogP*) are computed;

(b) The values from (a) are inserted into the two discrimination functions (Equation 3 and Equation 4). However, since these equations were obtained from the autoscaled values of Table 1 (training set), the values from Table 7 (test set) must be autoscaled before being inserted into the equations;

based on the four descriptors used in the models: *IC5*, *Mor29m*, *O1* and *MlogP*, according to

Chemometric Study on Molecules with Anticancer Properties 197

Substituints

R2 R1 CH3

O

CH3 CH3 R1

R2 R1 CH3

26 27 28 29 30

OH O

31 32 33 34 35

Fig. 10. Compounds from the test set which must be classified as either less active or more

**Compound** PCA HCA KNN SIMCA SDA 26 - - - - - 27 + + + + + 28 - - - - - 29 - - - - - 30 + + + + + 31 - - - - - 32 - - - - - 33 - - - - - 34 + + + + + 35 + + + + +

Table 8. Predicted classification for unknown compounds from the test set through different methods. Minus sign (-) for a compound classified as less active while plus sign (+) for a

The result presented in Table 8 reveal that all samples (test set) receive the same classification by the four methods. Compounds **26**, **28**, **29**, **31**, **32** and **33** were classified as less active while compounds **27**, **30**, **34** and **35** were classified as more active. If you look for an explanation for such a pattern you will note that **26** and **27** present carboxylic acid group at the end of the chain, but only **27** is classified as more active. So it is possible that the change of an ester group by an amide group causes increase in activity. However when two amide groups are considered as occurs in **28** and **30** more carbon atoms in substituent means more active. Now comparing **26**, **29**, **31** and **32**, all of them have ester group associated with another different group and they all are classified as less active. The presence of the second group seams not to

All multivariate statistical methods (PCA, HCA, KNN, SIMCA and SDA) classified the 25 compounds from the training set into two distinct classes: more active and less active according to their degree of anticancer HepG2 activity. This classification was based on *IC5*,

modify activity too much. The same effect is found in **34** and **35**, both more active.

NH(CH2)2CONH2

R2

O

O(CH2)2COOCH3

R1

NH(CH2)4CONHCH3

R2 R1 CH3

O

R2 O

O(CH2)4

NH(CH2)4CONHCH3

H <sup>N</sup> <sup>O</sup>

R1

R2 O

Table 7.

R2 R1 CH3

O

R2 R1 CH3

> O O

active

O(CH2)2COOH

<sup>O</sup> <sup>O</sup>

compound classified as more active

**3. Conclusion**

R2 R1 CH3

O

R2 R1 CH3

> O O

NH(CH2)4COOH

O O


(c) The two values computed from (b) are compared. In case the value calculated from Equation 3 is higher than that from Equation 4, then the molecule is classified as less active. Otherwise, the molecule is classified as more active.

Table 6. Classification matrix obtained by using SDA with Cross Validation

Through SDA all compounds of the training set were classified as presented in Table 5. The classification error rate was 0% resulting in a satisfactory separation between more and less active compounds.

The reliability of the model is determined by carrying out a cross-validation test, which uses the leave-one-out technique. In this procedure, one compound is omitted of the data set and the classification functions are built based on the remaining compounds. Afterwards, the omitted compound is classified according to the classification functions generated. In the next step, the omitted compound is included and a new compound is removed, and the procedure goes on until the last compound is removed. The obtained results with the cross-validation methodology are summarized in Table 6. Since the total of correct information was 100%, the model can be believed as being a good model.


Table 7. Values of the four descriptors for the compounds from the test set

| Compound | *IC5* | *Mor29m* | *O1* | *MlogP* |
|---|---|---|---|---|
| 26 | 5.371 | -0.437 | -0.238 | 2.461 |
| 27 | 5.526 | -0.544 | -0.249 | 2.496 |
| 28 | 5.402 | -0.516 | -0.241 | 1.649 |
| 29 | 5.336 | -0.481 | -0.239 | 2.461 |
| 30 | 5.572 | -0.553 | -0.238 | 2.305 |
| 31 | 5.464 | -0.411 | -0.226 | 3.117 |
| 32 | 5.584 | -0.323 | -0.244 | 3.328 |
| 33 | 5.282 | -0.496 | -0.226 | 3.225 |
| 34 | 5.483 | -0.570 | -0.345 | 2.090 |
| 35 | 5.583 | -0.667 | -0.262 | 2.922 |

#### **2.6 Classification of unknown compounds**

The models built from the compounds of the training set through PCA, HCA, KNN, SIMCA and SDA can now be used to classify other compounds (test set, Fig. 10) whose anticancer activities are unknown. Ten compounds were therefore proposed here to verify whether they should be classified as less active or more active against a human hepatocellular carcinoma cell line, HepG2. They were not taken from the literature, so it is assumed that they have not been tested against this carcinoma. These compounds were selected so that they have substitutions at the same positions as those of the training set (R1 and R2) and the same types of atoms; keeping the main characteristics of the compounds that generated the models is important for achieving good predictions. The classification of the test set was based on the four descriptors used in the models: *IC5*, *Mor29m*, *O1* and *MlogP*, according to Table 7.


Fig. 10. Compounds from the test set which must be classified as either less active or more active


Table 8. Predicted classification of the unknown compounds from the test set by the different methods. A minus sign (-) marks a compound classified as less active; a plus sign (+) marks a compound classified as more active

The results presented in Table 8 reveal that all test-set samples receive the same classification from the four methods. Compounds **26**, **28**, **29**, **31**, **32** and **33** were classified as less active, while compounds **27**, **30**, **34** and **35** were classified as more active. Looking for an explanation for this pattern, note that **26** and **27** both present a carboxylic acid group at the end of the chain, yet only **27** is classified as more active; it is therefore possible that replacing the ester group by an amide group increases activity. Moreover, when two amide groups are present, as in **28** and **30**, more carbon atoms in the substituent correspond to higher activity. Comparing **26**, **29**, **31** and **32**, all of which have an ester group associated with a second, different group, all are classified as less active; the presence of the second group seems not to modify the activity much. The same effect is found in **34** and **35**, both more active.

#### **3. Conclusion**

All multivariate statistical methods (PCA, HCA, KNN, SIMCA and SDA) classified the 25 compounds from the training set into two distinct classes: more active and less active according to their degree of anticancer HepG2 activity. This classification was based on *IC5*,

*Mor29m*, *O1* and *MlogP* descriptors. They represent four distinct classes of interactions related to the molecules, especially between the molecules and the biological receptor: steric (*IC5*), 3D-MoRSE (*Mor29m*), electronic (*O1*) and molecular (*MlogP*).

A test set of ten molecules with unknown anticancer activity had its molecules classified, according to their predicted biological response, as more active or less active compounds. The results reveal the classes in which they are grouped. In general, molecules classified as more active should be regarded as more promising for cancer treatment than those classified as less active. The studies developed with PCA, HCA, KNN, SIMCA and SDA can therefore provide valuable insight into the experimental process of synthesis and biological evaluation of new artemisinin derivatives with activity against HepG2 cancer. Without chemometrics, no model and, consequently, no classification would be possible, unless one were a prophet!

The interfacial location of chemometrics, falling between measurements on one side and statistical and computational theory and methods on the other, poses a challenge to the new practitioner (Brown et al., 2009). The future of chemometrics lies in the development of innovative solutions to interesting problems. Some of the most exciting opportunities for innovation and new developments in the field of chemometrics lie at the interface between the chemical and biological sciences. These opportunities are made possible by the exciting new scientific advances and discoveries of the past decade (Gemperline, 2006).

Finally, after reading this chapter you will certainly have noticed that chemometrics is a useful tool in medicinal chemistry, especially when the great diversity of data is taken into account, because many conclusions can be reached. A study such as the one presented here, in which different methods are employed, is one example of how important chemometrics is in drug design. Applications of statistics to chemical data analysis aimed at the discovery of more efficacious drugs against diseases must therefore continue and will certainly help researchers.

#### **4. References**

Barbosa, J.; Ferreira, J.; Figueiredo, A.; Almeida, R.; Silva, O.; Carvalho, J.; Cristino, M.; Ciriaco-Pinheiro, J.; Vieira, J. & Serra, R. (2011). Molecular Modeling and Chemometric Study of Anticancer Derivatives of Artemisinin. *Journal of the Serbian Chemical Society*, Vol. 76, No. 9, (September 2011), pp. 1263-1282, ISSN 0352-5139

Brereton, R. (2007). *Applied Chemometrics for Scientists*, John Wiley & Sons, Ltd, ISBN 978-0-470-01686-2, West Sussex, England

Brereton, R. (2009). *Chemometrics for Pattern Recognition*, John Wiley & Sons, Ltd, ISBN 978-0-470-74646-2, West Sussex, England

Brown, S.; Tauler, R. & Walczak, B. (Eds.) (2009). *Comprehensive Chemometrics: Chemical and Biochemical Data Analysis*, Vol. 1, Elsevier, ISBN 978-0-444-52702-8, Amsterdam, The Netherlands

Bruns, R.; Scarminio, I. & Barros Neto, B. (2006). *Statistical Design - Chemometrics*, Elsevier, ISBN 978-0-444-52181-1, Amsterdam, The Netherlands

Cardoso, F.; Figueiredo, A.; Lobato, M.; Miranda, R.; Almeida, R. & Pinheiro, J. (2008). A Study on Antimalarial Artemisinin Derivatives Using MEP Maps and Multivariate QSAR. *Journal of Molecular Modeling*, Vol. 14, No. 1, (January 2008), pp. 39-48, ISSN 0948-5023

Doweyko, A. (2008). QSAR: Dead or Alive? *Journal of Computer-Aided Molecular Design*, Vol. 22, No. 2, (February 2008), pp. 81-89, ISSN 1573-4951

Efferth, T. (2005). Mechanistic Perspectives for 1,2,4-Trioxanes in Anti-cancer Therapy. *Drug Resistance Updates*, Vol. 8, No. 1-2, (February 2005), pp. 85-97, ISSN 1368-7646

Ferreira, M. (2002). Multivariate QSAR. *Journal of the Brazilian Chemical Society*, Vol. 13, No. 6, (November/December 2002), pp. 742-753, ISSN 1678-4790

Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Zakrzewski, V. G.; Montgomery, J. A., Jr.; Stratmann, R. E.; Burant, J. C.; Dapprich, S.; Millam, J. M.; Daniels, A. D.; Kudin, K. N.; Strain, M. C.; Farkas, O.; Tomasi, J.; Barone, V.; Cossi, M.; Cammi, R.; Mennucci, B.; Pomelli, C.; Adamo, C.; Clifford, S.; Ochterski, J.; Petersson, G. A.; Ayala, P. Y.; Cui, Q.; Morokuma, K.; Salvador, P.; Dannenberg, J. J.; Malick, D. K.; Rabuck, A. D.; Raghavachari, K.; Foresman, J. B.; Cioslowski, J.; Ortiz, J. V.; Baboul, A. G.; Stefanov, B. B.; Liu, G.; Liashenko, A.; Piskorz, P.; Komaromi, I.; Gomperts, R.; Martin, R. L.; Fox, D. J.; Keith, T.; Al-Laham, M. A.; Peng, C. Y.; Nanayakkara, A.; Challacombe, M.; Gill, P. M. W.; Johnson, B.; Chen, W.; Wong, M. W.; Andres, J. L.; Gonzalez, C.; Head-Gordon, M.; Replogle, E. S. & Pople, J. A. (1998). *Gaussian 98, Revision A.7*, Gaussian, Inc., Pittsburgh, PA

Fujita, T. (1995). *QSAR and Drug Design: New Developments and Applications*, Elsevier, ISBN 0-444-88615-X, Amsterdam, The Netherlands

Gareth, T. (2003). *Fundamentals of Medicinal Chemistry*, John Wiley & Sons, Ltd, ISBN 0-470-84307-1, West Sussex, England

Gemperline, P. (2006). *Practical Guide to Chemometrics* (2nd ed.), CRC Press, ISBN 1-57444-783-1, Florida, USA

Lai, H.; Nakase, I.; Lacoste, E.; Singh, N. & Sasaki, T. (2009). Artemisinin-Transferrin Conjugate Retards Growth of Breast Tumors in the Rat. *Anticancer Research*, Vol. 29, No. 10, (October 2009), pp. 3807-3810, ISSN 1791-7530

Levine, I. (1991). *Quantum Chemistry* (4th ed.), Prentice Hall, ISBN 0-205-12770-3, New Jersey, USA

Liu, Y.; Wong, V.; Ko, B.; Wong, M. & Che, C. (2005). Synthesis and Cytotoxicity Studies of Artemisinin Derivatives Containing Lipophilic Alkyl Carbon Chains. *Organic Letters*, Vol. 7, No. 8, (March 2005), pp. 1561-1564, ISSN 1523-7052

Manly, B. (2004). *Multivariate Statistical Methods: A Primer* (3rd ed.), Chapman and Hall/CRC, ISBN 9781584884149, London, England

Pinheiro, J.; Kiralj, R. & Ferreira, M. (2003). Artemisinin Derivatives with Antimalarial Activity against *Plasmodium falciparum* Designed with the Aid of Quantum Chemical and Partial Least Squares Methods. *QSAR & Combinatorial Science*, Vol. 22, No. 8, (November 2003), pp. 830-842, ISSN 1611-0218

Price, R.; van Vugt, M.; Nosten, F.; Luxemburger, C.; Brockman, A.; Phaipun, L.; Chongsuphajaisiddhi, T. & White, N. (1998). Artesunate versus Artemether for the Treatment of Recrudescent Multidrug-resistant Falciparum Malaria. *The American Journal of Tropical Medicine and Hygiene*, Vol. 59, No. 6, (December 1998), pp. 883-888, ISSN 0002-9637

Puzyn, T.; Leszczynski, J. & Cronin, M. (Eds.) (2010). *Recent Advances in QSAR Studies: Methods and Applications*, Springer, ISBN 978-1-4020-9783-6, New York, USA

Rajarshi, G. (2008). On the Interpretation and Interpretability of Quantitative Structure-Activity Relationship Models. *Journal of Computer-Aided Molecular Design*, Vol. 22, No. 12, (December 2008), pp. 857-871, ISSN 1573-4951

Varmuza, K. & Filzmoser, P. (2009). *Introduction to Multivariate Statistical Analysis in Chemometrics*, CRC Press, ISBN 9781420059472, Florida, USA

*Virtual Computational Laboratory, VCCLAB*. In: e-Dragon, 13.05.2010, Available from http://www.vcclab.org

Wold, S. (1995). Chemometrics, what do we mean with it, and what do we want from it? *Chemometrics and Intelligent Laboratory Systems*, Vol. 30, No. 1, (November 1995), pp. 109-115, ISSN 0169-7439




## **Electronic Nose Integrated with Chemometrics for Rapid Identification of Foodborne Pathogen**

Yong Xin Yu and Yong Zhao\*

*College of Food Science and Technology, Shanghai Ocean University, Shanghai, China* 

#### **1. Introduction**


Diseases caused by foodborne pathogens have been a serious threat to public health and food safety for decades and remain one of the major concerns of our society. There are hundreds of diseases caused by different foodborne pathogenic microorganisms, including pathogenic viruses, bacteria, fungi, parasites, marine phytoplankton and cyanobacteria (Hui, 2001). Among these, bacteria such as *Salmonella* spp., *Shigella* spp., *Escherichia coli*, *Staphylococcus aureus*, *Campylobacter jejuni*, *Campylobacter coli*, *Bacillus cereus*, *Vibrio parahaemolyticus* and *Listeria monocytogenes* are the most common foodborne pathogens (McClure, 2002); they can spread easily and rapidly when food, moisture and a favorable temperature are available (Bhunia, 2008).

Identification and detection of pathogens in clinical, environmental or food samples usually involve time-consuming growth in selective media followed by isolation and laborious biochemical and molecular diagnostic procedures (Gates, 2011). Many of these techniques are also expensive or not sensitive enough for the early detection of bacterial activity (Adley, 2006). The development of alternative analytical techniques that are rapid and simple has therefore become increasingly important, both to reduce the time invested in sample preparation and to enable real-time analyses.

It is well known that microorganisms can produce species-specific microbial volatile organic compounds (MVOCs), or odor compounds, which serve as an odor fingerprint (Turner & Magan, 2004). Early in this research area, the question arose as to whether odor fingerprints could be used, like DNA fingerprints, to identify or detect microbes in pure culture or in food samples; to date this remains a very interesting scientific question. Many studies (Bjurman, 1999; Kim et al., 2007; Korpi et al., 1998; Pasanen et al., 1996; Wilkins et al., 2003), especially those using analytical tools such as gas chromatography (GC) or gas chromatography coupled with mass spectrometry (GC-MS) for headspace analysis, have shown that microorganisms produce many MVOCs, including alcohols, aliphatic acids and terpenes, some of which have characteristic odors (Schnürer et al., 1999).

 \* Corresponding author. E-mail address: yzhao@shou.edu.cn


Fig. 1. Electronic nose devices mimic the human olfactory system.

The electronic devices simulate the different stages of the human olfactory system, resulting in volatile odor recognition, and can now be used to discriminate between different bacterial infections (Turner & Magan, 2004).

During the past three decades there has been significant research interest in the development of electronic nose (E-nose) technology for food, agricultural and environmental applications (Buratti et al., 2004; Pasanen et al., 1996; Romain et al., 2000; Wilkins et al., 2003). The term E-nose describes a machine olfaction system that mimics human olfaction and intelligently integrates a multitude of technologies, such as sensing technology, chemometrics, microelectronics and advanced soft computing (see Fig. 1). Basically, this device is used to detect and distinguish complex odors at low cost. Typically, an electronic nose consists of three parts: a sensor array that is exposed to the volatiles, conversion of the sensor signals to a readable format, and software analysis of the data to produce characteristic outputs related to the odor encountered. The output from the sensor array may be interpreted via a variety of chemometric methods (Capone et al., 2001; Evans et al., 2000; Haugen & Kvaal, 1998) such as principal component analysis (PCA), discriminant function analysis (DFA), cluster analysis (CA), soft independent modelling of class analogy (SIMCA), partial least squares (PLS) and artificial neural networks (ANN) to discriminate between different samples. The data obtained from the sensor array are comparative and generally not quantitative or qualitative in any way. The E-nose has the potential to be a sensitive, fast, one-step method to characterize a wide array of different volatile chemicals. Since the first intelligent electronic gas-sensing model was described, a significant amount of gas-sensing research has been focused on industrial applications.
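As a sketch of the first chemometric step mentioned above, the snippet below runs a minimal PCA (mean-centering followed by singular value decomposition) on a samples-by-sensors response matrix. The matrix values and the four-sensor, six-sample layout are invented for illustration, not real E-nose output.

```python
# Minimal PCA for E-nose data: samples-by-sensors matrix -> mean-center
# -> SVD -> scores on the leading principal components.
import numpy as np

def pca_scores(X, n_components=2):
    """Return PCA scores and the explained-variance ratio per component."""
    Xc = X - X.mean(axis=0)                   # mean-center each sensor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T         # project onto the loadings
    var_ratio = (s ** 2) / (s ** 2).sum()
    return scores, var_ratio[:n_components]

# Rows: headspace measurements; columns: relative sensor responses.
# Two well-separated groups of three samples (made-up values).
X = np.array([[0.10, 0.80, 0.30, 0.05],
              [0.12, 0.78, 0.33, 0.06],
              [0.11, 0.82, 0.31, 0.04],
              [0.55, 0.20, 0.70, 0.40],
              [0.57, 0.18, 0.72, 0.42],
              [0.54, 0.21, 0.69, 0.41]])
scores, ratio = pca_scores(X)
print(scores.shape)  # (6, 2): two score coordinates per sample
```

Plotting the two score columns against each other gives the familiar PCA score plot in which samples from the same class cluster together.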

Recently, some novel microbiological applications of the E-nose have been reported, such as the characterization of fungi (Keshri et al., 1998; Pasanen et al., 1996; Schnürer et al., 1999) and bacteria (Dutta et al., 2005; Pavlou et al., 2002a) and the diagnosis of disease (Gardner et al., 2000; Pavlou et al., 2002b; Zhang et al., 2000). It is increasingly clear that E-nose techniques coupled with different chemometric analyses of the odor fingerprint offer a wide range of applications in food microbiology, including the identification of foodborne pathogens.

### **2. Detection strategies**


Several E-nose devices have been developed, all of which comprise three basic building blocks: a volatile gas odor passes over a sensor array, the conductance of the sensors changes owing to the level of binding and results in a set of sensor signals, which are coupled to data-analysis software to produce an output (Turner & Magan, 2004).

The main strategy of E-nose-based foodborne pathogen identification is composed of three steps: headspace sampling, gas sensor detection and chemometrics analysis (see Fig. 2).

Fig. 2. Electronic nose and chemometrics for the identification of foodborne pathogen. The main strategy of foodborne pathogen identification based on E-nose.

#### **2.1 Headspace sampling**

Before analysis, the bacterial cultures should be transferred into standard 20 ml headspace vials and sealed with PTFE-lined Teflon caps to equilibrate the headspace. Sample handling is a critical step affecting the analysis by E-nose. The quality of the analysis can be improved by adopting an appropriate sampling technique. To introduce the volatile compounds present in the headspace (HS) of the sample into the E-nose's detection system, several headspace sampling techniques have been used in E-nose. Typically, the methods of headspace sampling (Ayoko, 2004) include static headspace (SHS) technique, purge and trap (P&T) technique, stir bar sorptive extraction (SBSE) technique, inside-needle dynamic

extraction (INDEX) technique, membrane introduction mass spectrometry (MIMS) technique and solid phase micro extraction (SPME) technique.

Unlike the other techniques, SPME has a considerable concentration capacity and is very simple because it does not require special equipment. The principle involves exposing a silica fibre covered with a thin layer of adsorbent to the HS of the sample in order to trap the volatile components onto the fibre. A SPME sampler consists of a fused silica fiber coated with a suitable polymer (e.g. PDMS, PDMS/divinylbenzene, carboxen/PDMS) and housed inside a needle. The fiber is exposed to the headspace volatiles and, after sampling is complete, is retracted into the needle; the adsorbed compounds are then desorbed by heating and introduced into the detection system. Apart from the nature of the adsorbent deposited on the fiber, the main parameters to optimize are the equilibration time, the sample temperature and the duration of extraction. Compared with other sampling methods, SPME is simple to use and reasonably sensitive, so it is a user-friendly pre-concentration method.

In our studies, the headspace sampling method of the E-nose was optimized for MVOC analysis. The samples were placed in the HS100 auto-sampler in arbitrary order. The automatic injection unit heated the samples to 37°C with an incubation time of 600 seconds. The temperature of the injection syringe was 47°C, and the delay time between two injections was 300 seconds. The adsorbed compounds were then desorbed by heating and introduced into the detection system (Yu Y. X., 2010a; Yu Y. X., 2010b).

#### **2.2 Gas sensor detection**

The most complicated part of the electronic olfaction process is odor capture and the sensor technology deployed for such capturing. Once the volatile compounds of the samples are introduced into the gas sensor detection system, the sensor array is exposed to the volatile compounds and the odor fingerprint of the samples is generated from the sensor response. Through chemical interaction between the volatile compounds and the gas sensors, the state of the sensors is altered, giving rise to electrical signals that are registered by the E-nose instrument. In this way the signals from the individual sensors represent a pattern that is unique for the gas mixture measured, and these sensor data are transformed into a matrix. The ideal sensors to be integrated in an electronic nose should fulfill the following criteria (Barsan & Weimar, 2001; James et al., 2005): high sensitivity toward the volatile chemical compounds, that is, the chemicals to be detected may be present in the concentration range of ppm or ppb, and the sensor should be sufficiently sensitive to small concentration levels of gaseous species within a volatile mixture, similar to the human nose (down to 10<sup>-12</sup> g/ml); low sensitivity toward humidity and temperature; medium selectivity, that is, they must respond to a range of different compounds present in the headspace of the sample; high stability; high reproducibility and reliability; high speed of response and short reaction and recovery times, that is, for online measurements the response time of the sensor should be in the range of seconds; reversibility, that is, the sensor should recover after exposure to gas; robustness and durability; easy calibration; easily processable data output; and small dimensions.

The E-nose used in our studies is a commercial instrument (FOX4000, Alpha M.O.S., Toulouse, France) with 18 metal oxide sensors (LY2/AA, LY2/G, LY2/gCT, LY2/gCTl, LY2/Gh, LY2/LG, P10/1, P10/2, P30/1, P30/2, P40/1, P40/2, PA2, T30/1, T40/1, T40/2, T70/2, TA2); this sensor array system is used for monitoring the volatile compounds produced by microorganisms. The descriptors associated with the sensors are shown in Table 1. FOX4000 E-nose measurements showed signals with maximum intensities changing with the type of sample, which indicates that discrimination is obtained.

Table 1. Sensor types and volatile descriptors of the FOX4000 E-nose

| Sensor | Volatile description | Sensor | Volatile description |
|---|---|---|---|
| LY2/AA | Alcohols, acetone, ammonia | T40/1 | Fluorine |
| LY2/GH | Ammonia, amine compounds | P40/1 | Fluorine, chlorine |
| LY2/G | Ammonia, amines, carbon oxygen compounds | T70/2 | Toluene, xylene, carbon monoxide |
| LY2/LG | Fluoride, chloride, oxynitride, sulphide | P30/1 | Hydrocarbons, ammonia, ethanol |
| LY2/gCT | Propane, butane | P30/2 | Hydrogen sulphide, ketones |
| LY2/gCTL | Hydrogen sulfide | P40/2 | Chlorine, hydrogen sulfide, fluoride |
| P10/1 | Nonpolar compounds: hydrocarbons, ammonia, chlorine | PA/2 | Ethanol, ammonia, amine compounds |
| P10/2 | Nonpolar compounds: methane, ethane | TA/2 | Ethanol |
| T30/1 | Polar compounds, hydrogen chloride | T40/2 | Chlorine |

Each sensor element changes its electrical resistance (Rmax) when exposed to volatile compounds. In order to produce consistent data for the classification, the sensor response is presented relative to the baseline electrical resistance in fresh air, which is the maximum change in the sensor electrical resistance divided by the initial electrical resistance, as follows:

Relative electrical resistance change = (Rmax - R0) / R0

where R0 is the initial baseline electrical resistance of the sensor and Rmax - R0 is the maximum change of the sensor electrical resistance. The baseline of the sensors was acquired in synthetic air saturated with steam at a fixed temperature. The relative electrical resistance change was used for data evaluation because it gives the most stable result and is more robust against sensor baseline variation (Siripatrawan, 2008).

Data of the relative electrical resistance changes from the 18 sensors are combined for every sample to form a matrix (see Fig. 2: the library database), and the data are used without preprocessing prior to chemometrics analysis. The sensor responses are stored in the computer through a data acquisition card, and these data sets are analyzed to extract information.

extraction (INDEX) technique, membrane introduction mass spectrometry (MIMS)

Unlike the other techniques, SPME has a considerable concentration capacity and is very simple because it does not require especial equipment. The principle involves exposing a silica fibre covered with a thin layer of adsorbent in the HS of the sample in order to trap the volatile components onto the fibre. The adsorbed compounds are desorbed by heating and introduced into the detection system. A SPME sampler consists of a fused silica fiber that is coated by a suitable polymer (e.g. PDMS, PDMS/divinylbenzene, carboxen/PDMS) and housed inside a needle. The fiber is exposed to headspace volatile and after sampling is complete, it is retracted into the needle. Apart from the nature of the adsorbent deposited on the fiber, the main parameters to optimize are the equilibration time, the sample temperature and the duration of extraction. Compared with other sampling methods, SPME is simple to use and reasonably sensitive, so it is a user-friendly pre-concentration method. In our studies, the headspace sampling method of E-nose was optimized for MVOCs analysis. The samples were placed in the HS100 auto-sampler in arbitrary order. The automatic injection unit heated the samples to 37°C with an incubation time of 600 seconds. The temperature of the injection syringe was 47°C. The delay time between two injections was 300 seconds. Then the adsorbed compounds are desorbed by heating and introduced

The most complicated part of electronic olfaction process is odor capture and sensor technology to be deployed for such capturing. Once the volatile compounds of samples are introduced into the gas sensor detection system, the sensor array is exposed to the volatile compounds and then the odor fingerprint of samples is generated from sensor respond. By chemical interaction between the volatile compounds and the gas sensors, the state of the sensors is altered giving rise to electrical signals that are registered by the instrument of Enose. In this way the signals from the individual sensor represent a pattern that is unique for the gas mixture measured and those data based on sensors is transformed to a matrix. The ideal sensors to be integrated in an electronic nose should fulfill the following criteria (Barsan & Weimar, 2001, James et al., 2005): high sensitivity toward the volatile chemical compounds, that is, the chemicals to be detected may be present in the concentration range of ppm or ppb, and the sensor should be sufficiently sensitive to small concentration level of gaseous species within a volatile mixture, similar to that of the human nose (down to 10−<sup>12</sup> g/ml); low sensitivity toward humidity and temperature; medium selectivity, that is, they must respond to a range of different compounds present in the headspace of the sample; high stability; high reproducibility and reliability; high speed of response, short reaction and recovery time, that is, in order to be used for online measurements, the response time of the sensor should be in the range of seconds; reversibility, that is, the sensor should be able to recover after exposure to gas; robust and durable; easy calibration; easily processable data

The E-nose used in our studies is a commercial equipment (FOX4000, Alpha M.O.S., Toulouse, France), with 18 metal oxide sensors (LY2/AA, LY2/G, LY2/gCT, LY2/gCTl, LY2/Gh, LY2/LG, P10/1, P10/2, P30/1, P30/2, P40/1, P40/2, PA2, T30/1, T40/2, T70/2,

technique and solid phase micro extraction (SPME) technique.

into the detection system (Yu Y. X., 2010a, Yu Y. X., 2010b).

**2.2 Gas sensor detection** 

output; and small dimensions.

T40/1, TA2), and this sensor array system is used for monitoring the volatile compounds produced by microorganism, and so on. The descriptors associated with the sensors are shown in Table 1. FOX4000 E-nose assay measurements showed signal with maximum intensities changing with the type of samples, which indicate that discrimination is obtained.


Table 1. Sensor types and volatile descriptors of FOX4000 E-nose.

Each sensor element changes its electrical resistance when exposed to volatile compounds. In order to produce consistent data for classification, the sensor response to a volatile chemical is expressed relative to the baseline electrical resistance in fresh air, that is, as the maximum change in the sensor electrical resistance divided by the initial electrical resistance:

Relative electrical resistance change = (Rmax − R0) / R0

where R0 is the initial baseline electrical resistance of the sensor and Rmax − R0 is the maximum change of the sensor electrical resistance. The baseline of the sensors was acquired in synthetic air with saturated steam at a fixed temperature. The relative electrical resistance change was used for data evaluation because it gives the most stable results and is more robust against sensor baseline variation (Siripatrawan, 2008).

The relative electrical resistance changes from the 18 sensors are combined for every sample to form a matrix (see Fig. 2: the library database), and the data are not preprocessed prior to chemometric analysis. The sensor responses are stored in the computer through a data acquisition card, and these data sets are analyzed to extract information.
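As a sketch of how such a response matrix might be assembled, the relative resistance change can be computed per sensor and stacked per sample; the resistance values below are synthetic stand-ins, not FOX4000 readings:

```python
import numpy as np

# Synthetic stand-ins for raw sensor readings (3 samples x 18 sensors):
# r0[i, j]   = baseline resistance of sensor j in fresh air for sample i
# rmax[i, j] = resistance extremum of sensor j during exposure to the headspace
rng = np.random.default_rng(0)
r0 = rng.uniform(1.0, 5.0, size=(3, 18))
rmax = r0 * rng.uniform(0.2, 0.9, size=(3, 18))  # resistance drops on exposure here

# Relative electrical resistance change, (Rmax - R0) / R0, for every sensor
response = (rmax - r0) / r0

# Each row of `response` is one sample's 18-element odor fingerprint,
# ready to be stacked into the library data matrix for chemometric analysis
print(response.shape)  # (3, 18)
```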

Electronic Nose Integrated with Chemometrics for Rapid Identification of Foodborne Pathogen 207


#### **2.3 Chemometrics analysis**

The signal matrix is interpreted by multivariate chemometric techniques such as PCA, PLS and ANN. Samples with similar odor fingerprints generally give rise to similar sensor response patterns, while samples with different odor fingerprints show differences in their patterns. The sensors of an E-nose can respond to both odorous and odorless volatile compounds.

These various chemometric methods are used in such studies according to the aim of the work. Generally speaking, chemometric methods can be divided into two types: unsupervised and supervised methods (Mariey et al., 2001). The objective of unsupervised methods is to explore the odor fingerprint data without prior knowledge about the bacteria studied. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) are major examples of unsupervised methods. Supervised methods, on the other hand, require prior knowledge of the sample identity. With a set of well-characterized samples, a model can be trained so that it can predict the identity of unknown samples. Discriminant analysis (DA) and artificial neural network (ANN) analysis are major examples of supervised methods.
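As an illustrative sketch of the supervised route, scikit-learn's `LinearDiscriminantAnalysis` can be trained on well-characterized fingerprints and then used to predict an unknown; the two classes and all numbers here are invented, not measured E-nose data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented 18-sensor fingerprints for two hypothetical, well-characterized classes
rng = np.random.default_rng(1)
class_a = rng.normal(loc=-0.5, scale=0.05, size=(10, 18))
class_b = rng.normal(loc=-0.3, scale=0.05, size=(10, 18))
X = np.vstack([class_a, class_b])
y = ["A"] * 10 + ["B"] * 10

# Train a discriminant-analysis model on the labeled samples
lda = LinearDiscriminantAnalysis().fit(X, y)

# Predict the identity of an "unknown" sample drawn near class A's profile
unknown = rng.normal(loc=-0.5, scale=0.05, size=(1, 18))
print(lda.predict(unknown))  # the model assigns the unknown to class "A"
```

The same train-then-predict pattern applies to the ANN classifiers mentioned above; only the model class changes.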

PCA is used to reduce the dimensionality of the data set to its most dominant components, or scores, while maintaining the relevant variation between the data points. PCA identifies the natural clusters in the data set, with the first principal component (PC) expressing the largest amount of variation, the second PC conveying the largest part of the remaining variation, and so forth (Di et al., 2009, Huang et al., 2009, Ivosev et al., 2008). Score plots can be used to interpret the similarities and differences between bacteria: the closer the samples are within a score plot, the more similar they are with respect to the principal component scores evaluated (Mariey et al., 2001). In our studies, the 18-sensor data of each sample are compared with those of the other samples in order to form homogeneous groups. A scatter plot can then be drawn to visualize the results, each sample being represented by a point.
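A compact sketch of such a score-plot computation on a synthetic 18-sensor matrix (the group centers and noise levels are invented, and scikit-learn's `PCA` stands in for whatever software was actually used):

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented response matrix: 4 groups x 5 replicates, 18 sensors each
rng = np.random.default_rng(2)
centers = rng.uniform(-0.8, -0.1, size=(4, 18))
X = np.vstack([c + rng.normal(scale=0.01, size=(5, 18)) for c in centers])

# Project onto the first two principal components for a 2-D score plot
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

print(scores.shape)                   # (20, 2): one (PC1, PC2) point per sample
print(pca.explained_variance_ratio_)  # fraction of variance carried by PC1 and PC2
```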

#### **3. Application of E-nose and chemometrics for bacteria identification**

With the successful applications of the E-nose described above having been published, the authors were interested in determining whether an E-nose would be able to identify bacteria, and a series of experiments was designed to determine this. In this part, bacteria identification at different levels (genus, species, strain) is used as an example to illustrate how this integrated technology can be applied to the effective identification of foodborne bacteria.

#### **3.1 At genus level**

In this study, three bacteria from three different genera, *Listeria monocytogenes*, *Staphylococcus lentus* and *Bacillus cereus*, were investigated by their odor fingerprints using the E-nose. The PCA result (Fig. 3a) shows that the fingerprints clearly differentiate the blank culture from the bacterial cultures, and that the three bacteria can be distinguished from each other by their odor fingerprints. Using cluster analysis to represent the sensor responses (Fig. 3b), it is also possible to obtain a clear separation between the blank control and the cultures inoculated with bacteria. The CA result also reveals that successful discrimination between bacteria of different genera is possible (Yu Y. X., 2010a).


Fig. 3(a). Principal components analysis (PCA) for the discrimination of three bacteria from different genera on the basis of E-nose. The plot displays clear discrimination between the four groups, accounting for nearly 99% of the variance within the dataset.

Fig. 3(b). Cluster analysis (CA) for the discrimination of three bacteria from different genera on the basis of E-nose (*S. lentus*: SL1-SL5, *B. cereus*: BC1-BC5, *L. monocytogenes*: LM1-LM5, control blank culture: CT1-CT5).



#### **3.2 At species level**

In this study, using the same collection methodology, the E-nose was tested for its ability to distinguish among bacterial pathogens at the species level. Four species of *Pseudomonas*, namely *Pseudomonas fragi*, *Pseudomonas fluorescens*, *Pseudomonas putida* and *Pseudomonas aeruginosa*, were investigated by their odor fingerprints using the E-nose. It is clear that the E-nose was able to distinguish among all specimens tested. The PCA result in Fig. 4(a) shows a representative experiment, where the individual species of bacteria clustered in separate groups, and *Pseudomonas fragi* in particular differs markedly from the three other bacteria in its odor fingerprint. The result of cluster analysis in Fig. 4(b) also reveals that successful discrimination between the different bacteria at the species level is possible.

Fig. 4(a). Principal components analysis (PCA) for the discrimination of four different species of *Pseudomonas* sp. on the basis of E-nose. The plot displays clear discrimination between the four groups, accounting for nearly 99% of the variance within the dataset.

Fig. 4(b). Cluster analysis (CA) for the discrimination of four different species of *Pseudomonas* sp. on the basis of E-nose (*P. fragi*: Pfr1-Pfr4, *P. fluorescens*: Pfl1-Pfl4, *P. putida*: Ppu1-Ppu4, *P. aeruginosa*: Pae1-Pae4).

#### **3.3 At strains level**


The next set of experiments involved testing the integrated method to see whether it could correctly differentiate bacterial samples at the strain level. In this study, four strains of *Vibrio parahaemolyticus*, named *V. parahaemolyticus* F01, *V. parahaemolyticus* F13, *V. parahaemolyticus* F38 and *V. parahaemolyticus* F54, were compared by their odor fingerprints using the E-nose. As shown in a representative data set in Fig. 5(a), the four strains of *V. parahaemolyticus* are separated from each other. However, the result from cluster analysis in Fig. 5(b) shows that some overlap appeared between *V. parahaemolyticus* F01 and *V. parahaemolyticus* F13, indicating that the odor fingerprints of these two strains may be too similar to identify by this method.
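The cluster analysis step can be sketched with SciPy's hierarchical clustering; the synthetic fingerprints below include two groups with nearly identical profiles, mimicking the F01/F13 overlap (all values are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Invented 18-sensor fingerprints: two distinct strains plus two overlapping ones
rng = np.random.default_rng(3)
f38 = rng.normal(-0.7, 0.01, size=(4, 18))
f54 = rng.normal(-0.2, 0.01, size=(4, 18))
f01 = rng.normal(-0.45, 0.01, size=(4, 18))
f13 = rng.normal(-0.45, 0.01, size=(4, 18))  # same mean as F01: overlapping fingerprints
X = np.vstack([f38, f54, f01, f13])

# Ward linkage on Euclidean distances; cut the dendrogram into three clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # the F01 and F13 samples end up sharing one cluster
```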


Fig. 5(a). Principal components analysis (PCA) for the discrimination of four different strains of *V. parahaemolyticus* on the basis of E-nose. The plot displays clear discrimination between the four groups, accounting for nearly 99% of the variance within the dataset.

Fig. 5(b). Cluster analysis (CA) for the discrimination of four different strains of *V. parahaemolyticus* on the basis of E-nose (*V.p* F01: F011-F014, *V.p* F13: F131-F134, *V.p* F38: F381-F384, *V.p* F54: F541-F544).

#### **4. Future perspectives**

Electronic nose technology is relatively new and holds great promise as a detection tool in the food safety area because it is portable and rapid and has potential applicability in foodborne pathogen identification and detection. On the basis of the work described above, we have demonstrated that the E-nose integrated with chemometrics can be used to identify pathogenic bacteria at the genus, species and strain levels.

As is known, bacteria respond to environmental triggers by switching to different physiological states. If such changes can be detected in the odor fingerprints, then E-nose analysis can produce information that can be very useful in determining virulence, conducting epidemiological studies, or determining the source of a food poisoning outbreak. Of course, the ability to produce information on the physiological state of a microorganism offers many potential benefits. Nevertheless, a variety of different fingerprints, produced under a variety of growth conditions, must be developed for each pathogen, for inclusion in the reference database. To avoid this complication, the pathogens should be cultured under controlled conditions. Otherwise, the identification algorithm must be capable of sorting through them all, to find a single, reliable, positive identification for the unknown.

Recently developed chemometric algorithms are particularly suited to the rapid analysis and depiction of such data. Chemometrics is one approach that may offer novel insights into our understanding of differences between microorganisms. Adopting appropriate chemometric methods will improve the quality of the analysis.

The odor fingerprinting method based on the E-nose is still in its infancy. Many recent technological advances, which are outside the scope of this chapter, can be used to transform the odor fingerprinting concept into user-friendly, automated systems for high-throughput analyses. The introduction of smaller, faster and smarter E-nose instrumentation to the market may also depend greatly on the embedding of chemometrics. In addition, more classification techniques based on odor fingerprinting may be developed to classify pathogens at exact levels such as genus, species and strain. Further investigation may contribute to making a distinction between pathogenic and non-pathogenic bacteria.

In short, the E-nose integrated with chemometrics is a reliable, rapid and economic technique which could be explored as a routine diagnostic tool for microbial analysis.

#### **5. Acknowledgments**

The authors acknowledge the financial support of the project of Shanghai Youth Science and Technology Development (Project No: 07QA14047), the Leading Academic Discipline Project of Shanghai Municipal Education Commission (Project No: J50704), the Shanghai Municipal Science and Technology Key Project of the Agriculture Flourishing Plan (Grant No: 2006, 10-5; 2009, 6-1), the Public Science and Technology Research Funds Projects of Ocean (Project No: 201105007), the Project of the Science and Technology Commission of Shanghai Municipality (Project No: 11310501100), and the Shanghai Ocean University Youth Teacher Fund (Project No: A-2501-10-011506).

#### **6. References**

Adley C, 2006. Food-borne pathogens: methods and protocols. *Humana Pr Inc*.

Ayoko GA, 2004. Volatile organic compounds in indoor environments. *Air Pollution*, 1-35.

Barsan N, Weimar U, 2001. Conduction model of metal oxide gas sensors. *Journal of Electroceramics* 7, 143-67.

Bhunia AK, 2008. Foodborne microbial pathogens: mechanisms and pathogenesis. *Springer Verlag*.

Bjurman J, 1999. Release of MVOCs from microorganisms. *Organic Indoor Air Pollutants*, 259-73.

Buratti S, Benedetti S, Scampicchio M, Pangerod E, 2004. Characterization and classification of Italian Barbera wines by using an electronic nose and an amperometric electronic tongue. *Anal Chim Acta* 525, 133-9.

Capone S, Epifani M, Quaranta F, Siciliano P, Taurino A, Vasanelli L, 2001. Monitoring of rancidity of milk by means of an electronic nose and a dynamic PCA analysis. *Sensors and Actuators B: Chemical* 78, 174-9.

Di CZ, Crainiceanu CM, Caffo BS, Punjabi NM, 2009. Multilevel functional principal component analysis. *Annals of Applied Statistics* 3, 458-88.

Dutta R, Morgan D, Baker N, Gardner JW, Hines EL, 2005. Identification of Staphylococcus aureus infections in hospital environment: electronic nose based approach. *Sensors and Actuators B: Chemical* 109, 355-62.

Evans P, Persaud KC, Mcneish AS, Sneath RW, Hobson N, Magan N, 2000. Evaluation of a radial basis function neural network for the determination of wheat quality from electronic nose data. *Sensors and Actuators B: Chemical* 69, 348-58.

Gardner JW, Shin HW, Hines EL, 2000. An electronic nose system to diagnose illness. *Sensors and Actuators B: Chemical* 70, 19-24.

Gates KW, 2011. Rapid Detection and Characterization of Foodborne Pathogens by Molecular Techniques. *Journal of Aquatic Food Product Technology* 20, 108-13.

Haugen JE, Kvaal K, 1998. Electronic nose and artificial neural network. *Meat Sci* 49, S273-S86.

Huang SY, Yeh YR, Eguchi S, 2009. Robust kernel principal component analysis. *Neural Comput* 21, 3179-213.

Hui YH, 2001. Foodborne Disease Handbook: Plant Toxicants. *CRC*.

Ivosev G, Burton L, Bonner R, 2008. Dimensionality reduction and visualization in principal component analysis. *Anal Chem* 80, 4933-44.

James D, Scott SM, Ali Z, O'hare WT, 2005. Chemical sensors for electronic nose systems. *Microchimica Acta* 149, 1-17.

Keshri G, Magan N, Voysey P, 1998. Use of an electronic nose for the early detection and differentiation between spoilage fungi. *Lett Appl Microbiol* 27, 261-4.

Kim JL, Elfman L, Mi Y, Wieslander G, Smedje G, Norbäck D, 2007. Indoor molds, bacteria, microbial volatile organic compounds and plasticizers in schools–associations with asthma and respiratory symptoms in pupils. *Indoor Air* 17, 153-63.

Korpi A, Pasanen AL, Pasanen P, 1998. Volatile compounds originating from mixed microbial cultures on building materials under various humidity conditions. *Appl Environ Microbiol* 64, 2914.

Mariey L, Signolle J, Amiel C, Travert J, 2001. Discrimination, classification, identification of microorganisms using FTIR spectroscopy and chemometrics. *Vibrational Spectroscopy* 26, 151-9.

Mcclure PJ, 2002. Foodborne pathogens: hazards, risk analysis, and control. *Woodhead Pub Ltd.*

Pasanen AL, Lappalainen S, Pasanen P, 1996. Volatile organic metabolites associated with some toxic fungi and their mycotoxins. *Analyst* 121, 1949-53.

Pavlou A, Turner A, Magan N, 2002a. Recognition of anaerobic bacterial isolates in vitro using electronic nose technology. *Lett Appl Microbiol* 35, 366-9.

Pavlou AK, Magan N, Mcnulty C, *et al.*, 2002b. Use of an electronic nose system for diagnoses of urinary tract infections. *Biosensors and Bioelectronics* 17, 893-9.

Romain AC, Nicolas J, Wiertz V, Maternova J, Andre P, 2000. Use of a simple tin oxide sensor array to identify five malodours collected in the field. *Sensors and Actuators B: Chemical* 62, 73-9.

Schnürer J, Olsson J, Börjesson T, 1999. Fungal volatiles as indicators of food and feeds spoilage. *Fungal Genetics and Biology* 27, 209-17.

Siripatrawan U, 2008. Rapid differentiation between E. coli and Salmonella Typhimurium using metal oxide sensors integrated with pattern recognition. *Sensors and Actuators B: Chemical* 133, 414-9.

Turner APF, Magan N, 2004. Electronic noses and disease diagnostics. *Nature Reviews Microbiology* 2, 161-6.

Wilkins K, Larsen K, Simkus M, 2003. Volatile metabolites from indoor molds grown on media containing wood constituents. *Environmental Science and Pollution Research* 10, 206-8.

Yu Y. X., Liu Y., Sun X. H., Pan Y. J., Zhao Y., 2010a. Recognition of Three Pathogens Using Electronic Nose Technology. *Chinese Journal of Sensors and Actuators* 23, 10-3.

Yu Y. X., Sun X. H., Pan Y. J., Zhao Y., 2010b. Research on Food-borne Pathogen Detection Based on Electronic Nose. *Chemistry Online (in Chinese)*, 154-9.

Zhang Q, Wang P, Li J, Gao X, 2000. Diagnosis of diabetes by image detection of breath using gas-sensitive laps. *Biosensors and Bioelectronics* 15, 249-56.

**Part 3**

**Technology**

**10** 


## **Chemometrics in Food Technology**

Riccardo Guidetti, Roberto Beghi and Valentina Giovenzana

*Department of Agricultural Engineering, Università degli Studi di Milano, Milano, Italy* 

#### **1. Introduction**

The food sector is one of the most important sectors of the economy, as it fulfills one of the main needs of man. The changes in society in recent years have radically modified the food industry by combining the concept of globalization with the revaluation of local production. Even though production needs to be global, in fact, there are always strong forces that tend to re-evaluate the expression of deeply local production, such as social history and centuries-old tradition.

The increase in productivity, in an ever-expanding market, has prompted a reorganization of control systems to maximize product standardization, ensure a high level of food safety, and promote greater uniformity among all batches produced. The protection of large production volumes, however, necessarily passes through systems able to highlight possible fraud throughout the production chain: from the raw materials (controlled by the producer) to the finished products (controlled by large sales organizations). Fraud also concerns the protection of local productions: products of guaranteed origin must be characterized in such a way that specific properties can be identified easily and detected by objective means.

Laboratories employ analytical techniques that are often inadequate because they require many samples, a long time to obtain a response, and staff with high analytical skills. In a context where speed is imperative, technological solutions must require few samples or, better, none at all (non-destructive techniques); they have to provide quick, if not immediate, answers, in order to allow the operator to decide rapidly on further steps to control the product or release it to the market; and they must be easy to use, to promote their adoption throughout the production chain, where it is not always possible to have analytical laboratories. The technologies must therefore be adapted to this new approach to production: the sensors and the related data modeling, which make the "measure" possible, are evolving to meet the needs of the agri-food sector. Trials often involve research institutions working alongside companies, a sign of great interest and a high level of expectations. The manufacturers of these technologies often provide devices that require calibration phases that are not always easy to perform and that are frequently themselves the subject of research. These are particularly complex when the modeling approach must be based on chemometrics.

This chapter is essentially divided into two parts: the first part analyzes the theoretical principles of the most important technologies currently used in the food industry that rely on


a chemometric approach for the analysis of data (Vis/NIR (visible and near infrared) and NIR (near infrared) spectrophotometry, and image analysis, with particular regard to hyperspectral image analysis and the electronic nose); the second part presents some case studies of particular interest related to the same technologies (fruit and vegetables, wine, meat, fish, dairy, olive oil, coffee, baked goods, etc.) (Frank & Todeschini, 1994; Massart et al., 1997 and 1998; Basilevsk, 1994; Jackson, 1991).

### **2. Technologies used in the food sector combined with chemometrics**

#### **2.1 NIR and Vis/NIR spectroscopy**

Among the non-destructive techniques, optical analysis in the near infrared (NIR) and visible-near infrared (Vis/NIR) regions has met with significant development in the last 20 years; it is based on the use of information arising from the interaction between the structure of food and light.

#### **2.1.1 Electromagnetic radiation**

Spectroscopic analysis is a group of techniques that allow information on the structure of matter to be obtained through its interaction with electromagnetic radiation.

Radiation is characterized by (Fessenden & Fessenden, 1993):

• the wavelength (λ), measured in nm;
• the frequency (ν), the number of oscillation cycles per unit of time, measured in hertz (cycles/s);
• the wave number, the reciprocal of the wavelength, measured in cm-1.


The entire electromagnetic spectrum is divided into several regions, each characterized by a range of wavelengths (Fig. 1).

Fig. 1. The electromagnetic spectrum (Lunadei, 2008).
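As a quick numerical illustration (ours, not from the chapter), the three quantities characterizing radiation are related by ν = c/λ and wave number = 1/λ; a minimal sketch:

```python
# Illustrative conversions between wavelength, frequency and wave number.
# Function names and the example wavelength are ours, not from the chapter.
C = 2.998e8  # speed of light in vacuum, m/s

def frequency_hz(wavelength_nm: float) -> float:
    """nu = c / lambda."""
    return C / (wavelength_nm * 1e-9)

def wavenumber_cm1(wavelength_nm: float) -> float:
    """Wave number = 1 / lambda, expressed in cm^-1 (1 nm = 1e-7 cm)."""
    return 1.0 / (wavelength_nm * 1e-7)

# A wavelength at the visible edge of the NIR region (750 nm):
print(f"{frequency_hz(750.0):.3e} Hz")      # about 4.0e14 Hz
print(f"{wavenumber_cm1(750.0):.0f} cm-1")  # about 13333 cm-1
```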

#### **2.1.2 Transitions in the near infrared region (NIR)**

Radiation in the infrared region is able to promote transitions at the vibrational level. Infrared spectroscopy is used to acquire information about the nature of the functional groups present in a molecule. The infrared region is conventionally divided into three subregions: near (750-2500 nm), mid (2500-50000 nm) and far infrared (50-1000 µm).

Fundamental vibrational transitions, namely those between the ground state and the first excited state, take place in the mid-infrared, while in the near-infrared region the absorption bands are due to transitions between the ground state and the second or third excited state. Transitions of this type are called overtones, and their absorption bands are generally very weak. The absorption bands associated with overtones can be identified and correlated with the corresponding absorption bands arising from the fundamental vibrational transitions because they fall at approximately integer multiples of the fundamental frequencies.
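A rough numerical sketch of where overtone bands fall, under the idealized harmonic approximation (real bands are shifted somewhat by anharmonicity; the 3000 nm fundamental below is a hypothetical mid-infrared band, not a value from the chapter):

```python
# Harmonic approximation: an overtone near n times the fundamental
# frequency sits near the fundamental wavelength divided by n.
def overtone_wavelength_nm(fundamental_nm: float, multiple: int) -> float:
    """Approximate wavelength of the overtone at `multiple` x the fundamental frequency."""
    return fundamental_nm / multiple

fundamental = 3000.0  # nm, hypothetical mid-infrared fundamental
print(overtone_wavelength_nm(fundamental, 2))  # 1500.0 nm: first overtone, in the NIR
print(overtone_wavelength_nm(fundamental, 3))  # 1000.0 nm: second overtone
```

This is why the weak NIR bands can be traced back to their mid-infrared fundamentals, as the text notes.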

Following the absorption of photons by the molecules, the intensity of the radiation decreases. The law that governs the absorption process is known as the Beer-Lambert law:

$$A = \log\left(I_0/I\right) = \log\left(1/T\right) = \varepsilon \cdot l \cdot c \tag{1}$$

where:


A = absorbance [log (incident light intensity/transmitted beam intensity)];
T = transmittance [transmitted beam intensity/incident light intensity];
I0 = radiation intensity before interacting with the sample;
I = radiation intensity after interaction with the sample;
ε = molar extinction coefficient characteristic of each molecule (l·mol-1·cm-1);
l = optical path length crossed by radiation (cm);
c = sample concentration (mol/l).
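A minimal numerical sketch of Eq. (1); the extinction coefficient and intensities below are hypothetical values chosen for illustration:

```python
import math

def absorbance(i0: float, i: float) -> float:
    """A = log10(I0 / I), as in Eq. (1)."""
    return math.log10(i0 / i)

def concentration_mol_l(a: float, eps: float, path_cm: float) -> float:
    """Invert Eq. (1): c = A / (eps * l)."""
    return a / (eps * path_cm)

# If 90% of the incident light is absorbed, T = 0.1 and A = 1:
a = absorbance(i0=1.0, i=0.1)
print(a)  # 1.0
# With a hypothetical eps of 5000 l/(mol*cm) and a 1 cm optical path:
print(concentration_mol_l(a, eps=5000.0, path_cm=1.0))  # 0.0002 mol/l
```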

The spectrum is a graph in which the abscissa reports a quantity related to the nature of the radiation, such as the wavelength (λ) or the wave number, and the ordinate a quantity related to the change in the intensity of the radiation, such as absorbance (A) or transmittance (T).

#### **2.1.3 Instruments**

Since the 1970s, producers have developed instruments specifically for NIR analysis, trying to simplify them to suit less skilled users as well, thanks to integrated statistical software and to partial automation of the analysis.

Instruments built in this period can be divided into three groups: desk instruments, compact portable instruments and on-line compatible devices.

Devices also evolved over the years in the systems employed to select wavelengths. The first instruments used filter devices able to select only certain wavelengths (Fig. 2). These devices are efficient when specific wavelengths are needed. Since the second half of the '80s, instruments capable of acquiring the sample spectrum simultaneously over a specific interval of wavelengths were introduced, recording the average spectrum of a single defined sample area (diode-array systems and FT-NIR instruments) (Stark & Luchter, 2003). At the same time, the growth of chemometric data analysis helped to diffuse NIR analysis.
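A minimal sketch (ours, not a procedure from the chapter) of the kind of chemometric step bundled with such instruments: a principal component analysis of a matrix of spectra. The synthetic "spectra" below are a single Gaussian band with random intensity, standing in for real NIR measurements:

```python
import numpy as np

# Build 30 synthetic NIR "spectra": one hypothetical absorption band whose
# intensity varies from sample to sample, plus instrument noise.
rng = np.random.default_rng(0)
wavelengths = np.linspace(750, 2500, 200)               # nm, NIR range
band = np.exp(-((wavelengths - 1450.0) / 120.0) ** 2)   # hypothetical band
X = np.outer(1.0 + 0.1 * rng.standard_normal(30), band)
X += 0.01 * rng.standard_normal(X.shape)

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
print(f"PC1 explains {explained[0]:.0%} of the variance")
```

With a single underlying source of variation, the first component captures most of the variance, which is the kind of dimensionality reduction that made high-dimensional NIR data practical to model.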


Fig. 2. Development of the different analysis technologies scheme (Stark & Luchter, 2003).

The food sector, in particular, has shown interest in NIR and Vis/NIR instruments, both mobile and on-line. Devices based on diode-array spectrophotometers and FT-NIR desk systems proved to be the best suited to this sector.

In both portable and stationary instruments, the fundamental components of these systems are the same and are four: the light source, the light radiation transport system, the sample compartment and measurement zone, and the spectrophotometer with its personal computer.


#### **Light source**

Tungsten-filament halogen lamps are chosen as the light source in most instruments. This is due to a good compromise between good performance and relatively low cost. This type of lamp is particularly suitable for low-voltage use. A small drawback may be the filament's sensitivity to vibration.

Halogen bulbs are filled with halogen gas to extend their lives by exploiting the return of evaporated tungsten to the filament. The life of the lamp depends on the design of the filament and the temperature of use, and on average ranges from a minimum of 50 hours to a maximum of 10,000 hours at rated voltage. The lamp should be chosen according to the conditions of use and the spectral region of interest. An increase in the voltage of the lamp may shift the peaks of the emission spectrum towards the visible region but can also reduce its useful life by 30%. On the contrary, use of lower voltages can increase the lamp life, at the cost, however, of a reduction in the intensity of the light radiation, especially in the visible region. The emission spectrum changes as a function of the temperature and emissivity of the tungsten filament. The spectrum shows high intensity in the VNIR region (the part of the NIR region close to the visible).
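The shift of the emission peak with filament temperature can be sketched with Wien's displacement law for an ideal black body (tungsten is not a perfect black body, so this is only indicative, and the temperatures below are hypothetical):

```python
WIEN_B_NM_K = 2.898e6  # Wien displacement constant, nm*K

def peak_wavelength_nm(temp_k: float) -> float:
    """Black-body approximation of the filament's emission peak (Wien's law)."""
    return WIEN_B_NM_K / temp_k

print(round(peak_wavelength_nm(3000.0)))  # 966 nm: in the VNIR, as noted above
# A higher voltage raises the filament temperature, shifting the peak
# towards the visible:
print(round(peak_wavelength_nm(3300.0)))  # 878 nm
```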


Even if less common, alternative light sources are available; for example, LED and laser sources could be used. LED sources (light-emitting diodes) are certainly interesting thanks to their efficiency and small size. They have met, however, with limited distribution due to the limited availability of LEDs emitting at wavelengths in the NIR region. The technology to produce LEDs covering most of the NIR region already exists, but demand for this type of light source is currently too low, and the development of commercial products of this type is still at an early stage.

The use of laser sources guarantees a very intense emission in a narrow band, but the reduced spectral range covered by each specific laser source can cause problems in some applications. In any case, the complexity and high cost of these devices have so far limited their use very much, mostly to the world of research.

#### **Light radiation transport system**

The light source must be very close to the sample to illuminate it with good intensity. This is not always possible, so systems able to convey light onto the sample are needed. Optic fibers solved this problem, allowing the development of devices of different shapes.

The use of fiber optics makes it possible to separate the area where the instrument is placed from the actual measuring area. There are indeed numerous circumstances on product sorting lines in which the environmental conditions do not allow the direct installation of measuring instruments: for example, high temperature, excessive vibration or lack of space are factors restricting the use of on-line NIR devices. In all these situations, optic fibers are the solution to the problem of conveying light. They transmit light from the lamp to the sample and from the sample to the spectrophotometer. Thanks to their small dimensions, they allow an immediate measurement on a localized sample area, reaching areas difficult to access. Furthermore, they are made of a dielectric material that protects from electric and electromagnetic interference. 'Optic fibers' means optically transparent fibers, purposely designed to transmit light by the phenomenon of total internal reflection. The internal reflection is said to be total because it is highly efficient: more than 99.999% of the radiation energy is transmitted at every reflection. This means that the radiation can be reflected thousands of times along the way without suffering an appreciable attenuation of intensity (Osborne et al., 1993).

An optic fiber consists of an inner core, a covering zone and an external protective cover. The core is usually made of pure silica, but plastics or special glasses can also be used. The cladding area consists of a material with a lower refractive index, while the exterior serves only to protect the fiber from mechanical, thermal and chemical stress.

Figure 3 shows the inner core and the cladding of an optical fiber. The index of refraction of the inner core has to be bigger than that of the cladding. Each ray of light that penetrates the fiber at an angle ≤ θmax (the acceptance angle) is totally reflected with high efficiency within the fiber.
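The acceptance angle θmax follows from the two refractive indices via the numerical aperture, NA = sqrt(n_core² - n_cladding²), a standard fiber-optics relation; the indices below are hypothetical, silica-like values, not values from the chapter:

```python
import math

def acceptance_angle_deg(n_core: float, n_clad: float, n_outside: float = 1.0) -> float:
    """theta_max = asin(NA / n_outside), with NA = sqrt(n_core^2 - n_clad^2)."""
    na = math.sqrt(n_core ** 2 - n_clad ** 2)
    return math.degrees(math.asin(na / n_outside))

# Hypothetical silica core with a slightly lower-index cladding, in air:
print(f"{acceptance_angle_deg(1.46, 1.44):.1f} degrees")  # about 14 degrees
```

Rays entering within this cone propagate by total internal reflection; rays outside it leak into the cladding and are absorbed by the buffer.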

#### **Sample compartment and measurement zone**

The sample compartment and measurement zone are strongly influenced by the technique used to acquire the spectra. Different techniques are employed depending on the type of sample (solid or liquid, small or large, measured at rest or in line), and these influence the geometry of the measurement zone.

Fig. 3. Scheme of an optical fiber. The acceptance cone is determined by the critical angle for incoming light radiation. The buffer absorbs the radiation not reflected by the cladding. The jacket has a protective function.

The techniques to acquire spectra are four: transmittance, reflectance, transflectance and interactance. They differ mainly in the positioning of the light source and of the measurement sensor around the sample (Fig. 4).

a. **Transmittance** - Transmittance measurements are based on the acquisition of spectral information by measuring the light that goes through the whole sample (Lu & Ariana, 2002). Analysis in transmittance can explore much of the internal structure of the product, which makes it a technique particularly well suited to detecting internal defects. To achieve significant results with this technique, a high-intensity light source and a high-sensitivity measuring device are required, because the intensity of the light able to cross the product is often very low. Transmittance measurements generally require a particular geometry of the measuring chamber, which can greatly influence the design of the instrument.

b. **Reflectance** - This technique measures the component of radiation reflected by the sample. The radiation is not reflected at the surface but penetrates a few millimeters into the sample, where it is partly absorbed and partly reflected back. By measuring this component of reflected radiation after its interaction with the sample, it is possible to establish a relationship of proportionality between reflectance and the analyte concentration in the sample. The reflectance technique is well suited to the analysis of solid matrices because the intensity levels of the light radiation after the interaction with the sample are high. This technique also allows the bundle of fibers that illuminates the sample, and the fibers leading the radiation to the spectrophotometer after the interaction with the product, to be packed into the limited space of a tip. The use of this acquisition technique is therefore particularly versatile and suitable for compact, portable instruments designed for use in the field or on the process line. The major drawback of this technique is that only the outer area of the sample can be investigated, without the chance to go deep inside.

c. **Transflectance** - This technique is used when it is preferable to have a single point of measurement, as in the case of acquisitions in reflectance. In this case, however, the incident light passes through the whole sample, is reflected by a special reflective surface, recrosses the sample and strikes the sensor located near the area of illumination. The incident light thus makes a double passage through the sample. Obviously, this technique can be used only with samples very permeable to light radiation, such as partially transparent fluids; it is therefore not applicable to solid samples.

d. **Interactance** - This technique is considered a hybrid between transmittance and reflectance, as it uses characteristics of both techniques seen previously. The light source and the sensor are located in areas near the sample but physically separated from each other, so the radiation reaches the measurement sensor after interacting with part of the internal structure of the sample. This technique is mainly used in the analysis of large solid samples, for example a whole fruit. Interactance is thus a compromise between reflectance and transmittance and combines a good ability to detect internal defects of the product with a good intensity of light radiation. This analysis is widely used in static equipment where, through the use of special holders, the separation between the area of incidence of the light radiation and the area where the sensor is placed is easily obtained. It is instead difficult to use this configuration on-line, because it is complicated to place a barrier between the incident and the returning light to the sensor directly on the process line.

Fig. 4. Setup for the acquisition of (a) reflectance, (b) transmittance, and (c) interactance spectra, with (i) the light source, (ii) fruit, (iii) monochromator/detector, (iv) light barrier, and (v) support. In interactance mode, light due to specular reflection is physically prevented from entering the monochromator by means of a light barrier (Nicolai et al., 2007).

#### **Spectrophotometer and Personal Computers**

The spectrophotometer can be considered the heart of an instrument for NIR analysis. The technology employed for wavelength selection greatly influences the performance of the instrument. For example, the use of filters limits an instrument to recording the signal of a single wavelength at a time, whereas modern instruments (diode-array instruments and interferometers) can record the spectrum over the entire wavelength range.

Chemometrics in Food Technology 225



Instruments equipped with a diode-array spectrophotometer are those that have seen the widest use in portable and on-line applications in the food sector, owing to their compact size, versatility and robustness (there are no moving parts during operation) and to their relatively low cost.

As seen before, the fiber-optic sensor collects the portion of the electromagnetic radiation that has interacted with the internal structure of the sample and transfers it to the spectrophotometer. The optical fiber is connected to the optical bench of the instrument, which decomposes the electromagnetic radiation and records the intensity at the different wavelengths.

The optical bench of this type of instrument generally consists of five components:

a. Optical fiber connector: connects the optical fiber with the optical bench of the instrument.
b. First spherical mirror (collimating mirror): collimates the light and sends it to the diffraction grating.
c. Diffraction grating: here the light is split into its different wavelengths and sent to the second spherical mirror.
d. Second spherical mirror (focusing mirror): collects the diffracted radiation from the grating and sends it to the CCD sensor.
e. Matrix CCD sensor (diode array): records the signal intensity at each wavelength.
The high sensitivity of the CCD matrix sensor compensates for the low intensity of the incoming light radiation due to the reduced diameter of the optical fibers used. The sensors employed are generally Si-diode arrays or InGaAs-diode arrays. The former are by far the most common and the cheapest; they allow acquisition of the spectrum in the range between 400 and 1100 nm and are therefore used for Vis/NIR analysis. InGaAs sensors, which are more expensive, are used in applications requiring the acquisition of spectra at longer wavelengths; their useful range runs from about 900 to 2300 nm.

The signal recorded by the CCD sensor is digitized and acquired by a PC through the instrument's management software, which records the spectrum of the analyzed sample and displays it graphically. The management software also interfaces with the spectrophotometer, making it possible to change some parameters during the acquisition of the spectra.

#### **2.2 Image analysis**

In the food industry there has for some time been a growing interest in image analysis techniques, since the appearance of a food carries a variety of information directly related to the quality of the product itself, and these characteristics are difficult to measure with classical methods of analysis. In addition, image analysis techniques provide information far more accurate than human vision, are objective and consistent over time, and offer the great advantage of being non-destructive. These features enable vision systems to be used in real time on process lines, allowing on-line control and automation of sorting and classification within the production cycle (Guanasekaran & Ding, 1994).

The objective of applying image analysis techniques in the food sector is the quantification of the geometric and densitometric characteristics of an image, acquired in a form that represents meaningful information (at the macro- and microscopic level) about the appearance of an object (Diezak, 1988). The evolution of these techniques and their implementation in vision machines, in the form of hardware and specialized software, allows a wide flexibility of applications, a high computing capacity and a rigorous statistical approach.

The benefits of image analysis techniques (Brosnan & Sun, 2004) that rely on the use of machine vision systems can be summarized as follows:

a. they are non-destructive techniques;
b. they are user-friendly, rapid, precise, accurate and efficient;
c. they generate objective data that can be recorded for deferred analysis;
d. they allow a complete analysis of the lots and not just of a single sample of the lot;
e. they reduce the involvement of human personnel in tedious tasks and allow the automation of functions that would otherwise require intensive work shifts;
f. they come at a reasonably affordable cost.

These benefits explain why scientific research in the agro-food sector has devoted itself to the study of machine vision systems for analyzing the internal and external quality characteristics of food, evaluated according to the optical properties of the products. With a suitable light source, it is possible to extract information about color, shape, size and texture. From these features many objective aspects of the sample can be determined and correlated, through statistical analysis, with defined quality parameters (degree of maturation, presence of external or internal mechanical defects, class, etc.) (Du & Sun, 2006; Zheng et al., 2006).

Image analysis has several applications in the food industry, as a descriptor or as a gastronomic and technological parameter. Vision machines can also be used to measure size, structure and color in order to quantify macro- and microscopic surface defects of a product, to characterize and identify foods, or to monitor shelf life (Riva, 1999). "Image analysis" is a broad designation that includes, in addition to classical studies on grayscale and RGB images, the analysis of images collected through multiple spectral channels (multispectral) or, more recently, hyperspectral images, a technique exploited for its full extension in the spectral direction.

Hyperspectral imaging (chemical and spectroscopic imaging) is an emerging, non-destructive technology that complements conventional imaging with spectroscopy in order to obtain both spectral and spatial information from an object. Hyperspectral images are digital images in which each element (pixel) is not a single number or a set of three numbers, as in color (RGB) pictures, but a whole spectrum associated with that point. They are three-dimensional blocks of data. Their main advantage is that they provide the spatial information necessary for the study of non-homogeneous samples, making it possible to detect even minor constituents of a foodstuff that are spatially isolated.

To support image analysis, chemometric techniques are necessary to process and model the data sets in order to extract the highest possible information content. Methods of classification, modeling, multivariate regression, similarity analysis, principal component analysis, experimental design and optimization must be applied according to the different conditions and needs.




#### **2.2.1 The vision system**

Machine vision systems, which appeared in the early sixties and have since spread to many fields of application, are composed of a lighting system and a data acquisition system connected to a computer via a capture card, which digitizes (converts into numerical form) and stores the analogic electrical signal output by the camera sensor (Russ et al., 1988). The scanned image is thus "converted" into a numerical matrix. The captured images are then elaborated by appropriate processing software in order to extract the useful information. Figure 5 shows an example of a vision machine.

Fig. 5. Example of image acquisition of an inspected sample.

Not only the position of the light source but also its type (incandescent, halogen, fluorescent, etc.) influences the performance of the analysis. Light sources emitting electromagnetic radiation in the visible (VIS, 400-700 nm), ultraviolet (UV, 10-400 nm) and near-infrared (NIR, 700-1400 nm) ranges are the most widely used, but other types of source, emitting different radiation, can also be used to create a digital image, depending on the purpose of the analysis. For example, to examine the internal structure of objects and/or identify internal defects, an X-ray source can be used; although this type of source gives good results, its application is much more widespread in the medical field than in the agro-food sector, owing to the high cost of the equipment and the low operating speed.

#### **2.2.2 The digital image**

A digital image is generated by converting the analogic video signal produced by a digital camera into an electronic signal (scanning), which is then stored in the memory of a PC in the form of binary information. Any digital image can be considered an array of points, the pixels, which constitute the smallest elements of the image. Each pixel contains a double set of information: its position in space, identified by the values (x, y), and the value of its intensity. Digital images can be represented using only two colors, typically black and white (binary image), shades of gray (monochrome image) or a range of colors (multichannel image). The value of the light intensity differs depending on the type of image.

In binary images, pixels can take only two intensity values: 0 (black) or 1 (white). In monochrome images the intensity value lies within a range, the gray scale, from 0 to L, which usually corresponds to the interval from 0 to 255 (or 1 to 256): a value of 0 corresponds to black, a value of 255 to white, and intermediate values to the various shades of gray.

Finally, in multichannel images the color of each pixel is identified by three or four values, depending on the reference color model. For example, in the RGB color space each pixel is characterized by three values, each between 0 and 255, corresponding to the intensities in the red, green and blue channels. When all three values are 0 the object is black; when all three are at their maximum the object is white; and equal levels of R, G and B generate gray. Images of this type can in fact be considered a three-dimensional matrix, consisting of three overlapping matrices with the same number of rows and columns, where the elements of the first matrix represent the pixel intensities in the red channel, those of the second matrix in the green channel and those of the third matrix in the blue channel.
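The three-matrix representation just described maps directly onto a three-dimensional array; a minimal NumPy sketch (the 2x2 image and its pixel values are invented for illustration):

```python
import numpy as np

# A 2x2 RGB image: shape (rows, columns, channels), intensities 0-255.
img = np.zeros((2, 2, 3), dtype=np.uint8)

img[0, 0] = [0, 0, 0]        # black: all three channels at 0
img[0, 1] = [255, 255, 255]  # white: all three channels at maximum
img[1, 0] = [128, 128, 128]  # equal R, G and B levels give gray
img[1, 1] = [255, 0, 0]      # pure red

# The three overlapping monochrome matrices (channels):
red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
print(red)
```

Each channel extracted this way is itself a monochrome image, which is the basis of the multispectral generalization discussed next.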

#### **2.2.3 Multispectral and hyperspectral images**

RGB images, represented by three overlapping monochrome images, are the simplest example of multichannel images. In medical, geotechnical, materials-analysis and remote-sensing applications, however, sensors capable of acquiring multispectral and hyperspectral images, two particular types of multichannel images, are often used. Multispectral images are typically acquired in three to ten fairly spaced spectral bands, in the visible range but also in the IR (Aleixos et al., 2002). In this way it is possible to extract more information from the images than is normally obtained from the analysis of RGB images.

Examples of bands normally used in this type of analysis are blue (430-490 nm), green (491-560 nm), red (620-700 nm), NIR (700-1400 nm) and MIR (1400-1750 nm). Different spectral combinations can be chosen depending on the purpose of the analysis. The NIR-R-G combination (near-infrared, red, green) is often used to identify green areas in satellite images, because vegetation reflects strongly at NIR wavelengths. The NIR-R-B combination (near-infrared, red, blue) is very useful for verifying the ripening stage of fruit, because chlorophyll shows an absorption peak at red wavelengths. Finally, the NIR-MIR-blue combination is useful for observing sea depth and green areas in remote-sensing images.
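Band combinations of this kind are often condensed into simple indices; for instance, the contrast between NIR and red reflectance that reveals vegetation can be expressed as the normalized difference (NIR - R)/(NIR + R). A sketch with invented pixel values (the 0.4 threshold is likewise illustrative):

```python
import numpy as np

# Reflectance in the red and NIR bands for a 2x2 scene (invented values):
red = np.array([[0.08, 0.30],
                [0.07, 0.28]])
nir = np.array([[0.50, 0.32],
                [0.55, 0.30]])

# Normalized difference index: close to 1 for vegetation (high NIR,
# low red reflectance), close to 0 for bare surfaces.
ndvi = (nir - red) / (nir + red)

vegetation_mask = ndvi > 0.4
print(vegetation_mask)
```

Computed pixel by pixel, such an index turns two spectral bands into a single interpretable map, which is how multispectral data are typically reduced for classification.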

Hyperspectral imaging (HSI) combines spectroscopy and traditional imaging to form a three-dimensional structure of multivariate data (the hypercube). Hyperspectral images consist of many spectral bands acquired in a narrow and contiguous way, allowing each pixel to be analyzed at multiple wavelengths simultaneously and, therefore, a full spectrum to be associated with each single pixel. The set of data constituting a hyperspectral image can be thought of as a kind of data cube, with two spatial directions, ideally resting on the observed surface, and one spectral dimension. By extracting a horizontal plane from the cube it is possible to obtain a monochrome image, while the set of values corresponding to a fixed position (x, y) in the plane is the spectrum of a pixel of the image (Fig. 6).

Fig. 6. Example of hyperspectral image.
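The cube structure just described translates directly into array indexing: fixing a wavelength yields a monochrome image, and fixing a spatial position yields the spectrum of that pixel. A minimal sketch on a randomly generated hypercube (the dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypercube: two spatial dimensions (50x50 pixels) and one spectral
# dimension (100 wavelength channels).
cube = rng.random((50, 50, 100))

# A horizontal plane of the cube = monochrome image at one wavelength:
band_image = cube[:, :, 40]

# The values at a fixed position (x, y) = spectrum of that pixel:
pixel_spectrum = cube[10, 25, :]

print(band_image.shape)      # (50, 50)
print(pixel_spectrum.shape)  # (100,)
```

The same cube can thus be read either as a stack of images or as a grid of spectra, which is exactly the duality that makes chemometric processing of hyperspectral data possible.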

With hyperspectral imaging, spectra can be acquired in reflectance, transmittance or fluorescence mode, depending on the kind of sample to be analyzed, although most of the scientific works present in the literature use spectral images acquired in reflectance, transmission or emission.

The significant time savings that it can bring to industrial production processes encourage the use of this instrumentation. Hyperspectral image analysis has many advantages, but it still has some defects. The advantages of using hyperspectral analysis in the agro-food sector can be summarized as follows:

a. no preparation of the test sample is necessary;
b. it is a non-invasive, non-destructive methodology, avoiding the loss of samples that can be used for other purposes or analyses;
c. it can be regarded as an economical tool, saving time, labor and reagents, with a strong cost saving for waste treatment;
d. for each pixel of the analyzed sample the full spectrum is available, not just an absorbance value at a few wavelengths;
e. many constituents can be determined simultaneously within a sample, such as color and morphological characteristics;
f. thanks to its high spectral resolution, both qualitative and quantitative information can be estimated;
g. it is also possible to select a single region of interest of the sample and save it in a spectral library.

As mentioned previously, one of the advantages of HSI is the large volume of data available in each hypercube with which to create the calibration and validation sets. However, the data also contain redundant information. This abundance has two drawbacks: the high computational load imposed by the size of the data, and the long acquisition times needed to collect them (Firtha et al., 2008). It is therefore desirable to reduce the load to manageable levels, especially if the goal is to apply HSI techniques in real time, on-line, on production lines. In many cases the large amount of data acquired in the spectral image is appropriately reduced (by chemometric processing) so as to select only those wavelengths that are interesting for the intended purpose. Once the spectral bands of interest have been identified, a multispectral system using only the selected wavelengths can be engineered for industrial application. Another negative aspect is that spectral image analysis is an indirect method, to which appropriate chemometric techniques and a data-transfer procedure must be applied.
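One way to implement the wavelength reduction described above is to look at which spectral channels dominate the leading principal components and retain only those bands for an engineered multispectral system. A hedged sketch of this idea on synthetic data (the channel indices, noise level and use of plain PCA are all illustrative; real pipelines would rely on validated chemometric methods):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 200 pixel spectra with 50 wavelength channels; channels
# 10 and 30 carry a common, correlated signal, the rest is noise.
signal = rng.random(200)
X = rng.normal(0.0, 0.02, size=(200, 50))
X[:, 10] += signal
X[:, 30] += 0.8 * signal

# PCA via SVD on the mean-centered matrix; rows of vt are the loadings.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)

# Channels with the largest absolute loading on the first principal
# component are candidates for a reduced multispectral system.
loadings = np.abs(vt[0])
selected = np.argsort(loadings)[::-1][:2]
print(sorted(selected.tolist()))  # the informative channels emerge
```

In practice the number of retained bands and the selection criterion would be validated against a calibration set, but the principle — replacing the full hypercube by a handful of informative wavelengths — is the same.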

Spectral imaging is not well suited to liquid, homogeneous samples. Its value becomes evident when it is applied to heterogeneous samples, and many foods are excellent heterogeneous matrices. Despite the novelty of applying HSI in the food sector, many studies are already present in the literature.

Traditional image analysis, based on computer vision, has seen strong development in the food sector with the aim of replacing the human eye, saving costs and improving efficiency, speed and accuracy. However, computer vision technology is not able to distinguish between objects of similar colour, to make complex classifications, to predict quality characteristics (e.g. chemical composition) or to detect internal defects. Since the quality of a food is not a single attribute but comprises a number of inherent characteristics of the food itself, measuring the optical properties of food products has been one of the most studied non-destructive approaches for the simultaneous detection of different quality parameters. The light reflected from a food contains information about constituents near and at its surface. Near-infrared spectroscopy (NIRS) is rapid, non-destructive, and easy to apply both on-line and off-line. With this technology it is possible to obtain spectroscopic information about the components of the analysed sample, but not about where those components are located.

Appearance characteristics alone (colour, shape, etc.), however, are easily detectable with conventional image analysis. The combination of image-analysis technology and spectroscopy is chemical imaging spectroscopy, which provides spatial and spectral information for each pixel of the foodstuff and thus makes it possible to locate each chemical component in the scanned image. Table 1 summarizes the main differences between the three analytical technologies: imaging, spectroscopy and hyperspectral imaging.

#### **2.3 Electronic nose (e-nose)**

The term "electronic nose" was coined in 1988 by Gardner and Bartlett, who defined it as "an instrument which comprises an array of electronic chemical sensors with partial specificity and an appropriate pattern recognition system, capable of recognizing simple or complex odours" (Gardner & Bartlett, 1994).

Chemometrics in Food Technology 231

Below, ten possible MOS sensors for detecting specific molecules (Table 2):

| Sensor | Molecules detected |
|--------|--------------------|
| W1C | Aromatic compounds |
| W5S | Oxides of nitrogen, low specificity |
| W3C | Ammonium compounds, aromatic compounds |
| W6S | Hydrogen |
| W5C | Alkanes, aromatic compounds, less polar compounds |
| W1S | Methane, low specificity |
| W1W | Sulfur compounds, terpenes, limonene, pyrazines |
| W2S | Alcohol, partially aromatic compounds, low specificity |
| W2W | Aromatic compounds, organic sulfur compounds |
| W3S | Methane |

Table 2. Examples of sensors in the electronic nose, with the categories of compounds they can detect.

Scientific interest in the use of electronic noses was first formalised at a workshop on chemosensory information processing, during a North Atlantic Treaty Organization (NATO) session entirely dedicated to the topic of artificial olfaction. Since 1991, interest in biological sensor technology has grown considerably, as is evident from numerous scientific articles. Moreover, commercial efforts to improve sensor technologies and to develop tools of greater sophistication and improved capability, with diverse sensitivities, are ever-expanding (Wilson & Baietto, 2009).

Electronic noses are emerging as an innovative analytical-sensorial tool for characterising food in terms of freshness, geographical origin and seasoning. The first electronic nose dates back to the 1980s, when Persaud and Dodd of the University of Warwick (UK) tried to model and simulate the operation of the mammalian olfactory system with solid-state sensors. Since then, artificial olfactory systems have been designed ever closer to the natural one.

The electronic nose is a technology that tends to replace or complement the human olfactory system. The tool does not analyse the chemical composition of the volatile fraction; rather, it identifies the olfactory fingerprint.

Currently, these electronic devices have a complex architecture that attempts to reproduce the functioning of the mammalian olfactory system. The tool is a biomimetic system designed to mimic the olfactory systems found in nature, specifically the human olfactory system. Typically, an electronic nose collects information through an array of sensors able to respond selectively and reversibly to the presence of chemicals, generating electrical signals as a function of their concentration. Currently, the sensors that have reached the highest level of development are made from metal oxide semiconductors (MOS). These sensors are usually characterised by fast response, low energy consumption, small size, high sensitivity, reliability, stability and reproducibility. In addition to metal oxide semiconductors, sensors can be made of metal-oxide-semiconductor field-effect transistors (MOSFETs) or conductive polymers. MOS sensors are inorganic, typically made of tin oxide, zinc oxide, titanium oxide or tungsten oxide; the absorption of gas changes their conductivity. These sensors operate at high temperatures, between 200 and 500 °C, and are relatively cheap. Figure 7 shows the main parts of a typical sensor.

| Features | Imaging | Spectroscopy | Hyperspectral imaging |
|----------|---------|--------------|-----------------------|
| Spatial information | √ | x | √ |
| Spectral information | x | √ | √ |
| Multi-constituent information | x | √ | √ |
| Building chemical images | x | x | √ |
| Flexibility of spectral information extraction | x | x | √ |

Table 1. Main differences among imaging, spectroscopy and hyperspectral imaging techniques (ElMasry & Sun, 2010).

Fig. 7. The main parts of a typical sensor (Deisingh et al. 2004)


The information is initially encoded as electrical signals, which are immediately captured and digitized so that they can be processed numerically by a computer system. In practice, an odorant is described by the electronic nose, on the basis of the responses of the individual sensors, as a point or a region of a multidimensional space.

Thanks to special algorithms from the discipline of pattern recognition, the system is able to build an olfactory map that allows qualitative and quantitative analysis, discriminating a foodstuff simply by its olfactory fingerprint.

The architecture of an electronic nose (Fig. 8) depends significantly on the application for which it is designed. In general, an electronic nose is characterized by a suction system, an array of gas sensors, a subsystem for acquisition and digitization, and a processing subsystem able to implement appropriate algorithms for classification or regression.



Fig. 8. A generalized structure of an electronic nose. (Deisingh et al. 2004)

The working principle of the electronic nose is distinctly different from that of commonly used analytical instruments (e.g. the gas chromatograph). The e-nose gives an overall assessment of the volatile fraction of the foodstuff, which is largely responsible for the perceived aroma of the investigated sample, without the need to separate and identify the individual components. Together, the sensor responses of the electronic nose create a "map" of non-specific signals that constitutes the profile of the food product, also called its olfactory fingerprint.

The goal is to find a relationship between the set of independent variables produced by the sensors and the set of dependent variables characterizing the sample. Chemometric software for processing the data sets, in the environmental and food sectors, allows the data to be analyzed by multivariate methods such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), PLS (Partial Least Squares) and DFA (Discriminant Function Analysis). For example, Principal Component Analysis (PCA) is a method for detecting patterns in data sets and expressing them so as to highlight their similarities and/or differences. Examples of electronic nose applications in the food sector include: monitoring foodstuff shelf life, checking certified quality or the DOP trademark, microbiological testing, checking the controlled atmosphere in packaging, monitoring the fermentation stage, identifying components of the packaging transferred into the product, and verifying the state of cleaning of kegs (on-line measurements).
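The idea of an odorant as a point in a multidimensional sensor space can be illustrated with a toy classifier. This is only a sketch: the sensor readings are invented, and a simple nearest-centroid rule stands in for the more sophisticated pattern-recognition methods listed above:

```python
import math

# Hypothetical 4-sensor e-nose responses: each odour is a point in R^4.
# "fresh" and "spoiled" fingerprints from invented training runs.
training = {
    "fresh":   [[0.9, 0.2, 0.1, 0.4], [1.0, 0.3, 0.1, 0.5]],
    "spoiled": [[0.3, 0.9, 0.8, 0.2], [0.2, 1.0, 0.7, 0.1]],
}

def centroid(points):
    """Mean vector of a list of equal-length response vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(sample, training):
    """Assign the sample to the class with the nearest centroid."""
    best, best_dist = None, float("inf")
    for label, points in training.items():
        d = math.dist(sample, centroid(points))
        if d < best_dist:
            best, best_dist = label, d
    return best

print(classify([0.95, 0.25, 0.1, 0.45], training))  # prints "fresh"
```

A real application would use many more sensors and samples, and a method such as LDA or PLS-DA instead of nearest centroids.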

#### **2.4 Chemometrics in food sector**

Chemometrics is an essential part of NIR and Vis/NIR spectroscopy in the food sector. NIR and Vis/NIR instrumentation must always be complemented with chemometric analysis in order to extract the useful information present in the spectra, separating it both from information that is not relevant to the problem and from spectral noise. The chemometric techniques most used are principal component analysis (PCA), as a qualitative data-analysis technique, and PLS regression, as a technique for obtaining quantitative predictions of the parameters of interest (Naes et al., 2002; Wold et al., 2001; Nicolai et al., 2007; Cen & He, 2007).
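As an illustration, a PCA decomposition of a small spectral matrix needs nothing more than an SVD of the mean-centred data. The numbers below are synthetic and not taken from any study cited here:

```python
import numpy as np

# Synthetic "spectra": 20 samples x 50 wavelengths
rng = np.random.default_rng(1)
X = rng.random((20, 50))

# PCA: centre each wavelength, then take the SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                     # sample coordinates on the components
loadings = Vt                      # one row per component, per wavelength
explained = s**2 / np.sum(s**2)    # fraction of variance per component

# Components come out ordered by decreasing explained variance
assert np.all(np.diff(explained) <= 0)
```

Plotting the first two columns of `scores` gives the familiar PCA score plot used for qualitative exploration of spectral data.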


The developed models should be tested using independent samples as validation sets to verify model accuracy and robustness. To evaluate model accuracy, the statistics used are the coefficient of correlation in calibration (rcal), the coefficient of correlation in prediction (rpred), the root mean square error of calibration (RMSEC) and the root mean square error of prediction (RMSEP).

Correlation coefficients (rcal and rpred):

$$r_{\text{cal}} \text{ or } r_{\text{pred}} = \sqrt{1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \overline{y} \right)^2}} \tag{2}$$

where yi are the reference values, ŷi are the values predicted by the PLS model, and ȳ is the average reference value.

Standard errors of calibration and prediction (RMSEC and RMSEP):

$$\text{RMSEC or RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n}} \tag{3}$$

where n is the number of validated objects, and ŷi and yi are the predicted and measured values of the ith observation in the calibration or validation set, respectively. This value gives the average uncertainty that can be expected for predictions of future samples. The optimum calibration should be selected by minimizing the RMSEP. Percent errors (RMSEC% and RMSEP%) can also be calculated, e.g. RMSEC (%) = 100 × RMSEC / (average reference value of the parameter).

The prediction capacity of a model can be evaluated with the ratio of performance to deviation (RPD) (Williams & Sobering, 1996). The RPD is defined as the ratio of the standard deviation of the response variable to the RMSEP; an RPD value > 2.5 means that the model has good prediction accuracy.
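Eqs. (2) and (3) and the RPD can be computed directly from reference and predicted values. The following sketch mirrors those definitions; the function name and the example numbers are ours, chosen only for illustration:

```python
import math

def prediction_stats(y, y_hat):
    """r (Eq. 2), RMSE (Eq. 3) and RPD for reference values y
    and model predictions y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r = math.sqrt(1 - ss_res / ss_tot)      # Eq. (2)
    rmse = math.sqrt(ss_res / n)            # Eq. (3): RMSEC or RMSEP
    sd = math.sqrt(ss_tot / (n - 1))        # SD of the response variable
    return r, rmse, sd / rmse               # RPD = SD / RMSEP

# Invented example: six reference vs. predicted values
r, rmsep, rpd = prediction_stats(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    [10.2, 10.9, 12.1, 12.8, 14.2, 14.9],
)
```

With these numbers the predictions track the references closely, so r is near 1 and the RPD is well above the 2.5 threshold quoted above.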

#### **3. Applications**

#### **3.1 NIR and Vis/NIR spectroscopy**

During the last 50 years, there has been great emphasis on the quality and safety of food products, on the production processes, and on the relationship between the two (Burns and Ciurczak, 2001).

Near-infrared (NIR) spectroscopy has proved to be one of the most efficient and advanced tools for monitoring and controlling process and product quality in the food industry, and a great deal of work has been done in this area. This review focuses on the use of NIR spectroscopy for the analysis of foods such as meat, fruit, grain, dairy products, oil, honey and wine, and looks at the literature published in the last 10 years.



#### **3.1.1 Fruit and vegetables**

Water is the most important chemical constituent of fruits and vegetables, and water strongly absorbs NIR radiation, so the NIR spectrum of such materials is dominated by water. Further, the NIR spectrum is essentially composed of a large set of overtones and combination bands. This, in combination with the complex chemical composition of a typical fruit or vegetable, makes the NIR spectrum highly convoluted. Multivariate statistical techniques, such as partial least squares (PLS) regression and principal component analysis (PCA), are therefore applied to extract the information about quality attributes that is buried in such convoluted spectra (Cozzolino et al., 2006b; McClure, 2003; Naes et al., 2004; Nicolai et al., 2007).

The availability of low-cost miniaturised spectrophotometers has opened up the possibility of portable devices which can be used directly in the field for monitoring the maturity of fruit.

Guidetti et al. (2008) tested a portable Vis/NIR device (450-980 nm) for the prediction of ripening indexes (soluble solids content and firmness) and of compounds with functional properties (total anthocyanins, total flavonoids, total polyphenols and ascorbic acid) of blueberries ('Brigitta' and 'Duke' varieties). Good predictive statistics were obtained, with correlation coefficients (r) between 0.80 and 0.92 for the regression models built for fresh berries (Table 3). Similar results were obtained for the regression models for homogenized samples, with r > 0.8 for all the indexes. The results showed that Vis/NIR spectroscopy is an interesting and rapid tool for assessing blueberry ripeness.

| Dependent variable | LV | rcal | RMSEC | rcv | RMSECV |
|--------------------|----|------|-------|-----|--------|
| TSS (°Brix) | 4 | 0.86 | 0.78 | 0.85 | 0.79 |
| Young's modulus (MPa) | 3 | 0.87 | 0.65 | 0.87 | 0.66 |
| Total anthocyanins (mg/g f.w.) | 4 | 0.87 | 0.31 | 0.87 | 0.31 |
| Total flavonoids | 11 | 0.82 | 0.20 | 0.81 | 0.20 |
| Total polyphenols (mg cat/g f.w.) | 4 | 0.87 | 0.37 | 0.86 | 0.37 |
| Ascorbic acid (mg/100 g f.w.) | 4 | 0.84 | 1.01 | 0.83 | 1.02 |

Table 3. Results of PLS models for fresh 'Duke' berry samples (r = coefficient of correlation; RMSEC = root mean square error of calibration; RMSECV = root mean square error of cross-validation; LV = latent variables). All data were preprocessed by second derivative of reduced and smoothed data.

#### **3.1.2 Meat**

In the literature there are numerous applications of NIR spectroscopy for the analysis of meat quality. One of the most important aims is to monitor the freshness of meat products. Sinelli et al. in 2010 investigated the ability of near-infrared spectroscopy to follow meat freshness decay. PCA applied by the authors to the data was able to discriminate samples on the basis of storage time and temperature. The modelling of PC scores versus time allowed the time of initial freshness decay to be established for the samples (6–7 days at 4.3 °C, 2–3 days at 8.1 °C and less than 1 day at 15.5 °C). The authors reported that the results showed the feasibility of NIR for estimating the quality decay of fresh minced beef during marketing.

Sierra et al. in 2007 conducted a study on the rapid prediction of the fatty acid (FA) profile of ground beef using near-infrared transmittance spectroscopy (NIT). The samples were scanned in transmittance mode from 850 to 1050 nm. NIT spectra were able to accurately predict saturated (R² = 0.837), branched (R² = 0.701) and monounsaturated (R² = 0.852) FAs. The results were considered interesting because intramuscular fat content and composition influence consumer selection of meat products.

Andrés et al. in 2007 implemented a study to evaluate the potential of visible and near-infrared reflectance (Vis/NIR) spectroscopy to predict sensory characteristics related to the eating quality of lamb meat samples. A total of 232 muscle samples from Texel and Scottish Blackface lambs were analyzed by chemical procedures and scored by assessors in a taste panel, and these parameters were predicted from Vis/NIR spectra. The results obtained by the authors suggested that the most important regions of the spectra for estimating the sensory characteristics are related to the absorbance of intramuscular fat and water content in meat samples.

On-line applications of NIR spectroscopy have also been attempted in the meat industry. A study was conducted by Prieto et al. in 2009a to assess the on-line implementation of visible and near-infrared reflectance (Vis/NIR) spectroscopy as an early predictor of beef quality traits, by direct application of a fibre-optic probe to the muscle immediately after exposing the meat surface in the abattoir. The authors reported good correlation results only for the prediction of colour parameters, while poorer results were achieved for sensory parameters.

NIR spectroscopy could also be used for the detection of beef contamination by harmful pathogens and the protection of consumer safety. Amamcharla et al. in 2010 investigated the potential of Fourier-transform infrared spectroscopy (FTIR) to discriminate Salmonella-contaminated packed beef. Principal component analysis was performed on the entire spectrum (4000–500 cm⁻¹). The authors obtained encouraging classification results with different techniques and confirmed that NIR could be used for the non-destructive discrimination of Salmonella-contaminated packed beef samples from uncontaminated ones.

A review published by Prieto et al. in 2009b indicates that NIR has shown high potential to predict chemical meat properties and to categorize meat into quality classes. However, the authors also underlined that in several cases NIR showed limited ability for estimating technological and sensory attributes, which may be mainly due to the heterogeneity of the meat samples and their preparation, the low precision of the reference methods, and the subjectivity of assessors in taste panels.


An optical, portable, experimental system (a Vis/NIR spectrophotometer) for non-destructive and quick prediction of ripening parameters of fresh berries and homogenized samples of grapes in the wavelength range 450-980 nm was built and tested by Guidetti et al. (2010) (Fig. 9). Calibrations for technological ripening and for anthocyanins had good correlation coefficients (rcv > 0.90). These models were extensively validated using independent sample sets. Good statistical parameters were obtained for soluble solids content (r > 0.8, SEP < 1.24 °Brix) and for titratable acidity (r > 0.8, SEP < 2.00 g tartaric acid L⁻¹), showing the validity of the Vis/NIR spectrometer. Similarly, anthocyanins could be predicted accurately compared with the reference determination (Table 4). Finally, for qualitative analysis, the spectral data on grapes were divided into two groups on the basis of the grapes' soluble solids content and acidity in order to apply a classification analysis (PLS-DA). Good results were obtained with the Vis/NIR device, with 89% of samples correctly classified for soluble solids content and 83% correctly classified for acidity. The results indicate that the Vis/NIR portable device could be an interesting and rapid tool for assessing grape ripeness directly in the field or upon receiving grapes in the wine industry.

Fig. 9. Images of spectral acquisition phases on fresh berries and on homogenized samples.

| Parameter | Pretreatment[a] | LV | rcal | RMSEC | rpred | RMSEP |
|-----------|-----------------|----|------|-------|-------|-------|
| TSS (°Brix) | MSC+d2 | 5 | 0.93 | 0.95 | 0.75 | 0.95 |
| pH | MSC+d2 | 5 | 0.85 | 0.08 | 0.80 | 0.13 |
| Titratable acidity (g tart. acid dm⁻³) | MSC+d2 | 6 | 0.95 | 1.16 | 0.85 | 1.12 |
| PA (mg dm⁻³) | MSC+d2 | 5 | 0.95 | 80.90 | 0.78 | 129.00 |
| EA (mg dm⁻³) | MSC+d2 | 3 | 0.93 | 57.70 | 0.84 | 77.70 |
| TP (OD 280 nm) | MSC+d2 | 4 | 0.80 | 3.74 | 0.70 | 5.81 |

[a] MSC = multiplicative scatter correction, and d2 = second derivative.

Table 4. Results of PLS models for homogenized samples.

#### **3.1.3 Grains, bread and pasta**

Grains, including wheat, rice and corn, are the main agricultural products in most countries. Grain quality is an important parameter not only for harvesting, but also for shipping (Burns and Ciurczak, 2001). In many countries, the price of grain is determined by its protein content, starch content, and/or hardness, often with substantial price increments between grades.

Measurement of the carotenoid content of maize by Vis/NIR spectroscopy was investigated by Brenna and Berardo (2004). They generated calibrations for several individual carotenoids and the total carotenoid content with good results (R2 of about 0.9).

Several applications can be found in the literature regarding the use of NIR for the prediction of the main physical and rheological parameters of pasta and bread. De Temmerman et al. in 2007 proposed near-infrared (NIR) reflectance spectroscopy for the in-line determination of moisture concentration in semolina pasta immediately after the extrusion process. Several pasta samples with different moisture concentrations were extruded while reflectance spectra between 308 and 1704 nm were measured. An adequate prediction model was developed based on the partial least squares (PLS) method using leave-one-out cross-validation. Good results were obtained, with R2 = 0.956 and a very low RMSECV. This creates opportunities for measuring the moisture content with a low-cost sensor.

Zardetto & Dalla Rosa in 2006 evaluated the chemical and physical characteristics of fresh egg pasta samples obtained using two different production methodologies: extrusion and lamination. The authors showed that it is possible to discriminate the two kinds of products by using FT-NIR spectroscopy. The FT-NIR results suggest the presence of a different matrix–water association, a different level of starch gelatinization and a distinct starch–gluten interaction in the two kinds of pasteurised samples.

The feasibility of using near infrared spectroscopy for prediction of nutrients in a wide range of bread varieties mainly produced from wheat and rye was investigated by Sørensen in 2009. Very good results were reported for the prediction of total contents of carbohydrates and energy from NIR data with R2 values of 0.98 and 0.99 respectively.

Finally, a quick, non-destructive method based on Fourier transform near-infrared (FT-NIR) spectroscopy for the determination of the egg content of dry pasta was presented by Fodor et al. (2011), with good results.

#### **3.1.4 Wine**

Quantification of phenolic compounds in wine and during key stages in wine production is an important quality control goal for the industry, and several reports describing the application of NIR spectroscopy to this problem have been published.

Grape composition at harvest is one of the most important factors determining the future quality of wine. Measurement of grape characteristics that impact product quality is a requirement for vineyard improvement and for optimum production of wines (Carrara et al., 2008). Inspection of grapes upon arrival at the winery is a critical point in the wine production chain (Elbatawi & Ebaid, 2006).




The application of some chemometric techniques directly to NIR spectral data with the aim of following the progress of conventional fermentation and maturation was investigated by Cozzolino et al. (2006b). The application of principal components analysis (PCA) allowed similar spectral changes in all samples to be followed over time. The PCA loading structure could be explained on the basis of absorptions from anthocyanins, tannins, phenolics, sugar and ethanol, the content of which changed according to different fermentation time points. This study demonstrated the possibility of applying NIR spectroscopy as a process analytical tool for the wine industry.
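The idea of following fermentation with PCA can be illustrated on simulated spectra in which a "sugar" band decays while an "ethanol" band grows over time; the first principal component then tracks fermentation progress. Band positions and the time course below are invented for the sketch, not Cozzolino et al.'s data.

```python
import numpy as np
from sklearn.decomposition import PCA

# simulated NIR spectra at successive fermentation time points:
# a "sugar" band decays while an "ethanol" band grows
rng = np.random.default_rng(2)
wl = np.linspace(1100, 2300, 120)                  # wavelength axis, nm
sugar = np.exp(-((wl - 2100) ** 2) / 4e3)
ethanol = np.exp(-((wl - 1690) ** 2) / 4e3)
days = np.arange(10)
X = np.array([(10 - d) * sugar + d * ethanol for d in days])
X += rng.normal(0, 0.05, X.shape)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)      # one score pair per time point
loadings = pca.components_         # wavelengths driving each PC

# PC 1 scores change almost monotonically with fermentation time,
# and the PC 1 loading contrasts the sugar and ethanol bands
print(np.round(scores[:, 0], 2))
```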

Urbano-Cuadrado et al. (2004) analysed by Vis/NIR spectroscopy different parameters commonly monitored in wineries. The coefficients of determination obtained for the fifteen parameters were higher than 0.80, and in most cases higher than 0.90, while SECV values were close to those of the reference methods. The authors stated that these prediction accuracies were sufficient for screening purposes.

Römisch et al. in 2009 presented a study on the characterization and determination of the geographical origin of wines. In this paper, three methods for the discrimination and classification of multivariate data were considered and tested: classification and regression trees (CART), regularized discriminant analysis (RDA) and partial least squares discriminant analysis (PLS-DA). PLS-DA showed the best classification results, with percentages of correctly classified samples from 88 to 100%.

Finally, PLS and artificial neural network (ANN) techniques were compared by Janik et al. in 2007 for the prediction of total anthocyanin content in red grape homogenates.

#### **3.1.5 Other applications**

Applications of Vis/NIR and NIR spectroscopy and the associated chemometric techniques are present in many other sectors of the food industry. Works relating to the dairy, oil, coffee, honey and chocolate industries are reported in the literature. In particular, interesting studies have been conducted on the application of NIR spectroscopy in detecting the geographical origin of raw materials and finished products, defending the protected designation of origin (PDO).

Olivieri et al. (2011) explored three different class-modelling techniques to evaluate classification abilities based on the geographical origin of two PDO food products: olive oil from Liguria and honey from Corsica. The authors developed the best models for both Ligurian olive oil and Corsican honey with a potential function technique (POTFUN), with around 83% of samples correctly classified.

González-Martín et al. in 2011 presented a work on the evaluation by near-infrared reflectance (NIR) spectroscopy of different sensorial attributes of different types of cheese, taking as reference data the evaluation of the sensorial properties obtained by a panel of eight trained experts. NIR spectra were collected with a remote reflectance fibre-optic probe applied directly to the cheese samples, and the calibration equations were developed by using modified partial least squares (MPLS) regression for 50 samples of cheese. The authors stated that the results obtained can be considered good and acceptable for all the parameters analyzed (presence of holes, hardness, chewiness, creamy, salty, buttery flavour, rancid flavour, pungency and retronasal sensation).


The quality of coffee is related to the chemical constituents of the roasted beans, whose composition depends on the composition of the green (i.e., un-roasted) beans. Unroasted coffee beans contain different chemical compounds, which react amongst themselves during roasting, influencing the final product. For this reason, monitoring the raw materials and the roasting process is very important. Ribeiro et al. in 2011 elaborated PLS models correlating coffee beverage sensory data and NIR spectra of 51 Arabica roasted coffee samples. Acidity, bitterness, flavour, cleanliness, body and overall quality of the coffee beverage were considered. Results were good, and the authors confirmed that it is possible to estimate the quality of coffee using PLS regression models obtained from NIR spectra of roasted Arabica coffees.

Da Costa Filho in 2009 developed a rapid method to determine sucrose in chocolate mass using near-infrared spectroscopy. Data were modelled using partial least squares (PLS) and multiple linear regression (MLR), achieving good results (correlation coefficients of 0.998 and 0.997 respectively for the two chemometric techniques). Results showed that NIR can be used as a rapid method to determine sucrose in chocolate mass in chocolate factories.

#### **3.2 Image analysis**

Chemical imaging spectroscopy is applied in various fields, from astronomy to agriculture (Baranowski et al., 2008; Monteiro et al., 2007; V. Smail, 2006) and from the pharmaceutical industry (Lyon et al., 2002; Roggo et al., 2005) to medicine (Ferris et al., 2001; Zheng et al., 2004). In recent years, it has also found use for quality control and safety in food (Gowen et al., 2007b).

In general, classifying samples or quantifying the presence of compounds in a sample is the main purpose of hyperspectral analysis applications in food. Algorithms for classification and regression already exist, but improving algorithm efficiency remains a target, as do building datasets, identifying anomalies or objects with different spectral characteristics, and comparing hyperspectral images with those in a data library. These goals can be achieved only if the experimental data are processed with chemometric methods. K-nearest neighbours and hierarchical clustering are examples of multivariate analyses that allow information to be extracted from spectral and spatial data (Burger & Gowen, 2011). With spectral imaging, which yields in a single determination both the spectral and the spatial information characterizing the sample, it is possible to identify which chemical species are present and how they are distributed in a matrix. Several chemometric techniques are available for developing regression models (for example partial least squares regression, principal components regression and linear regression) capable of estimating the concentrations of constituents in a sample at the pixel level, allowing the spatial distribution, or mapping, of a particular component in the analyzed sample. Moreover, hyperspectral imaging combined with chemometric techniques is a powerful method to identify key wavelengths for the development of multispectral systems for on-line applications.
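The pixel-level processing described above rests on unfolding the (rows × columns × bands) hypercube into a pixels × bands matrix, applying an ordinary chemometric model, and folding the result back into an image. Below is a minimal Python/scikit-learn sketch with PCA on a simulated hypercube; the image size, band count and analyte spectrum are invented for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# a hyperspectral image is a (rows, cols, bands) hypercube
rng = np.random.default_rng(4)
rows, cols, bands = 40, 40, 60
wl = np.linspace(400, 1000, bands)                 # wavelength axis, nm
analyte = np.exp(-((wl - 680) ** 2) / 3e3)         # spectrum of a compound

conc_map = np.zeros((rows, cols))
conc_map[10:30, 10:30] = 1.0                       # region containing it
cube = conc_map[..., None] * analyte + rng.normal(0, 0.05, (rows, cols, bands))

Xu = cube.reshape(-1, bands)                       # unfold: pixels x bands
scores = PCA(n_components=2).fit_transform(Xu)
score_img = scores[:, 0].reshape(rows, cols)       # fold PC 1 back to an image

# PC 1 separates analyte-rich pixels from the background,
# mapping the spatial distribution of the compound
inside = np.abs(score_img[10:30, 10:30]).mean()
outside = np.abs(score_img[:10, :]).mean()
print(inside > outside)
```

The same unfold/model/fold pattern applies to pixel-wise regression (e.g. PLS) when a concentration map rather than a score map is wanted.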

Karoui & De Baerdemaeker (2006) wrote a review about the analytical methods coupled with chemometric tools for the determination of the quality and identity of dairy products. Spectroscopic techniques (NIR, MIR, front-face fluorescence spectroscopy (FFFS), etc.) coupled with chemometric tools have many potential advantages for the evaluation of the identity of dairy products (milk, ice cream, yogurt, butter, cheese, etc.).


Other applications concern the possibility of estimating a correlation between physical or chemical characteristics of the food and the spectra acquired with spectroscopic imaging. Moreover, these techniques are able to locate and quantify the characteristic of interest within the image.

In most cases the wavelength range used in hyperspectral imaging applications is 400-1000 nm, but Maftoonazad et al. (2010) used artificial neural network (ANN) modeling of hyperspectral radiometric (350-2500 nm) data for quality changes associated with avocados during storage. Respiration rate, total color difference, texture and weight loss of samples were measured as conventional quality parameters during storage, while hyperspectral imaging was used to evaluate the spectral properties of the avocados. Results indicated that ANN models can predict the quality changes in avocado fruits better than conventional regression models. Mahesh et al. (2011) used near-infrared hyperspectral images (wavelength range: 960–1700 nm) of bulk samples to classify moisture levels (12, 14, 16, 18 and 20%) in wheat. Principal component analysis (PCA) was used to identify the region (1260–1360 nm) carrying the most information. Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) could classify the samples based on moisture content, also identifying specific moisture levels with a good level of accuracy (61-100% in several cases). Spectral features at key wavelengths of 1060, 1090, 1340 and 1450 nm ranked at the top in classifying wheat classes with different moisture contents.

Manley et al. (2011) used near-infrared hyperspectral imaging combined with chemometric techniques for tracking the diffusion of conditioning water in single wheat kernels of different hardnesses. NIR analysers are a common, non-destructive, non-contact and fast solution for quality control, and a tool used to detect the moisture content of carrot samples during storage; Firtha (2009) used a hyperspectral system that is able to detect its spatial distribution.

In another review, Sankaran et al. (2010) compared the benefits and limitations of advanced techniques and multivariate methods to detect plant diseases, in order to assist in monitoring plant health under field conditions. These technologies include volatile profiling (electronic nose), spectroscopy (fluorescence, visible and infrared) and imaging (fluorescence and hyperspectral) techniques for disease detection.

Several examples of applications of spectroscopic image analysis can be found in the literature. Hyperspectral imaging could be used at critical control points of food processing to inspect for potential contaminants, defects or lesions, whose absence is essential for ensuring food quality and safety. In some cases on-line application was achieved.

Ariana & Lu (2010) evaluated the internal defects and surface color of whole pickles during commercial pickle processing. They used a prototype on-line hyperspectral imaging system operating in the wavelength range of 400–1000 nm. The color of the pickles was modeled using tristimulus values: there were no differences in chroma and hue angle between good and defective pickles. PCA was applied to the hyperspectral images: transmittance images at 675–1000 nm were much more effective for internal defect detection than reflectance images in the visible region of 500–675 nm. Defect classification accuracy was 86%, compared with 70% by the human inspectors.

Mehl et al. (2002) used hyperspectral image analysis and PCA, as a chemometric technique, to reduce the information resulting from HSI and to identify three spectral bands capable of separating normal from contaminated apples. These spectral bands were implemented in a multispectral imaging system. On 153 samples, good separation between normal and contaminated (scabs, fungal and soil contamination, bruises) apples was obtained for Gala (95%) and Golden Delicious (85%), while separation was limited for Red Delicious (76%).

HSI application for damage detection on the caps of white mushrooms (Agaricus bisporus) was investigated by Gowen et al. (2007a). They employed a pushbroom line-scanning HSI instrument (wavelength range: 400–1000 nm) and investigated two data-reduction methods. In the first method, PCA was applied to the hypercube of each sample, and the second PC (PC 2) scores image was used for identification of bruise-damaged regions on the mushroom surface. In the second method, PCA was applied to a dataset comprising average spectra from regions of normal and bruise-damaged tissue. The second method performed better than the first when applied to a set of independent mushroom samples. Further, the same group (Gowen et al., 2009) identified mushrooms subjected to freeze damage using hyperspectral imaging. In this case they used the Standard Normal Variate (SNV) transformation to pretreat the data, then applied a procedure based on PCA and LDA to classify the spectra of mushrooms into undamaged and freeze-damaged groups. Undamaged and freeze-damaged mushrooms could be classified with high accuracy (>95% correct classification) after only 45 min of thawing (at 23 ± 2 °C), a time at which freeze–thaw damage was not yet visibly evident.
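The SNV pretreatment used by Gowen et al. (2009) normalizes each spectrum individually to zero mean and unit standard deviation, removing additive offsets and multiplicative scatter effects. A minimal sketch (the toy "spectrum" below is invented for the illustration):

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre each spectrum to zero mean and
    scale it to unit standard deviation, row by row."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

s = np.sin(np.linspace(0, 3, 100))     # a toy "spectrum" shape
# the same shape under different additive/multiplicative scatter (a*s + b)
X = np.array([a * s + b for a, b in [(1.0, 0.0), (1.7, 0.3), (0.6, -0.2)]])

Z = snv(X)
# after SNV the scatter-affected copies collapse onto the same curve
print(np.allclose(Z[0], Z[1]), np.allclose(Z[0], Z[2]))
```

Because SNV works one spectrum at a time, it is well suited to pixel spectra in hyperspectral images, where no common reference spectrum (as required by MSC) is available.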

A study on fruits and vegetables (Cubero et al., 2010) used ultraviolet or near-infrared spectra to explore defects or features that the human eye is unable to see, with the aim of applying them to automatic inspection. This work presents a summary of inspection systems for fruit and vegetables and the latest developments in the application of this technology to the inspection of the internal and external quality of fruits and vegetables.

In another review Sankaran et al. (2010), compared the benefits and limitations of advanced techniques and multivariate methods to detect plant diseases in order to assist in monitoring health in plants under field conditions. These technologies include evaluation of volatile profiling (Electronic Nose), spectroscopy (fluorescence, visible and infrared) and imaging

In literature it's possible find several examples of applications of spectroscopic image analysis. Hyperspectral imaging could be used as critical control points of food processing to inspect for potential contaminants, defects or lesions. Their absence is essential for

Ariana & Lu (2010) evaluated the internal defects and surface color of whole pickles in a commercial pickle processing plant. They used a prototype on-line hyperspectral imaging system operating in the wavelength range of 400–1000 nm. The color of the pickles was modeled using tristimulus values: there were no differences in chroma and hue angle between good and defective pickles. PCA was applied to the hyperspectral images: transmittance images at 675–1000 nm were much more effective for internal defect detection than reflectance images in the visible region of 500–675 nm. A defect classification accuracy of 86% was achieved, compared with 70% by the human inspectors.

Mehl et al. (2002) used hyperspectral image analysis and PCA, as a chemometric technique, to reduce the information resulting from HSI and to identify three spectral bands capable of separating normal from contaminated apples. These spectral bands were implemented in a multispectral imaging system. On 153 samples, a good separation between normal and contaminated (scabs, fungal and soil contaminations, and bruises) apples was obtained for Gala (95%) and Golden Delicious (85%), while separations were limited for Red Delicious (76%).

An HSI application for damage detection on the caps of white mushrooms (*Agaricus bisporus*) was investigated by Gowen et al. (2007a). They employed a pushbroom line-scanning HSI instrument (wavelength range: 400–1000 nm) and investigated two data reduction methods. In the first method, PCA was applied to the hypercube of each sample, and the second principal component (PC 2) scores image was used to identify bruise-damaged regions on the mushroom surface. In the second method, PCA was applied to a dataset comprising average spectra from normal and bruise-damaged tissue regions. The second method performed better than the first when applied to a set of independent mushroom samples. Later, the same group (Gowen et al., 2009) identified mushrooms subjected to freeze damage using hyperspectral imaging. In this case they used the Standard Normal Variate (SNV) transformation to pretreat the data, then applied a procedure based on PCA and LDA to classify mushroom spectra into undamaged and freeze-damaged groups. Undamaged and freeze-damaged mushrooms could be classified with high accuracy (>95% correct classification) after only 45 min of thawing (at 23 ± 2 °C), at a time when freeze–thaw damage was not yet visibly evident.

A study on fruits and vegetables (Cubero et al., 2010) used ultraviolet or near-infrared spectra to explore defects or features that the human eye is unable to see, with the aim of applying them for automatic inspection, ensuring food quality and safety; in some cases on-line application was achieved. This work presents a summary of inspection systems for fruit and vegetables, including (fluorescence and hyperspectral) techniques for disease detection, and the latest developments in the application of this technology to the inspection of the internal and external quality of fruits and vegetables.
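Several of the studies above follow the same computational pattern: pretreat the pixel spectra (e.g. with SNV) and unfold the hypercube so that principal-component score images can be inspected for damaged regions. A minimal numpy sketch of that pattern follows; the toy hypercube and its "damage" pattern are invented for illustration.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row)."""
    return ((spectra - spectra.mean(axis=1, keepdims=True))
            / spectra.std(axis=1, keepdims=True))

def pc_scores_image(hypercube, n_components=2):
    """Unfold an (rows, cols, bands) hypercube to (pixels, bands), run PCA
    via SVD on the mean-centred matrix, and refold the score vectors into
    one score image per component."""
    r, c, b = hypercube.shape
    X = hypercube.reshape(r * c, b)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    return scores.reshape(r, c, n_components)

# toy hypercube: 4x4 pixels, 10 bands; the lower-right corner absorbs more
# in the last bands, mimicking a damaged region
rng = np.random.default_rng(0)
cube = rng.normal(1.0, 0.05, size=(4, 4, 10))
cube[2:, 2:, 5:] -= 0.4
flat = snv(cube.reshape(-1, 10))          # SNV pretreatment, pixel by pixel
scores = pc_scores_image(flat.reshape(4, 4, 10))
print(scores.shape)                       # (4, 4, 2)
```

In a real pipeline the PC 2 score image would then be thresholded or segmented to mark the damaged pixels.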

Li et al. (2011) detected common defects on oranges using hyperspectral reflectance imaging (wavelength range: 400–1000 nm). The disadvantage of the studied algorithm is that it could not discriminate between different types of defects.

Bhuvaneswari et al. (2011) compared three methods (electronic speck counting, acid hydrolysis and flotation, and near-infrared hyperspectral imaging) to investigate the presence of insect fragments (*Tribolium castaneum*, Coleoptera: Tenebrionidae) in semolina (an ingredient of pasta and couscous). NIR hyperspectral imaging is a rapid, non-destructive method, like the electronic speck counter, but the two showed different correlations between insect fragments in the semolina and detection of specks in the samples: R2 = 0.99 and 0.639–0.767, respectively. For the NIR hyperspectral imaging technique, the prediction models were developed by PLS regression.
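The PLS regression used here to predict fragment counts from NIR spectra can be sketched with a bare-bones PLS1 (NIPALS) implementation. This is an illustrative sketch on synthetic data, not the published calibration; a tested library implementation is preferable in practice.

```python
import numpy as np

def pls1(X, y, n_components=2):
    """Minimal PLS1 regression (NIPALS). Returns coefficients B such that
    y_hat = (X - X_mean) @ B + y_mean. Sketch for illustration only."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)          # weight vector
        t = Xr @ w                         # scores
        p = Xr.T @ t / (t @ t)             # X loadings
        qk = yr @ t / (t @ t)              # y loading
        Xr = Xr - np.outer(t, p)           # deflate X
        yr = yr - qk * t                   # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.inv(P.T @ W) @ q

# synthetic example: 60 "spectra" with 10 wavelengths, 3 informative
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
b_true = np.zeros(10); b_true[:3] = [1.0, -2.0, 0.5]
y = X @ b_true + 0.01 * rng.normal(size=60)
B = pls1(X, y, n_components=3)
y_hat = (X - X.mean(axis=0)) @ B + y.mean()
```

The number of latent variables would normally be chosen by cross-validation rather than fixed in advance.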

The most important quality features in meat are tenderness, juiciness and flavour. Jackman et al. (2011) wrote a review of recent advances in the use of computer vision technology in the quality assessment of fresh meats. The researchers argue that the best opportunity for improving computer vision solutions is the application of hyperspectral imaging in combination with statistical modelling, a synergy that can provide additional information on meat composition and structure. In parallel, however, new image processing algorithms developed in other scientific disciplines should be carefully considered for potential application to meat images.

Other applications concern the possibility of estimating a correlation between characteristics (physical or chemical) of the food and the spectra acquired with spectroscopic imaging. Moreover, these techniques were able to locate and quantify the characteristic of interest within the image.

In most applications of hyperspectral imaging the wavelength range used is 400–1000 nm, but Maftoonazad et al. (2010) used artificial neural network (ANN) modeling of hyperspectral radiometric (350–2500 nm) data for quality changes associated with avocados during storage. Respiration rate, total color difference, texture and weight loss of the samples were measured as conventional quality parameters during storage, while hyperspectral imaging was used to evaluate the spectral properties of the avocados. Results indicated that ANN models can predict the quality changes in avocado fruits better than conventional regression models.

Mahesh et al. (2011) used near-infrared hyperspectral images (wavelength range: 960–1700 nm), applied to bulk samples, to classify moisture levels (12, 14, 16, 18, and 20%) in wheat. Principal component analysis (PCA) was used to identify the region (1260–1360 nm) carrying the most information. Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) could classify the samples based on moisture content, and also identify specific moisture levels, with good accuracy (61–100% in several cases). Spectral features at the key wavelengths of 1060, 1090, 1340, and 1450 nm ranked top in classifying wheat classes with different moisture contents.
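The LDA-style classification used for such moisture classes can be sketched as assigning each sample to the class mean nearest in Mahalanobis distance, computed with the pooled within-class covariance (equivalent to LDA with equal priors). The three-class toy data below are invented for the sketch.

```python
import numpy as np

def lda_fit(X, labels):
    """Class means and inverse pooled within-class covariance.
    Bare-bones sketch, not a substitute for a library LDA."""
    classes = sorted(set(labels))
    means = {c: X[labels == c].mean(axis=0) for c in classes}
    Sw = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        d = X[labels == c] - means[c]
        Sw += d.T @ d
    Sw /= (len(X) - len(classes))
    return means, np.linalg.inv(Sw)

def lda_predict(X, means, Sw_inv):
    """Assign each spectrum to the class whose mean is nearest in
    Mahalanobis distance."""
    classes = list(means)
    dist = np.array([[(x - means[c]) @ Sw_inv @ (x - means[c])
                      for c in classes] for x in X])
    return np.array([classes[i] for i in dist.argmin(axis=1)])

# toy "moisture classes": three groups of 4-band spectra with shifted means
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(20, 4))
               for m in (0.0, 1.0, 2.0)])
labels = np.repeat([0, 1, 2], 20)
means, Sw_inv = lda_fit(X, labels)
pred = lda_predict(X, means, Sw_inv)
print((pred == labels).mean())
```

QDA differs only in using a separate covariance matrix per class instead of the pooled one.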

Manley et al. (2011) used near-infrared hyperspectral imaging combined with chemometric techniques for tracking the diffusion of conditioning water in single wheat kernels of different hardnesses. NIR analysers are a common, non-destructive, non-contact and fast solution for quality control, and a tool used to detect the moisture content of carrot samples during storage; Firtha (2009), however, used a hyperspectral system that is able to detect the spatial distribution of the reflectance spectrum as well. Statistical analysis of the data identified the optimal intensity function to describe the moisture content.

Chemometrics in Food Technology 243

Yu H. & MacGregor J.F. (2003) applied multivariate image analysis and regression for the prediction of coating content and distribution in the production of snack foods. Elaboration tools based on PCA and PLS were used for the extraction of features from RGB color images and for their use in predicting the average coating concentration and the coating distribution. On-line and off-line images were collected from several different snack food product lines and were used to develop and evaluate the methods. The better methods are now being used in the snack food industry for the on-line monitoring and control of product quality.

Siripatrawan et al. (2011) have developed a rapid method for the detection of *Escherichia coli* contamination in packaged fresh spinach using hyperspectral imaging (400–1000 nm) and chemometric techniques. PCA was implemented to remove redundant information from the hyperspectral data, and an artificial neural network (ANN) was used to correlate the spectra with the number of E. coli and to construct a prediction map of all pixel spectra of an image, displaying the number of E. coli in the sample.

In another study (Barbin et al., 2011) a hyperspectral imaging technique (range from 900 to 1700 nm) was developed to achieve fast, accurate, and objective determination of pork quality grades. The samples investigated were 75 pork cuts of *longissimus dorsi* muscle from three quality grades. Six significant wavelengths (960, 1074, 1124, 1147, 1207 and 1341 nm) that explain most of the variation among pork classes were identified from 2nd-derivative spectra. PCA was carried out, and the results indicated that pork classes could be precisely discriminated with an overall accuracy of 96%. An algorithm was developed to produce classification maps of the investigated samples.

Valous et al. (2010) communicated perspectives and aspects relating to imaging, spectroscopic and colorimetric techniques for the quality evaluation and control of hams. These non-contact and non-destructive techniques can provide useful information regarding ham quality. Hams are considered a heterogeneous solid system: varying colour, irregular shape and spatial distribution of pores; fat-connective tissue, water and protein contribute to the microstructural complexity. This review pays attention to applications of imaging and spectroscopy techniques for measuring properties and extracting features that correlate with ham quality.

The literature also offers a review (Mathiassen et al., 2011) focused on the application of imaging technologies (VIS/NIR imaging, VIS/NIR imaging spectroscopy, planar and computed tomography (CT) X-ray imaging, and magnetic resonance imaging) to the inspection of fish and fish products.

Nicolai et al. (2007) wrote a review about applications of non-destructive measurement of fruit and vegetable quality. Measurement principles are compared, and novel techniques (hyperspectral imaging) are reviewed. Special attention is paid to recent developments in portable systems. The problem of calibration transfer from one spectrophotometer to another is introduced, as well as techniques for calibration transfer. Chemometrics is an essential part of spectroscopy, and the choice of the correct techniques is primary (linear and nonlinear regression, such as kernel-based methods, are discussed). The principal applications of spectroscopy systems in the fruit and vegetable sector have focused on the non-destructive measurement of soluble solids content, texture, dry matter, acidity or disorders of fruits and vegetables, a low root mean square error of prediction being the goal.
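Wavelength screening from 2nd-derivative spectra, as in the pork-grading study above, can be illustrated as follows. The separation score and the synthetic spectra are assumptions made for this sketch, not the published procedure; on real, noisy spectra a Savitzky-Golay derivative would normally replace the plain finite differences.

```python
import numpy as np

def second_derivative(spectra):
    """Second finite difference along each row (spectrum)."""
    return np.diff(spectra, n=2, axis=1)

def rank_wavelengths(spectra, labels):
    """Rank band indices of the 2nd-derivative spectra by a simple
    separation score: |difference of class means| / pooled within-class std."""
    d2 = second_derivative(spectra)
    g0, g1 = d2[labels == 0], d2[labels == 1]
    pooled_sd = 0.5 * (g0.std(axis=0) + g1.std(axis=0)) + 1e-12
    score = np.abs(g0.mean(axis=0) - g1.mean(axis=0)) / pooled_sd
    return np.argsort(score)[::-1]        # most discriminating first

# two toy classes: class 1 carries an extra absorption band around index 10;
# the linear baseline is removed exactly by the 2nd derivative
rng = np.random.default_rng(3)
bands = np.arange(20)
base = rng.normal(0.0, 0.01, size=(40, 20)) + 0.05 * bands
bump = np.exp(-(bands - 10.0) ** 2 / 8.0)
labels = np.repeat([0, 1], 20)
spectra = base + np.outer(labels, bump)
ranking = rank_wavelengths(spectra, labels)
print(ranking[0])      # a 2nd-derivative index near the band at 10
```

Baseline removal is the main reason derivative spectra are popular for this kind of screening.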

The intent of Junkwon et al. (2009) was to develop a technique for weight and ripeness estimation of oil palm (Elaeis guineensis Jacq. var. tenera) bunches by hyperspectral and RGB color images. In the hyperspectral images, the total number of pixels in the bunch was counted from an image composed of three wavelengths (560 nm, 680 nm, and 740 nm), while the total number of pixels of space between fruits was obtained at a wavelength of 910 nm. Weight-estimation equations were determined by linear regression (LR) or multiple linear regression (MLR). As a result, the coefficients of determination (R2) between actual and estimated weight were 0.989 and 0.992 for color and hyperspectral images, respectively. For the estimation of bunch ripeness, the bunches were classified into 4 classes (overripe, ripe, underripe, and unripe) (Fig. 10). Euclidean distances between the test sample and the standards of the 4 ripeness classes were calculated, and the test sample was assigned to the nearest ripeness class. For the classification based on color images (average RGB values of concealed and not-concealed areas) and on hyperspectral images (average intensity values of fruit pixels from the concealed area), the validation experiments with the developed estimation methods indicated acceptable estimation accuracy.

Fig. 10. Bunches of palm oil: a) unripe, b) underripe, c) ripe, and d) overripe (Junkwon et al., 2009)
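The distance-based ripeness grading above amounts to a nearest-standard rule: compute the Euclidean distance of a sample's feature vector to each class standard and pick the closest. The RGB standards below are hypothetical values chosen for illustration, not those of the study.

```python
import numpy as np

def nearest_class(sample, standards):
    """Return the class whose standard vector is closest to the sample
    in Euclidean distance (standards: dict name -> feature vector)."""
    names = list(standards)
    dists = [np.linalg.norm(np.asarray(sample) - standards[k]) for k in names]
    return names[int(np.argmin(dists))]

# hypothetical mean RGB standards for the four ripeness classes
standards = {
    "unripe":    np.array([ 60.0, 110.0, 50.0]),
    "underripe": np.array([120.0, 100.0, 50.0]),
    "ripe":      np.array([170.0,  60.0, 40.0]),
    "overripe":  np.array([ 90.0,  40.0, 30.0]),
}
print(nearest_class([165.0, 65.0, 45.0], standards))   # ripe
```

The same rule works unchanged with average hyperspectral intensity vectors instead of RGB values.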

Nguyen et al. (2011) illustrated the potential of the combination of hyperspectral imaging, chemometrics and image processing as a process monitoring tool for the potato processing industry. They predicted the optimal cooking time by hyperspectral imaging (wavelength range 400–1000 nm). By partial least squares discriminant analysis (PLS-DA), cooked and raw parts of boiled potatoes were discriminated successfully. By modeling the evolution of the cooking front over time, the optimal cooking time could be predicted with less than 10% relative error.
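The idea of predicting the optimal cooking time from the evolution of the cooking front can be illustrated with a simple linear extrapolation of the cooked-pixel fraction. This is only a sketch of the concept, on invented numbers, not the authors' actual model.

```python
import numpy as np

def predict_cooking_time(times_min, cooked_fraction):
    """Fit a straight line to the cooked-area fraction (cooked pixels /
    total pixels, e.g. from PLS-DA classification maps) versus time and
    extrapolate to the time at which the fraction reaches 1.0.
    Illustrative only: the real front kinetics need not be linear."""
    slope, intercept = np.polyfit(times_min, cooked_fraction, deg=1)
    return (1.0 - intercept) / slope

times = np.array([2.0, 4.0, 6.0, 8.0])       # minutes of boiling
frac  = np.array([0.10, 0.30, 0.50, 0.70])   # fraction classified as cooked
t_opt = predict_cooking_time(times, frac)
print(round(t_opt, 2))                       # 11.0
```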


#### **3.3 Electronic nose**

The preservation of quality in post-harvest is a prerequisite for agri-food products in the final stages of commercialization. Fruit quality is related to appearance (skin color, size, shape, integrity of the fruit), to sensorial properties (hardness and crispness of the flesh, juiciness, acids/sugars) and to safety (residues in the fruit).

Agri-food products carry a variety of information directly related to their quality, traditionally measured by means of tedious, time-consuming and destructive analyses. For this reason, there is a growing interest in easy-to-use, rapid and non-destructive techniques useful for quality assessment.

Currently, electronic noses are mainly applied in the food industry to recognize the freshness of products, to detect fraud (source control, adulteration) and to detect contaminants.

An essential step in the analysis with an electronic nose is high-performance statistical elaboration: the electronic nose provides multivariate results that need to be processed using chemometric techniques. Even if the best-performing programs are sophisticated and consequently require skilled personnel, most companies have implemented user-friendly software for data treatment in commercially available electronic noses (Ampuero & Bosset, 2003).

A commercial electronic nose was used as a non-destructive tool to characterise peach cultivars and to monitor their ripening stage during shelf-life (Benedetti et al., 2008). Principal component analysis (PCA) and linear discriminant analysis (LDA) were used to investigate whether the electronic nose was able to distinguish among four diverse cultivars. Classification and regression tree (CART) analysis was applied to assign peach samples to three classes of ripening stage (unripe, ripe, over-ripe); the resulting model classified the samples into their respective groups with a cross-validation error rate of 4.87%.

Regarding the fruit and vegetable sector, Torri et al. (2010) investigated the applicability of a commercial electronic nose for monitoring the freshness of packaged pineapple slices during storage. The results showed that the electronic nose was able to discriminate between the samples and to monitor the changes in volatile compounds correlated with quality decay. The second derivative of the transition function used to interpolate the PC1 score trend versus storage time at each temperature was calculated to estimate the stability time.

Ampuero and Bosset (2003) presented a review about the application of the electronic nose to dairy products. The review deals with examples such as the evaluation of cheese ripening, the detection of mould in cheese, the classification of milk by trademark, by fat level and by preservation process, the classification and quantification of off-flavours in milk, the evaluation of Maillard reactions during the heating of block-milk, as well as the identification of single strains of disinfectant-resistant bacteria in mixed cultures in milk. To each application corresponds the chemometric method that extrapolates the maximum information. PCA was carried out in order to associate descriptors (chocolate, caramel, burnt and nutty) typical of the volatiles generated by Maillard reactions during milk heating; in another case PCA gave a correct classification of samples according to the origin of the off-flavours. Figure 11 shows an example of a result obtained by DFA (discriminant function analysis); in this case the aim of the researchers was to classify samples of a given product by their place of production.

Fig. 11. Classification of Emmental cheese by geographic origin performed with an electronic nose based on mass spectrometry. The graph shows DFA 1 vs. DFA 2 with 100% group classification based on five variables. No validation set was considered due to the limited number of samples. A: Austria, D: Germany, F: France, Fi: Finland (Pillonel et al., 2003)

The potential of the electronic nose technique for monitoring the storage time and quality attributes of eggs was investigated by Yongwei et al. (2009). Using multivariate analysis techniques, eggs under cool and room-temperature storage were distinguished. Results showed that the e-nose could distinguish eggs of different storage times under cool and room-temperature storage by LDA, PCA, BPNN and GANN. Good distinction between eggs stored for different times was obtained by PCA and LDA (the LDA results were better than those obtained by PCA). A back-propagation neural network (BPNN) and the combination of a genetic algorithm and a BP neural network (GANN) gave good predictions of egg storage time (GANN demonstrated better correct classification rates than BPNN). The quadratic polynomial step regression (QPSR) algorithm established models that described the relationship between the sensor signals and egg quality indices (Haugh unit and yolk factor); the QPSR models showed a high predictive ability (R2 = 0.91-0.93).

Guidetti et al. (2011) used an electronic nose and infrared thermography to detect physiological disorders on apples (Golden Delicious and Stark Delicious). In particular, the aim was to differentiate typical external apple diseases (physiological, pathological and entomological disorders). The applicability of the e-nose is based on the hypothesis that apples affected by a physiopathology produce volatile compounds different from those produced by healthy fruits. The electronic nose data were elaborated by LDA in order to classify the apples into the four classes.


Compared with traditional methods, NIR and Vis/NIR are less expensive because of no demand of other materials such as chemical reagents except the electrical consumption. Many works are focused on the study of chemometrics. This is because an important challenge is to build robust calibration models, in fact it is important to apply chemometric methods able to select useful information from a great deal of spectral data. Moreover food researchers and analysts are looking for the sensitive wavelength in Vis/NIR region representing the characteristics of food products, with the aim of develop some simple and

HSI is the new frontier for optical analysis of foods. The performance of HSI instrumentation has developed such that a full hypercube can now be acquired in just a few seconds. In tandem with these developments, advances in component design have led to reductions in the size and cost of HSI systems. This has led to increased interest in their online implementation for quality monitoring in major industries such as food and pharmaceutical (Burger & Gowen, 2011). In future, with further improvement, the HSI

The equipment that the food industry has at its disposal is certainly complex and not easy to use. The chemometric approach has allowed, through different applicative researches, to arrive at algorithms that can support the analysis in the entire food chain from raw material producers to large retail organizations. Despite this, we are still faced with instrumentation with not easy usability and relatively high cost: the studies must move towards a more simplified instrumental approach through greater integration of hardware with software. The challenges are many: optimizing the information that you are able to extract from raw data and aimed at specific problems, simplify electronic components, increase the level of

In conclusion the only way of an interdisciplinary approach can lead to the solution of a system that can provide at different level more immediate response and more food safety

Aleixos, N.; Blasco, J.; Navarrón, F. & Moltó, E. (2002). *Multispectral inspection of citrus in real-*

Amamcharla, J. K.; Panigrahi, S.; Logue, C. M.; Marchello, M. & Sherwood, J. S. (2010).

Andrés, S.; Murray, I.; Navajas, E.A.; Fisher, A.V.; Lambe, N.R. & Bünger, L. (2007).

Ariana, D.P. & Lu, R.A. (2010). *Evaluation of internal defect and surface color of whole pickles using hyperspectral imaging*. Journal of Food Engineering Vol.96, pp. 583–590 Baranowski, P.; Lipecki, J.; Mazurek, W. & Walczak, R.T. (2008). *Detenction of watercore in* 

*time using machine vision and digital signal processors*. Computers and Electronics in

*Fourier transform infrared spectroscopy (FTIR) as a tool for discriminating Salmonella typhimurium contaminated beef.* Sens. & Instrumen. Food Qual., Vol.4, pp. 1–12 Ampuero, S. & Bosset, J.O. (2003). *The electronic nose applied to dairy products: a review*. Sensors

*Prediction of sensory characteristics of lamb meat samples by near infrared reflectance* 

*'Gloster' apples using thermography*. Postharvest Biology and Technology Vol.47, pp.

low-cost instruments (Cen & He, 2007).

interaction tool/operator.

and quality.

**5. References** 

358

system could meet the need of a commercial plant setting.

Agriculture. Vol.33, N.2, pp. 121-137

and Actuators B Vol.94, pp. 1–12

*spectroscopy*. Meat Science Vol.76, pp. 509–516

in order to classify the apples into the four classes. Figure 12 shows how the first two LDA functions discriminate among classes. Considering Stark variety, function 1 seems to discriminate among the physiopathologies while function 2 discriminate the healthy apples from those with physiological disorders. The error rate and the cross validation error rate were of 2.6% and 26.3% respectively. In the case of Golden variety, along the first function there is the separation of Control samples from the apples affected by diseases, while in the vertical direction (function 2) there is an evident discrimination among the three physiopathologies. The error rate and the cross validation error rate were of 0.8% and 18% respectively.

Fig. 12. a) Canonical discriminant functions of LDA for Stark variety; b) Canonical discriminant functions of LDA for Golden variety. C=control, B=bitter pit, T=scab and CY=Cydia Pomonella.

Cerrato Oliveros et al. (2002) selected array of 12 metal oxide sensors to detected adulteration in virgin olive oils samples and to quantify the percentage of adulteration by electronic nose. Multivariate chemometric techniques such as PCA were applied to choose a set of optimally discriminant variables. Excellent results were obtained in the differentiation of adulterated and non-adulterated olive oils, by application of LDA, QDA. The models provide very satisfactory results, with prediction percentages >95%, and in some cases almost 100%. The results with ANN are slightly worse, although the classification criterion used here was very strict. To determine the percentage of adulteration in olive oil samples multivariate calibration techniques based on partial least squares and ANN were employed. Not so good results were carried out, even if there are exceptions. Finally, classification techniques can be used to determine the amount of adulterant oil added with excellent results.

#### **4. Conclusion**

This work has presented the principal non-destructive techniques for analysis in the food sector. They are rapid techniques that, combined with chemometric methods, support both qualitative and quantitative analysis.

NIR spectroscopy is the technique that has developed furthest in recent years. This success stems from the fact that a spectral measurement takes only a few seconds per sample, so numerous samples can be analyzed and several quality indices determined simultaneously.


Compared with traditional methods, NIR and Vis/NIR analyses are less expensive because, apart from electrical consumption, they require no additional materials such as chemical reagents. Many works focus on the chemometric side, because an important challenge is to build robust calibration models: it is essential to apply chemometric methods able to extract the useful information from the large amount of spectral data. Moreover, food researchers and analysts are looking for the sensitive wavelengths in the Vis/NIR region that represent the characteristics of food products, with the aim of developing simple and low-cost instruments (Cen & He, 2007).

HSI is the new frontier for optical analysis of foods. The performance of HSI instrumentation has developed such that a full hypercube can now be acquired in just a few seconds. In tandem with these developments, advances in component design have led to reductions in the size and cost of HSI systems. This has led to increased interest in their online implementation for quality monitoring in major industries such as food and pharmaceuticals (Burger & Gowen, 2011). In the future, with further improvements, HSI systems could meet the needs of a commercial plant setting.

The equipment that the food industry has at its disposal is certainly complex and not easy to use. Through various applied studies, the chemometric approach has led to algorithms that can support analysis along the entire food chain, from raw-material producers to large retail organizations. Despite this, we are still faced with instrumentation of limited usability and relatively high cost: studies must move towards a more simplified instrumental approach through greater integration of hardware with software. The challenges are many: optimizing the information that can be extracted from raw data for specific problems, simplifying electronic components, and increasing the level of tool/operator interaction.

In conclusion, only an interdisciplinary approach can lead to systems that provide, at the different levels of the food chain, more immediate responses and greater food safety and quality.

#### **5. References**


Barbin, D.; Elmasry, G.; Sun, D.W. & Allen, P. (2011). *Near-infrared hyperspectral imaging for grading and classification of pork*. Meat Science. Article in press

Basilevsky, A. (1994). *Statistical factor analysis and related methods: theory and applications.* Wiley-Interscience Publication. ISBN 0-471-57082-6

Benedetti, S.; Buratti, S.; Spinardi, A.; Mannino, S. & Mignani, I. (2008). *Electronic nose as a non-destructive tool to characterize peach cultivars and to monitor their ripening stage during shelf-life*. Postharvest Biology and Technology Vol.47, pp. 181–188

Bhuvaneswari, K.; Fields, P.G.; White, N.D.G.; Sarkar, A.K.; Singh, C.B. & Jayas, D.S. (2011). *Image analysis for detecting insect fragments in semolina.* Journal of Stored Products Research Vol.47, pp. 20-24

Brenna, O.V. & Berardo, N. (2004). *Application of near-infrared reflectance spectroscopy (NIRS) to the evaluation of carotenoids content in maize*. J. Agric. Food Chem. Vol.52, 5577

Brosnan, T. & Sun, D.W. (2004). *Improving quality inspection of food products by computer vision: a review*. Journal of Food Engineering Vol.61, pp. 3-16

Burger, J. & Gowen, A. (2011). *Data handling in hyperspectral image analysis*. Chemometrics and Intelligent Laboratory Systems Vol.108, pp. 13-22

Burns, D.A. & Ciurczak, E.W. (2001). *Handbook of Near-Infrared Analysis*, 2nd ed. Marcel Dekker, New York. Vol.27, N.28, pp. 729–782

Carrara, M.; Catania, P.; Vallone, M. & Piraino, S. (2008). *Mechanical harvest of grapes: Assessment of the physical-mechanical characteristics of the berry in order to improve the quality of wines*. In Proc. Intl. Conf. on Agricultural Engineering: Agricultural and Biosystems Engineering for a Sustainable World (AgEng 2008). CIGR

Cen, H. & He, Y. (2007). *Theory and application of near infrared reflectance spectroscopy in determination of food quality.* Trends in Food Science & Technology Vol.18, pp. 72-83

Cerrato Oliveros, M.C.; Pérez Pavón, J.L.; García Pinto, C.; Fernández Laespada, M.E.; Moreno Cordero, B. & Forina, M. (2002). *Electronic nose based on metal oxide semiconductor sensors as a fast alternative for the detection of adulteration of virgin olive oils*. Analytica Chimica Acta Vol.459, pp. 219–228

Cozzolino, D.; Cynkar, W.; Janik, L.; Dambergs, R.G. & Gishen, M. (2006a). *Analysis of grape and wine by near infrared spectroscopy — A review*. J. Near Infrared Spectrosc. Vol.14, pp. 279−289

Cozzolino, D.; Parker, M.; Dambergs, R.G.; Herderich, M. & Gishen, M. (2006b). *Chemometrics and visible-near infrared spectroscopic monitoring of red wine fermentation in a pilot scale*. Biotechnol. Bioeng. Vol.95, pp. 1101

Cubero, S.; Aleixos, N.; Moltó, E.; Gómez-Sanchis, J. & Blasco, J. (2010). *Advances in Machine Vision Applications for Automatic Inspection and Quality Evaluation of Fruits and Vegetables*. Food Bioprocess Technol Vol.4, pp. 487–504

Da Costa Filho, P.A. (2009). *Rapid determination of sucrose in chocolate mass using near infrared spectroscopy*. Analytica Chimica Acta Vol.631, pp. 206–211

De Temmerman, J.; Saeys, W.; Nicolai, B. & Ramon, H. (2007). *Near infrared reflectance spectroscopy as a tool for the in-line determination of the moisture concentration in extruded semolina pasta*. Biosystems Engineering Vol.97, pp. 313–321

Deisingh, A.K.; Stone, D.C. & Thompson, M. (2004). *Applications of electronic noses and tongues in food analysis*. International Journal of Food Science and Technology Vol.39, pp. 587–604

Diezak, J.D. (1988). *Microscopy and image analysis for R&D, Special report*. Food Technol. pp. 110-124

Du, C.J. & Sun, D.W. (2006). *Learning techniques used in computer vision for food quality evaluation: a review*. Journal of Food Engineering Vol.72, N.1, pp. 39–55

Elbatawi, I.E. & Ebaid, M.T. (2006). *A new technique for grape inspection and sorting classification*. Arab Universities J. Agric. Sci. Vol.14, N.2, pp. 555-573

ElMasry, G. & Sun, D.W. (2010). *Hyperspectral imaging for food quality, analysis and control*. (http://elsevier.insidethecover.com/searchbook.jsp?isbn=9780123740854) Book, N.1, pp. 3-43

Ferris, D.; Lawhead, R.; Dickman, E.; Holtzapple, N.; Miller, J.; Grogan, S.; et al. (2001). *Multimodal hyperspectral imaging for the noninvasive diagnosis of cervical neoplasia*. Journal of Lower Genital Tract Disease Vol.5, N.2, pp. 65-72

Fessenden, R.J. & Fessenden, J.S. (1993). Chimica organica. Cap. 9: *Spettroscopia I: Spettri infrarossi, Risonanza Magnetica Nucleare*. Piccin, Padova, Italy

Firtha, F. (2009). *Detecting moisture loss of carrot samples during storage by hyperspectral imaging system*. Acta Alimentaria Vol.38, N.1, pp. 55-66

Firtha, F.; Fekete, A.; Kaszab, T.; Gillay, B.; Nogula-Nagy, M.; Kovács, Z. & Kantor, D.B. (2008). *Methods for improving image quality and reducing data load of NIR hyperspectral images*. Sensors Vol.8, pp. 3287-3298

Fodor, M.; Woller, A.; Turza, S. & Szigedi, T. (2011). *Development of a rapid, non-destructive method for egg content determination in dry pasta using FT-NIR technique*. Journal of Food Engineering Vol.107, pp. 195–199

Frank, I.E. & Todeschini, R. (1994). *The Data Analysis Handbook.* Elsevier. ISBN 0-444-81659-3, included in series: Data Handling in Science and Technology

Gardner, J.W. & Bartlett, P.N. (1994). *A brief history of electronic noses*. Sens. Actuat. B: Chem. Vol.18, pp. 211-220

González-Martín, M.I.; Severiano-Pérez, P.; Revilla, I.; Vivar-Quintana, A.M.; Hernández-Hierro, J.M.; González-Pérez, C. & Lobos-Ortega, I.A. (2011). *Prediction of sensory attributes of cheese by near-infrared spectroscopy*. Food Chemistry Vol.127, pp. 256–263

Gowen, A.A.; O'Donnell, C.P.; Cullen, P.J.; Downey, G. & Frias, J.M. (2007b). *Hyperspectral imaging - an emerging process analytical tool for food quality and safety control.* Trends in Food Science & Technology Vol.18, pp. 590-598

Gowen, A.A.; O'Donnell, C.P.; Taghizadeh, M.; Cullen, P.J.; Frias, J.M. & Downey, G. (2007a). *Hyperspectral imaging combined with principal component analysis for bruise damage detection on white mushrooms (Agaricus bisporus)*. J. of Chemometrics DOI: 10.1002/cem.1127

Gowen, A.A.; Taghizadeh, M. & O'Donnell, C.P. (2009). *Identification of mushrooms subjected to freeze damage using hyperspectral imaging*. Journal of Food Engineering Vol.93, pp. 7–12

Gunasekaran, S. & Ding, K. (1994). *Using computer vision for food quality evaluation*. Food Technol. Vol.15, pp. 1-54

Guidetti, R.; Beghi, R. & Bodria, L. (2010). *Evaluation of Grape Quality Parameters by a Simple Vis/NIR System*. Transaction of the ASABE, Vol.53, N.2, pp. 477-484, ISSN: 2151-0032

Guidetti, R.; Beghi, R.; Bodria, L.; Spinardi, A.; Mignani, I. & Folini, L. (2008). *Prediction of blueberry (Vaccinium corymbosum) ripeness by a portable Vis-NIR device*. Acta Horticulturae, n° 310, ISBN 978-90-66057-41-8, pp. 877-885

Guidetti, R.; Buratti, S. & Giovenzana, V. (2011). *Application of Electronic Nose and Infrared Thermography to detect physiological disorders on apples (Golden Delicious and Stark Delicious).* CIGR Section VI International Symposium on Towards a Sustainable Food Chain Food Process, Bioprocessing and Food Quality Management. Nantes, France - April 18-20, 2011

Jackman, P.; Sun, D.W. & Allen, P. (2011). *Recent advances in the use of computer vision technology in the quality assessment of fresh meats*. Trends in Food Science & Technology Vol.22, pp. 185-197

Jackson, J.E. (1991). *A user's guide to principal components.* Wiley-Interscience Publication. ISBN 0-471-62267-2

Janik, L.J.; Cozzolino, D.; Dambergs, R.; Cynkar, W. & Gishen, M. (2007). *The prediction of total anthocyanin concentration in red-grape homogenates using visible-near-infrared spectroscopy and artificial neural networks*. Anal. Chim. Acta Vol.594, p. 107

Junkwon, P.; Takigawa, T.; Okamoto, H.; Hasegawa, H.; Koike, M.; Sakai, K.; Siruntawineti, J.; Chaeychomsri, W.; Sanevas, N.; Tittinuchanon, P. & Bahalayodhin, B. (2009). *Potential application of color and hyperspectral images for estimation of weight and ripeness of oil palm (Elaeis guineensis Jacq. var. tenera).* Agricultural Information Research Vol.18, N.2, pp. 72–81

Karoui, R. & De Baerdemaeker, J. (2006). *A review of the analytical methods coupled with chemometric tools for the determination of the quality and identity of dairy products*. Food Chemistry Vol.102, pp. 621–640

Li, J.; Rao, X. & Ying, Y. (2011). *Detection of common defects on oranges using hyperspectral reflectance imaging*. Computers and Electronics in Agriculture Vol.78, pp. 38–48

Lu, R. & Ariana, D. (2002). *A Near-Infrared Sensing Technique for Measuring Internal Quality of Apple Fruit*. Applied Engineering in Agriculture, Vol.18, N.5, pp. 585-590

Lunadei, L. (2008). *Image analysis as a methodology to improve the selection of foodstuffs*. PhD thesis in "Technological innovation for agro-food and environmental sciences", Department of Agricultural Engineering, Università degli Studi di Milano

Lyon, R.C.; Lester, D.S.; Lewis, E.N.; Lee, E.; Yu, L.X.; Jefferson, E.H.; et al. (2002). *Near-infrared spectral imaging for quality assurance of pharmaceutical products: analysis of tablets to assess powder blend homogeneity*. AAPS PharmSciTech Vol.3, N.3, pp. 17

Maftoonazad, N.; Karimi, Y.; Ramaswamy, H.S. & Prasher, S.O. (2010). *Artificial neural network modeling of hyperspectral radiometric data for quality changes associated with avocados during storage*. Journal of Food Processing and Preservation ISSN 1745-4549

Mahesh, S.; Jayas, D.S.; Paliwal, J. & White, N.D.G. (2011). *Identification of wheat classes at different moisture levels using near-infrared hyperspectral images of bulk samples*. Sens. & Instrumen. Food Qual. Vol.5, pp. 1–9

Manley, M.; Du Toit, G. & Geladi, P. (2011). *Tracking diffusion of conditioning water in single wheat kernels of different hardnesses by near infrared hyperspectral imaging*. Analytica Chimica Acta Vol.686, pp. 64–75

Massart, D.L.; Buydens, L.M.C.; De Jong, S.; Lewi, P.J. & Smeyers-Verbeke, J. (1998). *Handbook of Chemometrics and Qualimetrics: Part B*. Edited by B.G.M. ISBN: 978-0-444-82853-8, included in series: Data Handling in Science and Technology

Massart, D.L.; Vandeginste, B.G.M.; Buydens, L.M.C.; De Jong, S.; Lewi, P.J. & Smeyers-Verbeke, J. (1997). *Handbook of Chemometrics and Qualimetrics: Part A*. Elsevier, ISBN: 0-444-89724-0, included in series: Data Handling in Science and Technology

Mathiassen, J.R.; Misimi, E.; Bondø, M.; Veliyulin, E. & Ove Østvik, S. (2011). *Trends in application of imaging technologies to inspection of fish and fish products*. Trends in Food Science & Technology Vol.22, pp. 257-275

McClure, W.F. (2003). *204 years of near infrared technology: 1800 - 2003*. Journal of Near Infrared Spectroscopy, Vol.11, pp. 487−518

Mehl, P.M.; Chao, K.; Kim, M. & Chen, Y.R. (2002). *Detection of defects on selected apple cultivars using hyperspectral and multispectral image analysis*. Applied Engineering in Agriculture Vol.18, N.2, pp. 219-226

Monteiro, S.; Minekawa, Y.; Kosugi, Y.; Akazawa, T. & Oda, K. (2007). *Prediction of sweetness and amino acid content in soybean crops from hyperspectral imagery*. ISPRS Journal of Photogrammetry and Remote Sensing Vol.62, N.1, pp. 2–12

Naes, T.; Isaksson, T.; Fearn, T. & Davies, T. (2002). *A user-friendly guide to multivariate calibration and classification*. Chichester, UK: NIR Publications ISBN 0-9528666-2-5

Nguyen, D.; Trong, N.; Tsuta, M.; Nicolaï, B.M.; De Baerdemaeker, J. & Saeys, W. (2011). *Prediction of optimal cooking time for boiled potatoes by hyperspectral imaging.* Journal of Food Engineering Vol.105, pp. 617–624

Nicolai, B.M.; Beullens, K.; Bobelyn, E.; Peirs, A.; Saeys, W.; Theron, K.I. & Lammertyn, J. (2007). *Non-destructive measurement of fruit and vegetable quality by means of NIR spectroscopy: A review*. Postharvest Biology and Technology, Vol.46, pp. 99−118

Oliveri, P.; Di Egidio, V.; Woodcock, T. & Downey, G. (2011). *Application of class-modelling techniques to near infrared data for food authentication purposes*. Food Chemistry Vol.125, pp. 1450–1456

Osborne, B.G.; Fearn, T. & Hindle, P.H. (1993). *Practical NIR Spectroscopy with Applications in Food and Beverage Analysis*. Cap. 4: Fundamentals of near infrared instrumentation, pp. 73-76. Longman Scientific & Technical

Pillonel, L.; Ampuero, S.; Tabacchi, R. & Bosset, J.O. (2003). *Analytical methods for the determination of the geographic origin of Emmental cheese, volatile compounds by GC–MS–FID and electronic nose*. Eur. J. Food Res. Technol. Vol.216, pp. 179–183

Prieto, N.; Roehe, R.; Lavín, P.; Batten, G. & Andrés, S. (2009b). *Application of near infrared reflectance spectroscopy to predict meat and meat products quality: A review*. Meat Science Vol.83, pp. 175–186

Prieto, N.; Ross, D.W.; Navajas, E.A.; Nute, G.R.; Richardson, R.I.; Hyslop, J.J.; Simm, G. & Roehe, R. (2009a). *On-line application of visible and near infrared reflectance spectroscopy to predict chemical–physical and sensory characteristics of beef quality*. Meat Science Vol.83, pp. 96–103

Ribeiro, J.S.; Ferreira, M.M.C. & Salva, T.J.G. (2011). *Chemometric models for the quantitative descriptive sensory analysis of Arabica coffee beverages using near infrared spectroscopy*. Talanta Vol.83, pp. 1352–1358

Riva, M. (1999). Introduzione alle tecniche di Image Analysis. http://www.distam.unimi.it/image\_analysis/image0.htm

Roggo, Y.; Edmond, A.; Chalus, P. & Ulmschneider, M. (2005). *Infrared hyperspectral imaging for qualitative analysis of pharmaceutical solid forms.* Analytica Chimica Acta, Vol.535, N.1-2, pp. 79-87

Römisch, U.; Jäger, H.; Capron, X.; Lanteri, S.; Forina, M. & Smeyers-Verbeke, J. (2009). *Characterization and determination of the geographical origin of wines. Part III: multivariate discrimination and classification methods*. Eur. Food Res. Technol. Vol.230, pp. 31–45

Russ, J.C.; Stewart, W. & Russ, J.C. (1988). *The measurement of macroscopic image*. Food Technol. Vol.42, pp. 94-102

Sankaran, S.; Mishra, A.; Ehsani, R. & Davis, C. (2010). *A review of advanced techniques for detecting plant diseases*. Computers and Electronics in Agriculture Vol.72, pp. 1–13

Sierra, V.; Aldai, N.; Castro, P.; Osoro, K.; Coto-Montes, A. & Oliva, M. (2007). *Prediction of the fatty acid composition of beef by near infrared transmittance spectroscopy*. Meat Science Vol.78, pp. 248–255

Sinelli, N.; Limbo, S.; Torri, L.; Di Egidio, V. & Casiraghi, E. (2010). *Evaluation of freshness decay of minced beef stored in high-oxygen modified atmosphere packaged at different temperatures using NIR and MIR spectroscopy*. Meat Science Vol.86, pp. 748–752




Siripatrawan, U.; Makino, Y.; Kawagoe, Y. & Oshita, S. (2011). *Rapid detection of Escherichia coli contamination in packaged fresh spinach using hyperspectral imaging*. Talanta Vol.85, pp. 276–281

Smail, V.; Fritz, A. & Wetzel, D. (2006). *Chemical imaging of intact seeds with NIR focal plane array assists plant breeding*. Vibrational Spectroscopy Vol.42, N.2, pp. 215-221

Sørensen, L.K. (2009). *Application of reflectance near infrared spectroscopy for bread analyses*. Food Chemistry Vol.113, pp. 1318–1322

Stark, E.K. & Luchter, K. (2003). *Diversity in NIR Instrumentation*. In: Near Infrared Spectroscopy: Proceedings of the 11th International Conference. NIR Publications, Chichester, UK, pp. 55-66

Torri, L.; Sinelli, N. & Limbo, S. (2010). *Shelf life evaluation of fresh-cut pineapple by using an electronic nose*. Postharvest Biology and Technology Vol.56, pp. 239–245

Urbano-Cuadrado, M.; de Castro, M.D.L.; Perez-Juan, P.M.; Garcia-Olmo, J. & Gomez-Nieto, M.A. (2004). *Near infrared reflectance spectroscopy and multivariate analysis in enology — Determination or screening of fifteen parameters in different types of wines*. Anal. Chim. Acta Vol.527, pp. 81-88

Valous, N.A.; Mendoza, F. & Sun, D.W. (2010). *Emerging non-contact imaging, spectroscopic and colorimetric technologies for quality evaluation and control of hams: a review*. Trends in Food Science & Technology Vol.21, pp. 26-43

Williams, P.C. & Sobering, D. (1996). *How do we do it: A brief summary of the methods we use in developing near infrared calibrations.* In A.M.C. Davies & P.C. Williams (Eds.), Near infrared spectroscopy: the future waves (pp. 185–188). Chichester: NIR Publications

Wilson, A.D. & Baietto, M. (2009). *Applications and advances in electronic-nose technologies*. Sensors Vol.9, pp. 5099-5148

Wold, S.; Sjöström, M. & Eriksson, L. (2001). *PLS-regression: a basic tool of chemometrics*. Chemom. Intell. Lab. Syst. Vol.58, pp. 109–130

Yongwei, W.; Wang, J.; Zhou, B. & Lu, Q. (2009). *Monitoring storage time and quality attribute of egg based on electronic nose*. Analytica Chimica Acta Vol.650, pp. 183–188

Yu, H. & MacGregor, J.F. (2003). *Multivariate image analysis and regression for prediction of coating content and distribution in the production of snack foods*. Chemometrics and Intelligent Laboratory Systems Vol.67, pp. 125–144

Zardetto, S. & Dalla Rosa, M. (2006). *Study of the effect of lamination process on pasta by physical chemical determination and near infrared spectroscopy analysis*. Journal of Food Engineering Vol.74, pp. 402–409

Zheng, C.; Sun, D.W. & Zheng, L. (2006). *Recent developments and applications of image features for food quality evaluation and inspection: a review*. Trends in Food Science & Technology Vol.17, pp. 642-655

Zheng, G.; Chen, Y.; Intes, X.; Chance, B. & Glickson, J.D. (2004). *Contrast-enhanced near-infrared (NIR) optical imaging for subsurface cancer detection*. Journal of Porphyrins and Phthalocyanines Vol.8, N.9, pp. 1106-1117

## **Metabolomics and Chemometrics as Tools for Chemo(bio)diversity Analysis - Maize Landraces and Propolis**

Marcelo Maraschin et al.\*

*Plant Morphogenesis and Biochemistry Laboratory, Federal University of Santa Catarina, Florianopolis, SC, Brazil*

#### **1. Introduction**

Developments in analytical techniques (e.g., GC-MS, LC-MS, 1H- and 13C-NMR, FT-MS) are progressing rapidly and have been driven mostly by the requirements of the healthcare and food sectors. Simultaneous high-throughput measurements of several analytes at the level of transcripts (transcriptomics), proteins (proteomics), and metabolites (metabolomics) are currently performed, producing a prodigious amount of data. Thus, the advent of *omic* studies has created an information explosion, resulting in a paradigm shift in the emphasis of analytical research on biological systems. The traditional approaches of biochemistry and molecular cell biology, where cellular processes have been investigated individually and often independently of each other, are giving way to a wider approach of analyzing the cellular composition in its entirety, allowing a *quasi*-complete metabolic picture to be achieved.

The exponential growth of data, largely from genomics and genomic technologies, has changed the way biologists think about and handle data. In order to derive meaning from these large data sets, tools are required to analyze and identify patterns in the data, and to allow the data to be placed into a biological context. In this scenario, biologists have a continuous need for tools to manage and analyze the ever-increasing data supply. Optimal use of these data sets, primarily of a chemical nature, requires effective methods to analyze and manage them. It is obvious that all *omic* approaches will rely heavily upon bioinformatics for the storage, retrieval, and analysis of large data sets. Thus, and taking into account the multivariate nature of analysis in *omic* technologies, there is an increasing emphasis in research on the application of chemometric techniques for extracting relevant information.

<sup>\*</sup> Shirley Kuhnen1, Priscilla M. M. Lemos1, Simone Kobe de Oliveira1, Diego A. da Silva1, Maíra M. Tomazzoli1, Ana Carolina V. Souza1, Rúbia Mara Pinto2, Virgílio G. Uarrota1, Ivanir Cella2, Antônio G. Ferreira3, Amélia R. S. Zeggio1, Maria B.R. Veleirinho4, Ivone Delgadillo4 and Flavia A. Vieira4

*<sup>1</sup> Plant Morphogenesis and Biochemistry Laboratory, Federal University of Santa Catarina, Florianopolis, SC, Brazil;* 

*<sup>2</sup> EPAGRI – Florianópolis, SC, Brazil;* 

*<sup>3</sup> NMR Laboratory, Federal University of São Carlos, São Carlos-SP;* 

*<sup>4</sup> Chemistry Department, University of Aveiro – Campus Santiago, Aveiro - Portugal* 

Metabolomics and Chemometrics as Tools

others are gummy and elastic.

Brazil.

developed and cultured in southern regions of Brazil.

for Chemo(bio)diversity Analysis - Maize Landraces and Propolis 255

superior genotypes, as further described in the first part of this chapter for maize landraces

In a second part of this chapter is described the adoption of a typical metabolomic platform, i.e., FTIR and UV-visible spectroscopies coupled to chemometrics, for discriminating propolis samples produced in southern Brazil, a region of huge plant biodiversity. Propolis is typically a complex matrix and has been recognized for its broad pharmacological activities (anti-inflammatory, antibacterial, antifungal, anticancer, and antioxidant, e.g.) since ancient times. Propolis (registration number chemical abstracts service - CAS 9009-62- 5) is a beekeeping resinous and complex product, with a variable physical appearance, collected and transformed by honey bees, *Apis mellifera*, from the vegetation they visit. It may be ochre, red, brown, light brown or green; some are friable and steady, while the

Phenolics such as flavonoids and phenol-carboxylic acids are strategic components in propolis to render it bioactive against several pathogenic microorganisms, for instance as bacteriostatic and/or bactericidal agents. The flora (buds, twigs, bark, and less importantly, flowers) surrounding the hive is the basic source for the phenolics stuff and thus exerts an outstanding importance on the propolis final composition and on its physical, chemical, and biological properties. Although the wax component is an unquestionable supplement provided by the bee secretory apparatus by far less is known about the degree of intensity that these laborious insects play changing all the other chemical constituents collected in the Nature including minor ingredients like essential oils (10%), most of them responsible for the delicate and pleasant odor. All this flora contribution to propolis and the exact wax content may then explain physical properties such as color, taste, texture, melting point, and more importantly, from the health standpoint, a lot of pharmaceutical applications. However, for purpose of industrial applications, the propolis hydroalcoholic extract needs to meet specific composition in order to guarantee any claimed pharmacological activity. One common method used by the industry for quality control is analyzing the propolis sample for the presence of chemical markers known to be present in the specific propolis product they market. Even though this has been the acceptable method for quality control, the presence of the chemical markers do not always guarantee an individual is getting the actual propolis stated by the product label, especially if the product has been spiked with the chemical markers. The quantitation method for the chemical markers will confirm the compounds presence, but it may not confirm the presence of the propolis known to contain the chemical markers. 
Authentication of the propolis material may be possible by a chemical fingerprint of it and, if possible, of its botanical sources. Thus, chemical fingerprinting, i.e., metabolomics and chemometrics, is an additional method that has been claimed to be included in the quality control process in order to confirm or deny the propolis sample quality being used for manufacturing of a derived product of that resinous and complex matrix. The second part of this chapter aims to demonstrate the possibility of a FTIR and UV-vis metabolomic-chemometrics approach to identify and classify propolis samples originating from nineteen geographic regions (Santa Catarina State, southern Brazil) in different classes, on the basis of the concerted variation in metabolite levels detected by those spectroscopic techniques. Exploratory data analysis and patterns of chemical composition based on, for instance, principal component analysis, as well as discriminating models will be described in order to unravel propolis chemotypes produced in southern

Metabolomics\* and chemometrics† have been used in a number of areas to provide biological information beyond the simple identification of cell constituents. These areas include:

a. Fingerprinting of species, genotypes or ecotypes for taxonomic or biochemical (gene discovery) purposes;

b. Monitoring the behavior of specific classes of metabolites in relation to applied exogenous chemical and/or physical stimuli;

c. Studying developmental processes such as establishment of symbiotic associations or fruit ripening;

d. Comparing and contrasting the metabolite content of mutant or transgenic plants with that of their wild-type counterparts.
In a general sense, strategies to obtain biological information in the above-mentioned areas have focused on the analysis of metabolic differences that evidence responses to a range of extrinsic (environmental) and intrinsic (genetic) stimuli. Since no single analytical method has been found to yield a complete picture of the metabolome of an organism, a combination of advanced analytical techniques (e.g., GC-MS, LC-MS, FTIR, 1H- and 13C-NMR, FT-MS) coupled to chemometrics, i.e., univariate (ANOVA, correlation analysis, regression analysis) or multivariate (PCA, HCA, PLS) statistical techniques, has been employed to rapidly identify up- or down-regulated endogenous metabolites in complex matrices such as plant extracts, flours, starches, and biofluids. Plant extracts are recognized to be a complex matrix containing a wide range of primary and secondary metabolites that vary according to, for example, environmental conditions, genotype, developmental stage, and agronomic traits. Such a complex matrix has long been used to characterize plant genotypes growing in a given geographic region and/or subjected to external stimuli, giving rise to additional information of interest for, e.g., plant genetic breeding programs, local biodiversity conservation, the food industry, and quality control in drug development/production processes.

In the former case, programs for the genetic breeding of plants have often focused on the analysis of landrace‡ genotypes (i.e., creole and local varieties), aiming to identify individuals well adapted to specific local environmental conditions (soil and climate) and with superior agronomic performance and biomass yield. Indeed, the analysis and exploitation of the diversity of local genotypes has long been used as a strategy to improve agronomic traits by conventional breeding methods in plant crops of economic interest, as well as to stimulate the preservation of plant genetic resources. Considering that a series of primary (e.g., proteins and starch) and secondary metabolites (alkaloids, phenolic acids, and carotenoids, for instance) are well-recognized compounds associated with the plants' mechanisms of adaptation to the ecological factors of their surroundings, metabolomics and chemometrics have emerged as an interesting approach to help the selection of superior genotypes, as further described in the first part of this chapter for maize landraces developed and cultured in southern regions of Brazil.

<sup>\*</sup> *Metabolomics*: a quantitative and qualitative survey of all the metabolites of an organism or tissue; it thus reflects the genome and proteome of a sample.

<sup>†</sup> *Chemometrics*: according to the definition of the Chemometrics Society, it is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analyzing chemical data.

<sup>‡</sup> Landraces are genotypes with a high capacity to tolerate biotic and abiotic stress, resulting in high yield stability and an intermediate yield level under a low input agricultural system.


The second part of this chapter describes the adoption of a typical metabolomic platform, i.e., FTIR and UV-visible spectroscopies coupled to chemometrics, for discriminating propolis samples produced in southern Brazil, a region of huge plant biodiversity. Propolis is typically a complex matrix and has been recognized since ancient times for its broad pharmacological activities (e.g., anti-inflammatory, antibacterial, antifungal, anticancer, and antioxidant). Propolis (Chemical Abstracts Service registration number CAS 9009-62-5) is a resinous and complex beekeeping product, with a variable physical appearance, collected and transformed by honey bees, *Apis mellifera*, from the vegetation they visit. It may be ochre, red, brown, light brown, or green; some samples are friable and firm, while others are gummy and elastic.

Phenolics such as flavonoids and phenol-carboxylic acids are strategic components that render propolis bioactive against several pathogenic microorganisms, for instance as bacteriostatic and/or bactericidal agents. The flora surrounding the hive (buds, twigs, bark, and, less importantly, flowers) is the primary source of these phenolics and thus strongly influences the final composition of propolis and its physical, chemical, and biological properties. Although the wax component is unquestionably a supplement provided by the bees' secretory apparatus, far less is known about the extent to which these industrious insects modify the other chemical constituents collected in nature, including minor ingredients such as essential oils (about 10%), most of them responsible for the delicate and pleasant odor. This floral contribution to propolis, together with the exact wax content, may then explain physical properties such as color, taste, texture, and melting point and, more importantly from the health standpoint, many pharmaceutical applications. However, for industrial applications the propolis hydroalcoholic extract needs to meet a specific composition in order to guarantee any claimed pharmacological activity. One common quality control method used by the industry is to analyze the propolis sample for the presence of chemical markers known to occur in the specific propolis product being marketed. Even though this has been the accepted method for quality control, the presence of the chemical markers does not always guarantee that an individual is getting the actual propolis stated on the product label, especially if the product has been spiked with the chemical markers. Quantitation of the chemical markers will confirm the presence of the compounds, but it may not confirm the presence of the propolis known to contain them.
Authentication of the propolis material may be possible through a chemical fingerprint of the material and, if possible, of its botanical sources. Thus, chemical fingerprinting, i.e., metabolomics and chemometrics, is an additional method that has been proposed for inclusion in the quality control process, in order to confirm or deny the quality of the propolis sample being used for manufacturing a derived product of that resinous and complex matrix. The second part of this chapter aims to demonstrate the ability of an FTIR and UV-vis metabolomic-chemometric approach to identify and classify propolis samples originating from nineteen geographic regions (Santa Catarina State, southern Brazil) into different classes, on the basis of the concerted variation in metabolite levels detected by those spectroscopic techniques. Exploratory data analysis and patterns of chemical composition based on, for instance, principal component analysis, as well as discriminating models, will be described in order to unravel the propolis chemotypes produced in southern Brazil.
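As a deliberately simplified illustration of the kind of discriminating model mentioned above, a nearest-centroid classifier assigns each spectrum to the class whose mean spectrum it most resembles. The two simulated "classes" below differ in a single absorption band; real models for this task (e.g., on FTIR or UV-vis data) would be built on measured spectra and validated on held-out samples:

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_centroid_fit(X, y):
    """Class centroids: the mean spectrum of each class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    """Assign each spectrum to the class with the closest centroid."""
    labels = list(centroids)
    d = np.stack([np.linalg.norm(X - centroids[lab], axis=1) for lab in labels])
    return np.array(labels)[d.argmin(axis=0)]

# Simulated two-class spectra: the classes differ in one absorption band
wn = np.linspace(3000, 600, 500)
band = lambda c: np.exp(-((wn - c) / 25) ** 2)
X0 = band(1732) + 0.05 * rng.normal(size=(20, wn.size))
X1 = band(1022) + 0.05 * rng.normal(size=(20, wn.size))
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

centroids = nearest_centroid_fit(X, y)
print((nearest_centroid_predict(centroids, X) == y).mean())  # training accuracy
```

The same fit/predict pattern carries over to the more powerful discriminant methods (e.g., PLS-DA or SIMCA) typically used in chemometric classification.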

Metabolomics and Chemometrics as Tools for Chemo(bio)diversity Analysis - Maize Landraces and Propolis

#### **2. Maize: metabolomic and chemometric analyses for the study of landraces**

Maize (*Zea mays* L.) was chosen as a model for metabolomic analysis because, although most of this cereal produced worldwide is used for animal feed, an important amount is also used in the human diet and for industrial purposes, providing raw material for food, pharmaceutical, and cosmetics production. The maize grain is composed of several chemicals of commercial value, and the diversity of its applications depends on differences in relative chemical composition, e.g., protein, oil, and starch contents, traits that show prominent genetic components (Baye et al., 2006; White, 2001). Over the last centuries, farmers have created thousands of maize varieties suitable for cultivation in numerous environments. Accordingly, it seems consensual that the maize landraces' phenotypes, e.g., morphological and agronomic traits and the special chemical characteristics of grains, result from the domestication process. Thus, high-throughput metabolomic analysis of maize genotypes could improve knowledge of the metabolic singularities of landraces, helping their characterization and evaluation and indicating new alternatives for their use. In this context, distinguishing metabolic profiles requires diverse analytical tools, such as spectroscopic and chromatographic techniques. Techniques that are reproducible, stable over time, and do not require complex sample preparation, such as infrared vibrational spectroscopy and nuclear magnetic resonance spectroscopy, are desirable for metabolic profiling.

#### **2.1 Metabolic profiling of maize landraces through FTIR-PCA – integral and degermed flours**

Vibrational spectroscopy, and particularly Fourier transform infrared spectroscopy (FTIR), is of particular interest when one aims to discriminate and classify maize landraces according to their chemical traits. FTIR is a physicochemical method that measures the vibrations of bonds within functional groups and generates a spectrum that can be regarded as a metabolic fingerprint. It is a flexible method that can quickly provide qualitative and quantitative information on complex biological matrices with minimal or no sample preparation (Ferreira et al., 2001). On the other hand, an FTIR spectrum is complex, containing many variables per sample and making visual analysis very difficult. Hence, to extract useful information from the whole spectra, multivariate data analysis is needed, particularly principal component analysis (PCA; Fukusaki & Kobayashi, 2005). Such a multivariate technique allows the characterization of sample relationships (score planes or axes) and the recovery of their subspectral profiles (loadings). This approach was applied to classify flour samples from whole (integral) and degermed maize grains of twenty-six landraces developed and cultivated by small farmers in the far-west region of Santa Catarina State, southern Brazil (Anchieta County, 26º31'11''S, 53º20'26''W).

Prior to multivariate analysis, FTIR spectra were normalized and baseline-corrected in the region of interest by drawing a straight line, before resolution enhancement (k factor of 1.7) was applied using Fourier self-deconvolution (OPUS v. 5.0, Bruker Biospin GmbH, Rheinstetten, Germany). Chemometric analysis used the normalized, baseline-corrected (3000–600 cm-1, 1700 data points), and deconvoluted spectra, which were transferred via the JCAMP format (OPUS v. 5.0, Bruker Biospin GmbH, Rheinstetten, Germany) into the data analysis software for PCA (The Unscrambler v. 9.1, CAMO Software Inc., Woodbridge, USA).
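The straight-line baseline correction and normalization steps just described can be sketched in a few lines of code. This is a minimal illustration on a synthetic spectrum (the wavenumber grid, peak position, and baseline slope are invented for the example), not the OPUS implementation:

```python
import numpy as np

def straight_line_baseline(spectrum):
    """Subtract the straight line joining the first and last points
    of the selected spectral region (a crude baseline correction)."""
    line = np.linspace(spectrum[0], spectrum[-1], len(spectrum))
    return spectrum - line

def max_normalize(spectrum):
    """Scale the spectrum so its maximum absolute absorbance is 1."""
    return spectrum / np.max(np.abs(spectrum))

# Synthetic absorbance spectrum over 3000-600 cm-1 (1700 points, as in the text):
# a single band near 1740 cm-1 sitting on a sloping linear baseline
wavenumbers = np.linspace(3000, 600, 1700)
spectrum = np.exp(-((wavenumbers - 1740) / 30) ** 2) + 0.001 * (3000 - wavenumbers)

corrected = max_normalize(straight_line_baseline(spectrum))
print(round(float(corrected.max()), 3))  # band maximum scaled to 1.0
```

Because the synthetic baseline is itself linear, the straight-line subtraction removes it exactly here; on real spectra this is only an approximation over a narrow region.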


Prior to PCA, each spectrum within the 3000–600 cm-1 region was standard normal variate (SNV) corrected.
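SNV correction autoscales each spectrum individually: its own mean is subtracted and the result is divided by its own standard deviation, which removes multiplicative scatter and offset differences between samples. A minimal sketch (the 26 × 1700 data matrix here is random and purely illustrative):

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)
    by its own mean and standard deviation."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(26, 1700))  # 26 samples x 1700 points
Xs = snv(X)
print(np.allclose(Xs.mean(axis=1), 0), np.allclose(Xs.std(axis=1), 1))  # True True
```

After SNV, every row has mean 0 and standard deviation 1, so subsequent PCA reflects spectral shape rather than overall intensity.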

Figure 1 shows a PCA scores scatter plot for flour samples from whole and degermed grains using the whole FTIR spectral window data set (3000–600 cm-1). The scores scatter plot (PC1 vs. PC2), which captures 93% of the data set variability, shows a clear discrimination between flour samples of whole and degermed grains.

Fig. 1. Principal component analysis scores scatter plot of the FTIR data set in the 3000–600 cm-1 spectral window of landrace maize flours of whole and degermed grains cultivated in southern Brazil.

The samples of whole grains grouped on the PC1+ axis seemed to be more discrepant in their chemical composition, appearing more scattered through the quadrants of the PCA representation. Figure 2 shows the loadings plot of PC1, revealing the most important wavenumbers, which explain the distinction of the samples found previously (scores scatter plot). The loadings indicated a prominent effect of the lipid components (2924, 2850, and 1743 cm-1) on the segregation observed. The two major structures of the grains are the endosperm and the germ (embryo), which constitute approximately 80% and 10% of the mature kernel dry weight, respectively. The endosperm is largely starch (approaching 90%), and the germ contains high levels of oil (30%) and protein (18%; Boyer & Hannah, 2001).
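Finding the wavenumbers that dominate a principal component, as done for the loadings plot of Figure 2, amounts to inspecting the entries of the loading vector with the largest absolute values. Below is a self-contained sketch using an SVD-based PCA on simulated spectra; the two injected bands (near 2924 and 1743 cm-1) merely stand in for the lipid signals, and all numbers are illustrative rather than taken from the actual data set:

```python
import numpy as np

rng = np.random.default_rng(1)
wavenumbers = np.linspace(3000, 600, 1700)

# Simulate 40 spectra in which two bands vary together with "concentration"
band = lambda c: np.exp(-((wavenumbers - c) / 15) ** 2)
concentration = rng.uniform(0, 1, size=(40, 1))
X = concentration * (band(2924) + band(1743)) + 0.01 * rng.normal(size=(40, 1700))

# PCA via SVD of the mean-centered data matrix; rows of Vt are the loadings
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings_pc1 = Vt[0]

# Wavenumbers carrying the largest absolute PC1 loadings
top = wavenumbers[np.argsort(np.abs(loadings_pc1))[::-1][:5]]
print(np.round(np.sort(top)))
```

With this construction, the top-loading wavenumbers cluster around the two simulated band centers, mirroring how the lipid bands were read off the PC1 loadings in the text.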

The greater chemical diversity observed in whole grains can be explained by the genetic variation of embryos resulting from sexual reproduction. Some authors suggest that the high level of genetic and epigenetic diversity observed in maize could be responsible for its great capacity to adapt to a wide range of ecological factors. Lemos (2010), analyzing the metabolic profile of leaf tissues of maize landraces from Anchieta County (southern Brazil), found prominent chemical variability among individuals of the same variety, although inter-variety variability was also observed.

Fig. 2. PC1 loadings plot of the FTIR spectra of maize flours of whole and degermed grains in the 3000–600 cm-1 wavenumber region.

#### **2.2 Starch recognition pattern of maize landraces by NMR spectroscopy and PCA**

The composition of maize grains can be heterogeneous in both the quantity and the quality of endosperm compounds such as starch, protein, and oil. In this context, a metabolomics approach coupled to chemometrics was successfully applied to the discrimination of starches from the twenty-six maize landraces studied. The starches were extracted from the flours with distilled water (1:70, w/v) under reflux (80 ºC, 1 h), precipitated with ethyl alcohol (12 h, 4 ºC), and oven-dried (55 ºC, until constant weight). Samples (50 mg) were dissolved in DMSO-d6 (0.6 mL) and 1H-NMR spectra were obtained under standard conditions. Sodium 3-trimethylsilylpropionate (TMSP-2,2,3,3-d4) was used as internal reference (0.0 ppm). Spectra were processed using 32768 data points by applying an exponential line broadening of 0.3 Hz for sensitivity enhancement before Fourier transformation, and were accurately phased, baseline-adjusted, and converted into the JCAMP format to build the data matrix. All calculations were carried out using the Pirouette software (v. 3.11, InfoMetrix, Woodinville, Washington, USA). PCA of the whole 1H-NMR data set (32,000 data points) was performed, including the spectra of amylose and amylopectin standards. The chemical structures and the purity of the amylose and amylopectin standards were confirmed by 13C-NMR spectroscopy. The PC1 (32%) vs. PC2 (28%) scores scatter plot allowed a clear segregation of the amylopectin standard and a discrimination of the maize flour samples into two groups (Fig. 3) along the PC1 axis.

Fig. 3. Classification plot of starch fractions of maize landraces from 1H-NMR data after PCA. The arrows show the amylose and amylopectin standards.

The Roxo 41 variety was located closer to the amylopectin standard, suggesting the predominance of that polysaccharide relative to amylose in its starch fraction. This result is in accordance with the PCA of the IR data set in the carbohydrate fingerprint region, which diagnosed the starch granules from Roxo 41 as having a superior amylopectin amount with respect to their amylose content (data not shown). On the other hand, the amylose standard was grouped with the MPA1 and Rajado 8 Carreiras varieties, suggesting that their starch granules contain a superior amount of that polysaccharide in their starch fraction.
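The 1H-NMR processing described in Section 2.2 (an exponential line-broadening window applied to the free induction decay before Fourier transformation) can be sketched as follows. The 0.3 Hz broadening matches the value quoted in the text, but the FID itself, the spectral width, and the resonance frequency are synthetic, chosen only for illustration:

```python
import numpy as np

def process_fid(fid, dwell_time, lb_hz=0.3):
    """Apply exponential line broadening (lb_hz, in Hz) to a FID and
    Fourier transform it to a frequency-domain spectrum."""
    t = np.arange(len(fid)) * dwell_time
    apodized = fid * np.exp(-np.pi * lb_hz * t)  # Lorentzian broadening window
    return np.fft.fftshift(np.fft.fft(apodized))

# Synthetic FID: one resonance at 200 Hz with T2 decay, 32768 points
n, sw = 32768, 8000.0            # number of points, spectral width in Hz
dt = 1.0 / sw                    # dwell time
t = np.arange(n) * dt
fid = np.exp(2j * np.pi * 200.0 * t) * np.exp(-t / 0.5)

spec = np.abs(process_fid(fid, dt))
freqs = np.fft.fftshift(np.fft.fftfreq(n, d=dt))
print(round(float(freqs[np.argmax(spec)])))  # peak recovered near 200 Hz
```

Phasing and baseline adjustment, also mentioned in the text, are further steps that this sketch omits.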

#### **3. Propolis: ATR-FTIR and UV-visible spectrophotometry coupled to chemometrics as an analytical approach for pattern recognition**

#### **3.1 Attenuated total reflectance-Fourier transform infrared spectroscopy**

Propolis (registration number chemical abstracts service - CAS 9009-62-5) is a sticky colored material, which honeybees collect from different plants exudates and modify in its hypopharyngeal glands, being used in the hive to fill gaps and to protect against invaders as insects and microorganisms. Raw propolis usually contains 50% resin and balsam, 30% wax,

Metabolomics and Chemometrics as Tools

propolis but also discriminating its geographical origin.

manufacturing of a derived product of that resinous and complex matrix.

the authentication and detection of adulteration of vegetable oils.

primary or secondary metabolites among the propolis samples.

(Wu et al*.,* 2008).

for Chemo(bio)diversity Analysis - Maize Landraces and Propolis 261

The search for faster screening methods capable of characterizing propolis samples of different geographic origins and composition has lead to the use of direct insertion mass spectrometric fingerprinting techniques (ESI-MS and EASI-MS), which has proven to be a fast and robust method for propolis characterization (Sawaya et al*.,* 2011), although this analytical approach can only detect compounds that ionize under the experimental conditions. Similarly, Fourier transform infrared vibrational spectroscopy (FTIR) has also demonstrated to be valuable to chemically characterize complex matrices such as propolis

10% aromatic oils, 5% pollen, and 5% other substances such as inorganic salts and amino acids. This resin has been used by humanity since ancient civilizations such as the Egyptians, Assyrians, Greeks, Romans, and Incas. Nowadays, a number of studies have confirmed important biological activities, such as antibacterial, antifungal, antiviral, antioxidant, anti-inflammatory, hepatoprotective, and antitumoral activity (for reviews see Bankova, 2009; Banksota et al., 2001; Castaldo & Capasso, 2002).

The appearance, texture, and chemical composition of propolis are highly variable and depend on the climate, the season, the bee species and, mainly, the local flora visited by bees to collect resin (Markham et al*.,* 1996). For this reason, comparing propolis samples from distinct regions can be like comparing extracts of two plants that belong to different taxonomical families (Bankova, 2005).

Propolis from Europe is the best-known type of propolis. In European regions with a temperate climate, bees obtain resin mainly from the buds of *Populus* species, and the main bioactive components are flavonoids (Greenaway et al*.,* 1990). In tropical countries, the botanical resources are much more variable than in temperate zones, so bees have many more possibilities for collecting resins; hence the chemical composition of tropical propolis is more variable and distinct from that of European propolis (Sawaya et al*.,* 2011). Different compounds have been reported in tropical propolis, such as terpenoids and prenylated derivatives of *p*-coumaric acid in Brazilian propolis (Marcucci, 1995), lignans in Chilean samples (Valcic et al*.,* 1998), and polyisoprenylated benzophenones in Venezuelan, Brazilian, and Cuban propolis (Cuesta-Rubio et al.*,* 1999; Marcucci, 1995).

In order to be accepted officially into the mainstream healthcare system and for industrial applications, propolis needs a chemical standardization that guarantees its quality, safety, efficacy, and provenance. The chemical diversity, caused mainly by the botanical origin, makes this standardization difficult. Since the chemistry and biological activity of propolis depend on its geographical origin, a proper method to discriminate that origin is needed (Bankova, 2005).

Chromatographic methods (e.g., HPLC, TLC, GC) are widely used for the identification and quantification of propolis compounds, but it is becoming clear that separating and evaluating all constituents of propolis is an almost impossible task (Sarbu & Mot, 2011). Even though the presence of chemical markers is considered an acceptable criterion for quality control, it does not always guarantee what is stated on the product label, especially if the product has been spiked with those markers. Besides, the literature has demonstrated that it is not possible to ascribe the pharmacological activity to a single compound: so far, no single propolis component has been shown to possess antibacterial activity higher than that of the total extract (Kujumgiev et al*.,* 1999; Popova et al*.,* 2004). An alternative is thus offered by fingerprinting methods, which analyze propolis samples non-selectively, as a whole.

Poplar propolis, for example, can be characterized by UV-visible spectrophotometric determination of three important groups of components (flavones and flavonols, flavanones and dihydroflavonols, and total phenolics; Popova et al*.*, 2004), but constraints of this analytical approach have been reported for propolis from tropical regions (Bankova & Marcucci, 2000).

The search for faster screening methods capable of characterizing propolis samples of different geographic origins and compositions has led to the use of direct-insertion mass spectrometric fingerprinting techniques (ESI-MS and EASI-MS), which have proven to be fast and robust for propolis characterization (Sawaya et al*.,* 2011), although this analytical approach can only detect compounds that ionize under the experimental conditions. Similarly, Fourier transform infrared (FTIR) vibrational spectroscopy has also proven valuable for chemically characterizing complex matrices such as propolis (Wu et al*.,* 2008).

In order to treat the propolis sample as a whole, rather than focusing only on marker compounds, chemometric methods are considered an important tool to analyze the huge data sets generated by non-selective analytical techniques such as UV-vis, MS, NMR, and FT-IR, generating information not only about the chemical composition of propolis but also discriminating its geographical origin.

Authentication of propolis material may be possible through a chemical fingerprint of the sample and, if possible, of its botanical sources. Thus, chemical fingerprinting, i.e., metabolomics combined with chemometrics, has been proposed as an additional quality control method to confirm, or not, the identity of the propolis sample used for manufacturing a product derived from that resinous and complex matrix.

Over the last decades, infrared (IR) vibrational spectroscopy has become well established as a useful tool for structure elucidation and quality control in several industrial applications. The development of Fourier transform (FT) IR and attenuated total reflectance (ATR) techniques has allowed rapid IR measurements of, for example, organosolvent extracts of plant tissues, edible oils, and essential oils (Damm et al., 2005; Lai et al., 1994; Schulz & Baranska, 2007). Because of the strong dipole moment of water, IR spectroscopy applications have mostly focused on the analysis of dried or non-aqueous plant matrices, and IR methods are currently widely used as a fast analytical technique for the authentication and detection of adulteration of vegetable oils.

ATR-FTIR spectroscopy was applied to propolis samples collected in autumn 2010 from nineteen geographic regions of Santa Catarina State (southern Brazil), in order to gain insight into the chemical profile of these complex matrices. FTIR spectroscopy measures the vibrations of bonds within functional groups and generates a spectrum that can be regarded as a metabolic fingerprint. A preliminary visual inspection, for an exploratory overview of the data, showed similar IR spectral profiles (3000–600 cm-1, Figure 4) for all the studied samples, revealing typical signals of, e.g., lipids (2910–2845 cm-1), monoterpenes (1732, 1592, 1114, 1022, 972 cm-1), sesquiterpenes (1472 cm-1), and sucrose (1122 cm-1; Schulz & Baranska, 2007). However, visual inspection of the spectra did not reveal a clear discriminating effect of any primary or secondary metabolites among the propolis samples.

A FTIR spectrum is complex, containing many variables per sample, which makes visual analysis very difficult. Hence, to extract additional useful information, i.e., latent variables, from the whole spectra, chemometric analysis was performed on the whole FTIR data set using principal component analysis (PCA) for an exploratory overview of the data. This method can reveal similarity/dissimilarity patterns among propolis samples, reducing the data dimensionality and simplifying the interpretation of results without losing the more relevant information (Fukusaki & Kobayashi, 2005; Leardi, 2003). The covariance matrix was chosen for the PCA calculation, since all variables were expressed in the same unit. By doing so, the magnitude differences were maintained, i.e., the data were not standardized, and the contribution of each variable to the distribution of the samples along the axes was directly proportional to its magnitude. For the purpose of the propolis chemical profile analysis, this kind of information is thought to be very useful, because wavenumbers with higher absorbances (higher metabolite concentrations) contribute more significantly to the distribution of objects in the PCA, introducing quantitative information besides the compositional information of the sample data.

Metabolomics and Chemometrics as Tools for Chemo(bio)diversity Analysis - Maize Landraces and Propolis

Fig. 4. IR spectral profile of propolis samples (autumn, 2010) produced in southern Brazil, according to the geographic regions of origin in Santa Catarina State. IR spectra are shown from top to bottom following the geographic provenance, i.e., 19 counties, of the propolis samples: Angelina *(ANG),* Balneário Gaivotas *(BG),* Bom Retiro *(BR1 and BR2),* Caçador *(Cç),* Campo-Erê *(CE),* Canoinhas *(CA),* Campos Novos *(CN),* Descanso *(DS),* José Boiteux *(JB),* Porto União *(PU),* Serra Alta *(SA),* São Joaquim *(SJ1 and SJ2),* São José do Cerrito *(SJC),* Urupema *(URU),* Vidal Ramos *(VR),* Florianópolis *(FLN),* and Xaxim *(XX).*

The principal component analysis of the whole spectral data set (3000–600 cm-1, 1700 data points) revealed that PC1 and PC2 captured 88% of the variability of the original IR data, and showed a peculiar lipid pattern (2914 cm-1 and 2848 cm-1, C-H stretching vibrations) for the samples from the northern region (*NR*) of Santa Catarina State. The climate in that region is typically mesothermic, humid subtropical with a mild summer and an annual average temperature of 17.2ºC–19.4ºC. On the other hand, the propolis produced in the highlands (1360 m altitude, average annual maximum and minimum temperatures of 18.9ºC and 9.3ºC, respectively) differed with respect to monoterpene (1114 cm-1 and 972 cm-1, -CH2) and sesquiterpene (1472 cm-1, CH2) compounds (Schulz & Baranska, 2007) – Figure 5. Although the *NR1* and *NR2* propolis samples both grouped in PC1-/PC2+, they differ somewhat in chemical composition, an effect attributed to the flora of those regions: mostly Atlantic Rainforest in *NR1*, whereas *NR2* shows extensive areas covered by artificial reforestation, i.e., *Eucalyptus* spp. and *Pinus* spp., furnishing distinct raw materials for propolis production.
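The covariance-matrix PCA described above can be sketched numerically. The snippet below is a minimal illustration on synthetic "spectra" (the band shapes and the 6×50 matrix are invented for the example, not taken from the propolis study): the data are mean-centered but not scaled, and an SVD of the centered matrix — equivalent to an eigendecomposition of the covariance matrix — yields scores, loadings, and the fraction of variance explained by each component.

```python
import numpy as np

# Toy 6x50 "spectra": a shared spectral shape plus noise, with an extra
# band for three samples to mimic a regional difference. Illustrative only.
rng = np.random.default_rng(0)
n_samples, n_points = 6, 50
base = np.sin(np.linspace(0, 3 * np.pi, n_points))
spectra = base + 0.1 * rng.standard_normal((n_samples, n_points))
spectra[:3] += 0.5 * np.exp(-np.linspace(-2, 2, n_points) ** 2)

def pca_covariance(X, n_components=2):
    """PCA on mean-centered data via SVD (equivalent to eigendecomposition
    of the covariance matrix). Returns scores, loadings, and the fraction
    of variance explained by each retained component."""
    Xc = X - X.mean(axis=0)            # mean-centering only: covariance PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained[:n_components]

scores, loadings, explained = pca_covariance(spectra)
print(f"PC1+PC2 explain {100 * explained.sum():.1f}% of the variance")
```

Plotting the two columns of `scores` against each other would reproduce the kind of scores scatter plot shown in Figure 5.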

Fig. 5. Principal component analysis scores scatter plot of the FTIR data set in the spectral window of 3000–600 cm-1 (1700 data points) of propolis samples produced in southern Brazil (Santa Catarina State). *NR1*, *NR2*, and *HL* refer to propolis samples originating from the northern and highland regions of Santa Catarina State, respectively. The calculations were carried out using The Unscrambler software (v. 9.1, Oslo, Norway). PC1 and PC2 account for 88% of the preserved variance.

Further chemometric analysis took into consideration the fact that propolis is a well-known source of phenolic compounds, e.g., phenolic acids and flavonoids. Indeed, phenolic compounds occur ubiquitously in most plant species and are part of the chemical constitution of propolis worldwide. IR spectroscopy allows phenolic compounds to be identified, since they show strong IR bands due to C-H wagging vibrations between 1260–1180 cm-1 and 900–820 cm-1 (Schulz & Baranska, 2007). The principal component calculations were performed for both the 1260–1180 cm-1 and the 900–820 cm-1 spectral windows, and PC1 and PC2 resolved about 96% of the spectral data variability. An interesting discrimination profile was detected, in which samples from the far-west (*FW*) region grouped distinctly from the northern (*NR1*) ones, which in turn differed from the highland (*HL*) propolis samples. Such findings can be explained to some extent by the flora of the studied geographic regions. In the northern and far-west regions of Santa Catarina State the Atlantic Rainforest is typically found, but the floristic composition varies with altitude, e.g., 240 m (*NR1*) versus 830 m (*FW*). Besides, while a mesothermic humid subtropical climate is found in *NR1*, the *FW* region is characterized by a temperate climate that determines a quite different composition of plant species. Finally, the *HL* region (1360 m altitude, temperate climate) is covered by the Araucaria Forest, where the parana pine (*Araucaria angustifolia*, *Gymnospermae*, *Araucariaceae*) occurs as a dominant plant species. *A. angustifolia* produces a resinous exudate rich in guaiacyl-type lignans, fatty acids, sterols (Anderegg & Rowe, 2009), phenolics, and terpenic acids that is thought to be used by the honey bee (*Apis mellifera*) for propolis production. Since the plant populations influence the propolis chemical composition, the discrimination profile detected by ATR-FTIR coupled to chemometrics seems to be an interesting analytical approach to gain insight into the effect of climatic factors and floristic composition on the chemical traits of this complex matrix.

#### **3.2 Ultraviolet-visible scanning spectrophotometry**

The combination of UV-visible spectrophotometric wavelength scans and chemometric (PCA) analysis seems to be a simple and fast way to prospect plant extracts. This analytical strategy proved fruitful for discriminating *habanero* peppers according to their content of capsaicinoids, the substances responsible for the pungency of their fruits (Davis et al., 2007).

Chemometric analysis was performed on the absorbance values of the total UV-visible data set (200 nm to 700 nm, 450 data points) for the propolis samples under study, using principal component analysis (PCA) for an exploratory overview of the data.

In a first approach, principal component analysis was run on both the correlation and the covariance matrix. If the correlation matrix is used, the data set is standardized (mean-centered and columns scaled to unit variance), decreasing the effect of differences in magnitude between variables and leading to a distribution of objects with equal influence from all variables. On the other hand, if the covariance matrix is used, the data are only mean-centered, retaining their original scale. The resulting distribution is then determined by both the composition and the magnitude of the variables, leading to a PCA representation more influenced by larger observed values (Manetti et al., 2004). A similar distribution of objects was found with the correlation and covariance matrices, with PC1 and PC2 resolving 91.2% and 96.3%, respectively, of the variability of the spectrophotometric data set. Thus, the covariance matrix was chosen for the PCA calculations, since all variables were expressed in the same unit. By doing so, the magnitude differences were maintained, i.e., the data were not standardized, and the contribution of each variable to the distribution of the samples along the axes was directly proportional to its magnitude. For the purpose of the chemical profile analysis of the propolis samples, this kind of information is thought to be very useful, because wavelengths with higher absorbances (higher metabolite concentrations) contribute more significantly to the distribution of objects in the PCA, introducing quantitative information besides the compositional information of the sample data.
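The correlation-versus-covariance choice can be made concrete with a small numerical sketch (the toy matrix and its column scales below are invented; only the mean-centering/autoscaling logic mirrors the text): autoscaling equalizes the influence of all variables, while plain mean-centering lets the large-magnitude variables dominate the leading components.

```python
import numpy as np

def explained_variance(X, standardize=False, n_components=2):
    """Fraction of total variance captured by the first components.
    standardize=True autoscales columns (correlation-matrix PCA);
    standardize=False only mean-centers them (covariance-matrix PCA)."""
    Xc = X - X.mean(axis=0)
    if standardize:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var[:n_components].sum() / var.sum()

# Toy data whose columns differ wildly in magnitude, as absorbance
# variables can: the first two columns dominate the total variance.
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5)) * np.array([100.0, 10.0, 1.0, 0.5, 0.1])

cov_ev = explained_variance(X, standardize=False)
cor_ev = explained_variance(X, standardize=True)
print(f"covariance PCA: {100 * cov_ev:.1f}%, correlation PCA: {100 * cor_ev:.1f}%")
```

On data like this, covariance PCA assigns almost all of the explained variance to the two large-magnitude columns, whereas correlation PCA spreads it over all five variables — the trade-off discussed above.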

Principal component analysis was performed using The Unscrambler software (v. 9.1, Oslo, Norway) and revealed a distribution of the propolis samples mostly along the PC1 axis (91% of the total sample variability), while PC2 (5% of the total sample variability) was a weaker discriminator of the objects. A clear separation of the samples according to the east-west axis of Santa Catarina State was found: propolis produced near the coastal regions (*CR1* and *CR2*) grouped in PC1+/PC2-, while the sample from the far-west region (*FW*) lay on the opposite side of the PC1 axis, along with the samples from the northern region (*PU*, *Cç*, and *CA*; Figure 6). Interestingly, propolis samples from the counties *SJ*, *URU*, *BR* (highland counties), and *ANG*, which show geographic proximity and a partly shared floral composition, appeared similar in their chemical profiles as determined by UV-visible scanning spectrophotometry, grouping in PC1+/PC2+.

Fig. 6. Principal component analysis scores scatter plot of the UV-visible data set in the spectral window of 200 nm to 700 nm (450 data points) of propolis samples produced in southern Brazil (Santa Catarina State). *CR1*, *CR2*, and *FW* refer to propolis samples originating from the coastal (*BG* and *FLN* Counties) and far-west (*CE* County) regions, respectively, of Santa Catarina State. The grouping of propolis samples with similar UV-visible scanning profiles regarding their (poly)phenolic composition is highlighted in the PC1+/PC2+ quadrant. PC1 and PC2 resolved 96% of the total variability of the spectral data set.

Metabolomics and Chemometrics as Tools for Chemo(bio)diversity Analysis - Maize Landraces and Propolis

High loadings associated with the wavelengths 394 nm, 360 nm, 440 nm, and 310 nm seemed to influence the observed distribution of the propolis samples and could be associated with the presence of (poly)phenolic compounds. In fact, the *λ*max for cinnamic acid and its derivatives is near 310-320 nm, while for flavonols it is usually around 360 nm (Tsao & Deng, 2004). Further chemical analysis of the total content of phenolics and flavonoids in the propolis originating from the counties *SJ*, *URU*, *BR*, and *ANG* revealed similar contents, with average concentrations of 1411.52 µg/mL and 4.61 µg/mL of those secondary metabolites, respectively, in the hydroalcoholic (70:30, v/v) extract. Such findings differed (P<0.05 – *Tukey* test) from the concentrations detected for the propolis samples produced in the coastal (793.67 µg/mL total phenolics and 2.82 µg/mL flavonoids) and far-west (952.97 µg/mL total phenolics and 0.59 µg/mL flavonoids) regions of Santa Catarina State, corroborating the PCA results herein shown.
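The loading-inspection step described above can be sketched briefly. The following is a hedged illustration (not the authors' code): synthetic spectra on an assumed 450-point wavelength grid, with absorption bands placed near 310 nm and 360 nm to mimic the phenolic features discussed in the text.

```python
# Illustrative sketch (synthetic data, not the authors' code): PCA of a
# (samples x wavelengths) UV-vis absorbance matrix, then inspection of the
# PC1 loadings to find the most influential wavelengths.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
wavelengths = np.arange(250, 700)  # assumed 450-point grid (nm)

def band(center, width):
    # Gaussian absorption band centred on `center` nm
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# 30 synthetic spectra with bands near 310 nm (cinnamic acids) and 360 nm (flavonols)
X = (rng.uniform(0.5, 2.0, (30, 1)) * band(310, 15)
     + rng.uniform(0.2, 1.5, (30, 1)) * band(360, 20)
     + rng.normal(0.0, 0.01, (30, wavelengths.size)))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # PCA mean-centres the data internally

# Wavelengths with the largest absolute PC1 loadings drive the sample spread
top5 = wavelengths[np.argsort(np.abs(pca.components_[0]))[-5:]]
print("variance explained:", pca.explained_variance_ratio_.round(2))
print("most influential wavelengths (PC1):", np.sort(top5))
```

On real data, plotting `pca.components_[0]` against the wavelength axis makes the influential spectral regions visible at a glance.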

Fig. 7. Dendrogram of propolis samples using average linkage with the Bray-Curtis dissimilarity measure. Calculations were based on the absorbance values in the UV-visible spectral window of 200 nm to 700 nm of propolis samples produced in Santa Catarina State - southern Brazil, autumn 2010.

In order to check the chemical similarity pattern of the propolis samples detected by PCA, further cluster analysis of the whole UV-vis absorbance data set, i.e., absorbance values from 200 nm to 700 nm (450 data points), was performed using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) based on the Bray-Curtis dissimilarity coefficient. UPGMA is a simple agglomerative, or hierarchical, clustering method used for the creation of phenetic trees, i.e., phenograms, hierarchical trees or dendrograms that indicate the degree of similarity among the samples/objects of interest, so that observations in the same cluster are similar in some sense. In the UPGMA method, after the two objects with the least dissimilarity fuse together, the arithmetic average of the dissimilarities between this new cluster and the remaining objects is calculated, which reduces the size of the original dissimilarity matrix. The procedure continues with the dissimilarity matrix being correspondingly reduced at each step. When the average between an object and a cluster is calculated, the method gives equal weight to the members of the cluster, i.e., it is unweighted. Thus, in the progressive reduction of the dissimilarity matrix, only relationships between groups, given equal weighting, are considered, which leads to a loss of information about the relationships between pairs of objects (Legendre, 1998; Singh, 2008).
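As a concrete sketch of this procedure (scipy's "average" linkage is exactly UPGMA), the following uses synthetic non-negative profiles as stand-ins for the propolis absorbance data:

```python
# UPGMA (unweighted average linkage) on Bray-Curtis dissimilarities, as used
# for the propolis dendrogram. Synthetic non-negative "spectra" stand in for
# the real absorbance data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import average, fcluster

rng = np.random.default_rng(1)
# Two artificial groups of samples (rows) over 450 "wavelengths" (columns)
X = np.vstack([rng.uniform(0.8, 1.0, (4, 450)),
               rng.uniform(0.1, 0.3, (4, 450))])

d = pdist(X, metric="braycurtis")  # condensed pairwise dissimilarity vector
Z = average(d)                     # UPGMA linkage matrix (feed to dendrogram())

# Cutting the tree into two clusters recovers the two groups
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws a hierarchical tree analogous to Fig. 7.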

The hierarchical tree of the similarity of the chemical profiles of the propolis samples is shown in figure 7. The findings suggest a grouping resembling that found by the PCA calculations with respect to the *SJ*, *URU*, *BR*, and *ANG* samples, as well as for the propolis originating from the coastal (*BG* and *FLN*) and northern regions (*CA, PU*, and *Cç*). Additionally, the UPGMA analysis also discriminated the propolis produced in the western (*AC, XX*, and *CN*) and far-west regions.

#### **4. Conclusions**


The chemo(bio)diversity analysis of maize landraces and propolis produced in southern regions of Brazil was successfully assessed using a typical metabolomic platform involving spectroscopic techniques (FTIR, 1H- and 13C-NMR, and UV-visible) and chemometrics. The huge amount of data afforded by those spectroscopic techniques was analyzed using multivariate statistical methods such as principal component analysis and cluster analysis, allowing extra information to be obtained on the metabolic profiles of the complex matrices under study.

The analytical approach described proved suitable for discriminating flour samples from whole and degermed maize, an issue thought to be important for the food, cosmetic, and pharmaceutical industries regarding the usage and quality control of that raw material. Similarly, the classification of maize landraces according to their starch traits is considered technologically relevant in order to optimize the usage of non-chemically modified starches in industrial processes, for instance.

The classification of Brazilian propolis according to chemical profile and geographic region seems to be relevant because that biomass is typically quite complex, making a complete characterization difficult and expensive to perform. By doing so, the propolis produced in southern Brazil might be better evaluated as to its potential usage in the cosmetic and pharmaceutical industries, taking into consideration its secondary metabolite

constituents, e.g., mono- and sesquiterpenes and phenolics. The coupling of the chemometric and spectroscopic techniques used is thought to be essential for detecting distinctive chemical traits of the propolis samples according to their geographic regions in a simple and fast way.

#### **5. Acknowledgment**

The authors are indebted to FAPESC, CNPq, and CAPES for financial support and fellowships.

#### **6. References**

Anderegg, RJ & Rowe, JW. (2009). Lignans, the major component of resin from Araucaria angustifolia knots. *International Journal of the Biology, Chemistry, Physics and Technology of Wood*, Vol. 28, pp.171–175. ISSN: 0018-3830

Bankova, V. (2005). Chemical diversity of propolis and the problem of standardization. *Journal of Ethnopharmacology*, Vol. 100, pp.114-117. ISSN: 0378-8741

Bankova, V & Marcucci, MC. (2000). Standardization of propolis: present status and perspectives. *Bee World*, Vol. 81, pp.182-188. ISSN: 0005-772X

Banksota, AH., Tezuka, Y & Kadota, S. (2001). Recent progress in pharmacological research of propolis. *Phytotherapy Research*, Vol. 15, pp.561-571. ISSN: 1099-1573

Baye, TM., Pearson, TC & Settles, AM. (2006). Development of a calibration to predict maize seed composition using single kernel near infrared spectroscopy. *Journal of Cereal Science*, Vol. 43, pp.236–243. ISSN: 0733-5210

Boyer, CD & Hannah, C. (2001). Kernel mutants of corn. In: Specialty corns. HALLAUER, AR. (Ed.). 2nd ed. pp. 153, CRC Press, London.

Castaldo, S & Capasso, F. (2002). Propolis, an old remedy used in modern medicine. *Fitoterapia*, Vol. 73, pp.S1-S6. ISSN: 0367-326X

Cuesta-Rubio, O., Cuellar, AC., Rojas, N., Velez, HC., Rastrelli, L & Aquino, R. (1999). A polyisoprenylated benzophenone from Cuban propolis. *Journal of Natural Products*, Vol. 62, pp.1013-1015. ISSN: 0974-5211

Damm, U., Lampen, P., Heise, HM., Davies, AN & Mcintyre, PS. (2005). Spectral variable selection for partial least squares calibration applied to authentication and quantification of extra virgin olive oils using Fourier transform Raman spectroscopy. *Applied Spectroscopy*, Vol. 59, pp.1286-1294. ISSN: 0003-7028

Davis, CB., Markey, CE., Busch, MA & Busch, KW. (2007). Determination of capsaicinoids in habanero peppers by chemometric analysis of UV spectral data. *Journal of Agricultural and Food Chemistry*, Vol. 55, pp.5925-5933. ISSN: 0021-8561

Ferreira, D., Barros, A., Coimbra, MA & Delgadillo, I. (2001). Use of FTIR spectroscopy to follow the effect of processing in cell wall polysaccharide extracts of a sun-dried pear. *Carbohydrate Polymers*, Vol. 45, pp.175–182. ISSN: 0144-8617

Fukusaki, E & Kobayashi, A. (2005). Plant metabolomics: potential for practical operation. *Journal of Bioscience and Bioengineering*, Vol. 100, pp.347–354. ISSN: 1389-1723

Greenaway, W., Scaysbrook, T & Whately, FR. (1990). The composition and plant origin of propolis: a report of work at Oxford. *Bee World*, Vol. 71, pp.107-118. ISSN: 0005-772X

Kujumgiev, A., Tsvetkova, I., Serkedjieva, YU., Bankova, V., Christov, R & Popov, S. (1999). Antibacterial, antifungal and antiviral activity of propolis of different geographical origins. *Journal of Ethnopharmacology*, Vol. 64, pp.235-240. ISSN: 0378-8741

Lai, YW., Kemsley, EK & Wilson, J. (1994). Potential of Fourier transform infrared spectroscopy for the authentication of vegetable oils. *Journal of Agricultural and Food Chemistry*, Vol. 42, pp.1154-1159. ISSN: 0021-8561

Leardi, R. (2003). Chemometrics in data analysis. In: A user-friendly guide to multivariate calibration and classification. Naes, T., Isaksson, T., Fearn, T & Davies, T (eds). NIR Publications, West Sussex.

Legendre, P. (1998). *Numerical Ecology*. Elsevier Science, New York.

Lemos, PMM. (2010). Partial analysis of the foliar metabolome of local maize (*Zea mays* L.) varieties and of its anti-tumoral effects *in vitro* and on the embryonic morphogenesis of *Gallus domesticus*. PhD thesis, Federal University of Santa Catarina, Brazil.

Manetti, C., Bianchetti, C., Bizarri, M., Casciani, L., Castro, C., D´Ascenzo, G., Delfini, M., Di Cocco, ME., Laganà, A., Miccheli, A., Motto, M & Conti, F. (2004). NMR-based metabonomic study of transgenic maize. *Phytochemistry*, Vol. 65, pp.3187-3198. ISSN: 0031-9422

Marcucci, MC. (1995). Propolis: chemical composition, biological properties and therapeutic activity. *Apidologie*, Vol. 26, pp.83-99. ISSN: 1297-9678

Markham, KR., Mitchell, KA., Wilkins, AL., Daldy, JA & Lu, Y. (1996). HPLC and GC-MS identification of the major organic constituents in New Zealand propolis. *Phytochemistry*, Vol. 42, pp.205-211. ISSN: 0031-9422

Popova, M., Bankova, V., Butovska, D., Petkov, V., Damynova, BN., Sabatini, AG., Marcazzan, GL & Bogdanov, S. (2004). Validated methods for the quantification of biologically active constituents of poplar-type propolis. *Phytochemical Analysis*, Vol. 15, pp.235-240. ISSN: 1099-1565

Sarbu, C & Mot, AC. (2011). Ecosystem discrimination and fingerprinting of Romanian propolis by hierarchical fuzzy clustering and image analysis of TLC patterns. *Talanta*, Vol. 85, pp.1112-1117. ISSN: 0039-9140

Sawaya, ACHF., Silva, IB & Marcucci, MC. (2011). Analytical methods applied to diverse types of Brazilian propolis. *Chemistry Central Journal*, Vol. 5, pp.1-10. ISSN: 1752-153X

Schulz, H & Baranska, M. (2007). Identification and quantification of valuable plant substances by IR and Raman spectroscopy. *Vibrational Spectroscopy*, Vol. 43, pp.13-25. ISSN: 0924-2031

Singh, W. (2008). Robustness of three hierarchical agglomerative clustering techniques for ecological data. M.Sc. thesis, University of Iceland, Iceland.

Tsao, R & Deng, Z. (2004). Separation procedures for naturally occurring antioxidant phytochemicals. *Journal of Chromatography B*, Vol. 812, pp.85-99. ISSN: 1570-0232

Valcic, S., Montenegro, G & Timmermann, BN. (1998). Lignans from Chilean propolis. *Journal of Natural Products*, Vol. 61, pp.771-775. ISSN: 0974-5211

White, PJ. (2001). Properties of corn starch. In: Specialty corns. HALLAUER, AR. (Ed.). 2nd ed. pp. 189, CRC Press, London.

Wu, YW., Sun, SQ., Zhao, Y., Li, Q & Zhou, J. (2008). Rapid discrimination of extracts of Chinese propolis and poplar buds by FT-IR and 2D IR correlation spectroscopy. *Journal of Molecular Structure,* Vol. 884, pp.48-54. ISSN: 0022-2860

## **Using Principal Component Scores and Artificial Neural Networks in Predicting Water Quality Index**

Rashid Atta Khan2, Sharifuddin M. Zain2, Hafizan Juahir1, Mohd Kamil Yusoff1 and Tg Hanidza T.I.1

*1Department of Environmental Science, Faculty of Environmental Studies, University Putra Malaysia, Serdang*

*2Chemistry Department, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia*

#### **1. Introduction**


The management of river water quality is a major environmental challenge. One of the main difficulties lies in determining point and non-point sources of pollutants. Industrial and municipal wastewater discharges can be considered constant polluting sources, unlike surface water runoff, which is seasonal and highly affected by climate. According to Aiken et al. (1982), 42 tributaries in Peninsular Malaysia are categorized as very polluted, including the Langat River. By 1999, there were about 13 polluted tributaries and 36 polluted rivers owing to human activities such as industry, construction and agriculture (Department of Environment, Malaysia (DOE), 1999). In 1990, 48 rivers were classified as clean, but that number was reduced to 32 by 1999 (Rosnani Ibrahim, 2001).

Surface water pollution is identified as the major problem affecting the Langat River Basin in Malaysia. The increase in developing areas within the river basin has in turn increased the pollution loading into the Langat River. To avoid further degradation, the DOE has installed telemetric stations along the river basin to continuously monitor the water quality. As a result, abundant data have been collected since 1988. There are 927 monitoring stations located within 120 river basins throughout Malaysia. The water quality data were used to determine the water quality status and to classify the rivers based on the water quality index (WQI) and the Interim National Water Quality Standards for Malaysia (INWQS). The WQI provides a useful way to predict changes and trends in the water quality by considering multiple parameters. The WQI is calculated from six selected water quality variables, namely dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), suspended solids (SS), ammoniacal nitrogen (AN) and pH (DOE, 1997). It is well known that the contribution of pollution loading into river systems involves a complex interaction of many factors (e.g. chemical, physical and meteorological). These primary pollutants are emitted from land use activities surrounding the river basin (e.g. agriculture, forest, urban, industrial and others). Rapid urbanization along the Langat River plays an important role in the increase of point source


(PS) and non-point source (NPS). In view of this complex interaction, the use of modelling techniques to solve this problem is needed. However, obtaining models that adequately represent the dynamic behaviour of field data is not easy. A lack of good understanding and description of the phenomena involved, the availability of reliable and complete field data sets, and the estimation of the numerous parameters involved are the major factors contributing to this problem. Beck (1986) noted that an increase in model complexity will undoubtedly increase the number of parameters, leading to problems of identification.

Applications of ANNs (Artificial Neural Networks) to environmental problems are becoming more common (Silverman and Dracup, 2000; Scardi, 2001; Recknagel et al., 2002; Bowden et al., 2005; Muttil and Chau, 2007). The application of ANNs, which are computing systems originally designed to simulate the structure and function of the brain (Rumelhart et al., 1986), is a relatively new concept in environmental modeling. If trained properly, a neural network model is capable of 'learning' linear as well as nonlinear features in the data (Elsner and Tronis, 1992).

An ANN consists of a set of simple processing units (neurons) arranged in a defined architecture and connected by weighted channels, which act, for example, to transform remotely-sensed data into a classification. The classification techniques of ANNs are unlike conventional ones: they are distribution-free, may sometimes use small training sets (Hepner et al., 1990) and, once trained, are computationally rapid, which is of value in processing large data sets (Gershon and Miller, 1993). Furthermore, ANNs have been shown to map land cover more accurately than many widely used statistical classification techniques (Benediktsson et al., 1990; Foody et al., 1995) and alternatives such as evidential reasoning (Peddle et al., 1994).

It has been proposed that the best tool to model non-linear environmental relationships is the ANN (Zhang and Stanley, 1997; Jain and Indurthy, 2003). Research has been undertaken at Imperial College, London, investigating the capability of the ANN approach in modelling spatial and temporal variations in river water quality (Clarici, 1995). ANNs were used as a predictive model for the cyanobacterium Anabaena spp. in the River Murray, South Australia (Maier et al., 1998). DeSilets et al. (1992) also used an ANN to predict salinity. Ha and Stenstrom (2003) proposed a neural network approach to examine the relationship between storm water quality and various types of land use.

ANNs have been successfully applied to the study of river water quality in Malaysia (Zarita Zainudin, 2001; Mohd Ekhwan Toriman and Hafizan Juahir, 2003; Hafizan Juahir et al., 2003a,b; Hafizan et al., 2004a,b; 2005; Ruslan Rainis et al., 2004). An approach for identifying possibilities for water quality improvement could be developed using this concept. Such information could provide opportunities for better river basin management to control river water pollution in Malaysia. In the Malaysian context, Hafizan Juahir et al. (2003a) showed that the ANN model gives better performance than the autoregressive integrated moving average (ARIMA) model in forecasting DO. The use of ANNs for river regulation (Mohd. Ekhwan Toriman and Hafizan Juahir, 2003) and the application of the second order back propagation method (Hafizan Juahir et al., 2004a) to the water quality of the Langat River have also been demonstrated.

In the natural environment, water quality is a multivariate phenomenon, at least as reflected in the multitude of constituents used to characterize the quality of a water body. Water quality is very difficult to model because of the different interactions between pollutants and meteorological variables. Principal component analysis (PCA) is one approach to dealing with this problem and has received increasing attention as an accepted method in environmental pattern recognition (Simeonov et al., 2003; Wunderline et al., 2001; Helena et al., 2000; Loska and Wiechula, 2003).

The objective of this study is to use the PCA method to classify predictor variables according to their interrelations, and to obtain a parsimonious prediction model (i.e., a model that depends on as few variables as necessary) for the WQI, with other physico-chemical and biological data as predictor variables, to model the water quality of the Langat River. For this purpose, principal component scores of 23 physico-chemical and biological water quality parameters were generated and selected appropriately as input variables in ANN models for predicting the WQI.
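The modelling strategy just outlined — PC scores fed to a neural network to predict the WQI — can be sketched as follows. This is a hedged illustration: the 23-variable, 305-observation shape matches the study, but the synthetic data, the surrogate index and the network settings are assumptions, not the authors' implementation.

```python
# Sketch: derive principal component scores from 23 water-quality variables
# and train a small neural network on them to predict a WQI-like index.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
latent = rng.normal(size=(305, 4))                        # hidden pollution factors
X = latent @ rng.normal(size=(4, 23)) + rng.normal(0.0, 0.1, (305, 23))
wqi = latent @ np.array([8.0, -5.0, 3.0, 2.0]) + 70.0     # surrogate index (assumed)

scores = PCA(n_components=6).fit_transform(X)             # PC scores replace raw variables
X_tr, X_te, y_tr, y_te = train_test_split(scores, wqi, test_size=0.25,
                                          random_state=0)

# One hidden layer with a few sigmoid neurons, mirroring the 3-6 neuron range reported
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                 solver="lbfgs", max_iter=5000, random_state=0),
).fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 2))
```

With real monitoring data, the varimax-rotated scores described in the methodology would simply replace `scores` as the network input.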

### **2. Methodology**


#### **2.1 The data and monitoring sites**

The water quality data in this study were obtained from seven stations along the main Langat River (Fig. 1).

Fig. 1. The seven water quality monitoring stations (Sb) along the main river selected in this study.

Using Principal Component Scores

variables.

WQI. The principal components (PCs) can be expressed as

and Artificial Neural Networks in Predicting Water Quality Index 275

the 23 water quality parameters were used as input variables in ANN model to predict the

Where *z* is the component score, *a* is the component loading, *x* the measured value of variable, *i* is the component number, *j* is the sample number and *m* is the total number of


The water quality monitoring stations are manned by the DOE and the Ministry of Natural Resources and Environment of Malaysia. The selected stations are listed in Table 1. The data used in the study are from September 1995 to May 2002. Seven sites were chosen, namely, Teluk Panglima Garang (site 7), Teluk Datok (site 6), Putrajaya (site 5), Kajang (site 4), Cheras (site 3), Hulu Langat (site 2), and Pangsoon and Ulu Lui (site 1). Sites 3 to 7 are located in the region of high pollution load, as several wastewater drains are situated in the middle and downstream parts of the Langat River basin. Site 2 is partly situated in the middle stream region, designated as moderately polluted. Site 1 and a part of site 2 are located upstream of the Langat River, in an area of relatively low river pollution. It is worth mentioning that some stations have missing data and not all stations were consistently sampled.

Although there are 30 water quality parameters available, only 23 completely monitored parameters were selected. A total of 254 samples were used for the analysis. The 23 water quality parameters were dissolved oxygen (DO), biological oxygen demand (BOD), electrical conductivity (EC), chemical oxygen demand (COD), ammoniacal nitrogen (AN), pH, suspended solids (SS), temperature (T), salinity (Sal), turbidity (Tur), dissolved solids (DS), total solids (TS), nitrate (NO3-), chlorine (Cl-), phosphate (PO43-), zinc (Zn), calcium (Ca), iron (Fe), potassium (K), magnesium (Mg), sodium (Na), E. coli and coliform.


| DOE Station No. | Study Code | Distance From Estuary (km) | Grid Reference | Location |
|---|---|---|---|---|
| 2814602 | Sb07 | 4.19 | 2° 52.027' N, 101° 26.241' E | Kampung Air Tawar (penghujung jalan) |
| 2815603 | Sb06 | 33.49 | 2° 48.952' N, 101° 30.780' E | Telok Datuk, near Banting Town |
| 2817641 | Sb05 | 63.43 | 2° 51.311' N, 101° 40.882' E | Bridge at Kampung Dengkil |
| 2918606 | Sb04 | 81.14 | 2° 57.835' N, 101° 47.030' E | Kajang bridge |
| 2917642 | Sb03 | 86.94 | 2° 59.533' N, 101° 47.219' E | Junction to Serdang, Cheras, near West Country Estate |
| 3017612 | Sb02 | 93.38 | 3° 02.459' N, 101° 46.387' E | At Batu 11 |
| 3118647 | Sb01 | 113.99 | 3° 09.953' N, 101° 50.926' E | Bridge at Batu 18 |

Table 1. DOE sampling stations at the study area.

#### **2.2 Principal component analysis**

In this work, PCA was performed on the above mentioned water quality parameters to rank their relative significance and to describe their interrelation patterns. Chosen PC scores of

the 23 water quality parameters were used as input variables in ANN model to predict the WQI. The principal components (PCs) can be expressed as

$$
z_{ij} = a_{i1}x_{1j} + a_{i2}x_{2j} + \dots + a_{im}x_{mj} \tag{1}
$$

where *z* is the component score, *a* the component loading, *x* the measured value of the variable, *i* the component number, *j* the sample number and *m* the total number of variables.

The PCs generated by PCA are sometimes not readily interpreted; it is therefore advisable to rotate the PCs by varimax rotation. Varimax rotation ensures that each variable is maximally correlated with only one PC and has a near-zero association with the other components (Abdul-Wahab et al., 2005; Sousa *et al*., 2007). Varimax rotation was applied to the PCs with eigenvalues greater than 1, which are considered significant (Kim and Mueller, 1987); the typical criterion is that these explain 75-95% of the total variance (Chen and Mynett, 2003). The rotations were carried out in order to obtain new groups of variables. Variables with communalities greater than 0.7 are considered to have significant factor loadings (Stevens, 1986).
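The steps above (autoscaling, eigendecomposition of the correlation matrix, the eigenvalue-greater-than-1 retention rule, varimax rotation, and the component scores of Eq. (1) with their communalities) can be sketched in a few lines. The chapter's own analyses used MATLAB 7.0 and XLSTAT, so the NumPy version below is only an illustrative sketch; the random matrix stands in for the 305 × 23 water quality data, which is not reproduced here.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix (variables x factors)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated * ((rotated ** 2).sum(axis=0) / p)))
        rotation = u @ vt
        if s.sum() - var_old < tol:
            break
        var_old = s.sum()
    return loadings @ rotation

rng = np.random.default_rng(0)
X = rng.normal(size=(305, 23))        # placeholder for the 305 x 23 data matrix

# Autoscale, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigval)[::-1]      # eigh returns eigenvalues in ascending order
eigval, eigvec = eigval[order], eigvec[:, order]

keep = eigval > 1.0                   # retain only PCs with eigenvalue > 1
loadings = eigvec[:, keep] * np.sqrt(eigval[keep])   # component loadings
rotated = varimax(loadings)           # varimax-rotated loadings (RPCs)
scores = Z @ rotated                  # component scores, cf. Eq. (1)
communality = (rotated ** 2).sum(axis=1)   # communality of each variable
```

Because the rotation is orthogonal, each variable's communality (its summed squared loadings) is unchanged by the rotation, which is why communalities can be used to judge how well the retained PCs cover each variable.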

#### **2.3 Artificial neural networks for WQI prediction**

In this work, the back-propagation (BP) ANN was used in the development of all the prediction models. The activation transfer function of a back-propagation network is usually a differentiable sigmoid (S-shaped) function, which allows a non-linear mapping from inputs to outputs. A three-layer back-propagation ANN is used in this study. The number of input and output neurons is determined by the nature of the problem under study. The networks were trained, tested and validated with one hidden layer and 1 to 10 hidden neurons. This choice was based on the work of Jiang et al. (2004), who found that the results with one hidden layer were better than those with two hidden layers, and that the best performance was obtained using a structure with 3 to 6 neurons in the hidden layer. The output neuron (layer) gives the predicted WQI value.
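A minimal three-layer back-propagation network of the kind described here, with a sigmoid hidden layer and a single linear output neuron for WQI, can be sketched as follows. This is not the authors' MATLAB implementation; the synthetic data, learning rate and hidden-layer size are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeLayerBP:
    """Input layer -> sigmoid hidden layer -> one linear output neuron."""
    def __init__(self, n_in, n_hidden, lr=0.3):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)   # hidden activations
        return self.h @ self.W2 + self.b2         # linear output (predicted WQI)

    def train_step(self, X, y):
        """One gradient-descent step on the mean squared error."""
        yhat = self.forward(X)
        err = yhat - y[:, None]                   # dE/dyhat for squared error
        dW2 = self.h.T @ err / len(X)
        db2 = err.mean(axis=0)
        dh = (err @ self.W2.T) * self.h * (1.0 - self.h)   # back-propagate through sigmoid
        dW1 = X.T @ dh / len(X)
        db1 = dh.mean(axis=0)
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
        return float((err ** 2).mean())

# Hypothetical stand-in for six rotated-PC scores and the WQI target:
X = rng.normal(size=(305, 6))
y = X @ rng.normal(size=6) + rng.normal(0.0, 0.1, 305)

net = ThreeLayerBP(n_in=6, n_hidden=6)
losses = [net.train_step(X, y) for _ in range(1500)]
```

Varying `n_hidden` from 1 to 10 and keeping the network with the best test-set correlation mirrors the selection procedure described above.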

Two different types of ANN models were developed. In the first type, prediction was performed based on the original PCs. In the second type of ANNs developed, scores of rotated (varimax rotation) PCs (ANN-RPCs) with eigenvalues greater than 1 were selected as input. For this model, prediction of WQI was performed using two to six rotated principal components separately.

The original PCs and rotated PCs (RPCs) data sets consist of 305 observations (305 rows) and are divided into training, testing and validating phases for WQI prediction. The ANN predicted WQI values are compared to the WQI values calculated using the DOE-WQI formula which is based on 6 water quality parameters, namely the DO, COD, BOD, AN, SS and pH (DOE, 1997). The input data matrix consists of 23 water quality variables (column) and 305 observations (rows) [23×305]. The observed data for each station is arranged according to time of observation from September 13, 1995 to June 7, 2002. Table 2 describes the data structure. The validation data is at least 10% of the whole data set, with 75% training set and 25% testing set data (Kuo et al., 2007).
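One plausible reading of the split described above, with at least 10% of the observations held out for validation and the remainder divided 75%/25% into training and testing sets, is sketched below; the authors do not specify the exact partitioning procedure, so this is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 305                                # total number of observations
idx = rng.permutation(n)               # shuffle observation indices

n_val = int(np.ceil(0.10 * n))         # at least 10% reserved for validation
val, rest = idx[:n_val], idx[n_val:]

n_train = round(0.75 * len(rest))      # 75% / 25% train / test split of the rest
train, test = rest[:n_train], rest[n_train:]
```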

| No. of observations | Input1 | Input2 | Input3 | … | Input23 | Output1 |
|---|---|---|---|---|---|---|
| 1 | Obs1,1 | Obs1,2 | Obs1,3 | … | Obs1,23 | O1,1 |
| 2 | Obs2,1 | Obs2,2 | Obs2,3 | … | Obs2,23 | O2,1 |
| … | … | … | … | … | … | … |
| 305 | Obs305,1 | Obs305,2 | Obs305,3 | … | Obs305,23 | O305,1 |

Table 2. The data structure for the ANN prediction model.

#### **2.4 Determination of model performance**

The model's behaviour in both the learning (training and testing) and validation phases is evaluated using the following statistical measures: the correlation coefficient (R) at the 95% confidence limit, given by

$$r = \frac{\sum_{i=1}^{n} x_i \hat{x}_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum \hat{x}_i\right)}{\sqrt{\left[\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right]\left[\sum \hat{x}_i^2 - \frac{1}{n}\left(\sum \hat{x}_i\right)^2\right]}} \tag{2}$$

and the mean bias error, or residual error, given by

$$MBE = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i - x_i\right) \tag{3}$$

where $x_i$ and $\hat{x}_i$ represent the observed values and the corresponding forecast values, for i = 1, 2, …, n.

These two measures are used to evaluate the accuracy of the forecasts and to compare the forecasting ability of each approach.

The 95% confidence limit is used to determine whether the predicted outputs lie within the confidence range. It is assumed that a predicted value falls into an interval with which an uncertainty is associated. According to Wackerly et al. (1996), this uncertainty is derived from the residual errors already calculated within that range of values. If the residual errors are randomly distributed, a general rule of thumb states that they will lie within two standard deviations of their mean with a probability of 0.95. This method has been used to measure ANN prediction performance by several researchers (Bishop, 1995; Tibshirani, 1996; Shao et al., 1997; Zhang et al., 1998; Lowe and Zapart, 1999; Townsend and Tarassenko, 1999).

ANN models and statistical analyses were carried out using MATLAB 7.0 and XLSTAT2008 (Excel2003 add-in) for Windows.

#### **3. Results and discussion**

Post PCA, out of the 23 principal components generated, only the six PCs with eigenvalues higher than 1 (Table 3) were selected as ANN input parameters. The selected PCs explain 75.1% of the total variation. Furthermore, communality values were high for the selected PCs; for example, 93% for Cond., 95% for Sal. and 98% for DS and TS (Table 4). These results further confirm the choice of the selected number of PCs (Stevens, 1986).

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|---|---|---|---|---|---|---|
| **Eigenvalue** | 9.074 | 2.387 | 2.067 | 1.492 | 1.225 | 1.026 |
| **Variability (%)** | 39.451 | 10.380 | 8.987 | 6.488 | 5.326 | 4.459 |
| **Cumulative %** | 39.451 | 49.830 | 58.817 | 65.305 | 70.631 | 75.091 |

Table 3. Descriptive statistics of selected original PCs with eigenvalues more than 1.

For the first six rotated PCs (RPCs), the loadings from PCA are given in Table 4, with the highest correlations between variables and components noted in bold. For instance, Cond., Sal., DS, TS, Cl, Ca, K, Mg and Na have high correlations with RPC1. Eighteen variables with strong loadings were included in the six selected RPCs. Significant variables in RPC1 are Cond., Sal., DS, TS, Cl, Ca, K, Mg and Na; in RPC2 they are DO, BOD and AN; in RPC3, SS and Tur; and in RPC4, NO3- and PO43-. The only meaningful loadings in RPC5 and RPC6 are pH and Zn, respectively.

| Variables | RPC1 | RPC2 | RPC3 | RPC4 | RPC5 | RPC6 | Communalities |
|---|---|---|---|---|---|---|---|
| DO | -0.205 | **-0.722** | -0.121 | 0.046 | 0.485 | -0.066 | 0.82 |
| BOD | 0.035 | **0.740** | 0.071 | 0.110 | 0.110 | 0.022 | 0.58 |
| COD | 0.340 | 0.103 | 0.081 | -0.166 | 0.268 | 0.326 | 0.34 |
| SS | -0.042 | -0.009 | **0.920** | 0.010 | -0.025 | 0.017 | 0.85 |
| pH | 0.189 | -0.109 | -0.204 | 0.020 | **0.792** | -0.083 | 0.72 |
| AN | -0.092 | **0.797** | -0.151 | 0.161 | 0.023 | -0.032 | 0.69 |
| T | 0.337 | 0.368 | -0.242 | -0.298 | -0.317 | 0.208 | 0.54 |
| Cond. | **0.963** | 0.022 | -0.043 | 0.035 | 0.013 | -0.022 | 0.93 |
| Sal. | **0.974** | 0.023 | -0.038 | 0.030 | 0.008 | -0.004 | 0.95 |
| Tur. | -0.031 | -0.007 | **0.863** | 0.011 | -0.140 | -0.035 | 0.77 |
| DS | **0.988** | 0.017 | -0.034 | 0.013 | 0.009 | -0.005 | 0.98 |
| TS | **0.985** | 0.017 | 0.069 | 0.014 | 0.007 | -0.003 | 0.98 |
| NO3- | 0.018 | 0.033 | 0.107 | **0.688** | -0.126 | 0.300 | 0.59 |
| Cl | **0.986** | 0.010 | -0.029 | -0.004 | 0.020 | 0.005 | 0.97 |
| PO43- | 0.023 | 0.312 | -0.106 | **0.700** | 0.112 | -0.073 | 0.62 |
| Zn | -0.019 | 0.044 | -0.011 | 0.186 | -0.128 | **0.767** | 0.64 |
| Ca | **0.980** | 0.028 | -0.026 | -0.043 | -0.024 | 0.039 | 0.97 |
| Fe | -0.080 | 0.043 | 0.475 | 0.540 | 0.066 | 0.192 | 0.57 |
| K | **0.984** | 0.004 | -0.031 | -0.004 | 0.004 | 0.010 | 0.97 |
| Mg | **0.974** | 0.000 | -0.022 | -0.028 | -0.002 | 0.037 | 0.95 |
| Na | **0.986** | 0.002 | -0.025 | -0.020 | 0.005 | 0.017 | 0.97 |
| COLI | -0.254 | 0.361 | 0.097 | -0.424 | 0.457 | 0.056 | 0.60 |
| COLIFORM | -0.032 | 0.049 | -0.025 | 0.042 | -0.077 | -0.517 | 0.28 |

Table 4. Rotated factor loadings using six PCs.


Using the original principal component scores as inputs, the best architecture consists of a three-layer network with 23 input neurons, 10 neurons in the hidden layer and one neuron in the output layer. With RPC scores as inputs, the best architectures were achieved with almost the same number of hidden neurons: 9 and 10 neurons, respectively. Training was carried out for a maximum of 10000 iterations. Selection of the network was performed at the maximum correlation coefficient (R) and the 95% confidence limit.

Table 5 and Figure 2 illustrate the prediction performances of the ANN models using different combinations of PC scores as input variables. The ANN using the first two PCs (PC1 and PC2) does not perform well in terms of accuracy in any of the training, testing and validation phases, and the prediction performance in the validation phase is slightly worse than in the training and testing phases. It is important to point out that for this model the cumulative percentage of the variance explained by these two RPCs is only 49.8%, and none of the strong-loading variables are among the variables forming the WQI equation; the DO, BOD and pH loadings in PC2 explain only 10.4% of the total variance.

Based on the results, it is apparent that the WQI prediction performance increases with the number of input variables. The highest accuracy in predicting WQI is given by model ANN-RPC6, which contains six RPCs with 75.1% of the variation explained, giving R² values of 0.64 (training), 0.87 (testing) and 0.72 (validation).


| Model | No. of PC | R² Training | R² Testing | R² Validation | MBE Training | MBE Testing | MBE Validation |
|---|---|---|---|---|---|---|---|
| ANN-RPC2 (2 inputs) | 2 | 0.43 | 0.70 | 0.32 | 28.01 | -167.90 | -40.71 |
| ANN-RPC3 (3 inputs) | 3 | 0.60 | 0.78 | 0.61 | 64.95 | -109.68 | 6.60 |
| ANN-RPC4 (4 inputs) | 4 | 0.53 | 0.79 | 0.47 | 0 | -165.04 | -89.78 |
| ANN-RPC5 (5 inputs) | 5 | 0.53 | 0.79 | 0.47 | 140.12 | -143.75 | -44.77 |
| ANN-RPC6 (6 inputs) | 6 | 0.64 | 0.87 | 0.72 | 67.93 | -58.57 | -44.61 |
| ANN-PC23 (23 original PC inputs) | 23 | 0.60 | 0.85 | 0.66 | -18 | -81.59 | -49.83 |

Table 5. The prediction performances of the different ANN models.

From Table 5, it can be observed that the prediction performance of the ANN model using the original PCs (23 input PC scores) is not significantly different from that of the RPC models. However, as the RPC models use fewer variables and are far less complex, their advantage over the ANN-PC23 model is obvious. Comparing the MBE values, the signs for the validation phase are generally negative for both the un-rotated and rotated PC models, an indication that the predicted WQI values are consistently underestimated in both approaches.

Fig. 2. The prediction performances for different combinations of PC scores during the training, testing and validation phases: (i) 2 RPCs, (ii) 3 RPCs, (iii) 4 RPCs, (iv) 5 RPCs, (v) 6 RPCs and (vi) 23 original PCs.


This study also attempts to place a 95% confidence interval on the WQI predictions produced by the best ANN models. Figures 3, 4 and 5 show the comparison between the predicted values and the upper (UL) and lower (LL) limits of the 95% confidence interval, carried out for the ANN-RPC6 and ANN-PC23 models. For ANN-RPC6, only 4.3% of the 305 predicted values fall beyond the 95% confidence limits (1% below the LL and 3.3% above the UL). For ANN-PC23, 25% of the 305 observations fall outside the upper and lower 95% confidence limits (14% below the LL and 11.8% above the UL). This shows that by using the reduced rotated PC scores as input, better results can be obtained without losing information; ANN prediction using scores of varimax-rotated PCs thus results in a more accurate WQI prediction.
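This counting of predictions against the confidence band can be reproduced schematically with Eq. (2), Eq. (3) and the two-standard-deviation rule of Section 2.4. The observed and predicted WQI vectors below are synthetic placeholders, not the study's data.

```python
import numpy as np

def corr_coefficient(x, xhat):
    """Pearson correlation between observed and predicted values, cf. Eq. (2)."""
    n = len(x)
    num = (x * xhat).sum() - (x.sum() * xhat.sum()) / n
    den = np.sqrt(((x ** 2).sum() - x.sum() ** 2 / n) *
                  ((xhat ** 2).sum() - xhat.sum() ** 2 / n))
    return num / den

def mean_bias_error(x, xhat):
    """MBE, cf. Eq. (3): negative values indicate systematic under-prediction."""
    return (xhat - x).mean()

def confidence_band(x, xhat):
    """95% band: predictions plus/minus two standard deviations of the residuals."""
    resid = xhat - x
    half = 2.0 * resid.std(ddof=1)
    return xhat - half, xhat + half

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 100.0, 305)        # hypothetical observed WQI values
xhat = x + rng.normal(0.0, 5.0, 305)    # hypothetical ANN predictions

r = corr_coefficient(x, xhat)
mbe = mean_bias_error(x, xhat)
ll, ul = confidence_band(x, xhat)
outside = np.mean((x < ll) | (x > ul))  # fraction of observations outside the band
```

For normally distributed residuals, roughly 5% of observations are expected to fall outside such a band, which is the benchmark against which the 4.3% (ANN-RPC6) and 25% (ANN-PC23) figures are read.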

Fig. 3. Predicted WQI within the 95% confidence interval during training phase using (a) six rotated PCs, and (b) 23 original PCs.


Fig. 4. Predicted WQI within the 95% confidence interval during testing phase using (a) six rotated PCs, and (b) 23 original PCs.

Using Principal Component Scores

important information.

**5. Acknowledgment** 

and valuable advice.

Singapore.

*Modelling* 162, p.55-67.

College.

p.227-285

processes. *IEE Proceeding* 133, p.254-264

*Transactions on Geoscience and Remote Sensing*, 28, 540-551

methodology. *Journal of Hydrology* 301, p.75-92.

**6. References** 

and Artificial Neural Networks in Predicting Water Quality Index 285

accuracy (within 95% confidence limit). Moreover, the ANN model using the 23 original PCs as input, do not render the prediction more accurate, even with a complex network structure. The use of rotated PC scores based models is clearly more effective and efficient due to the elimination of collinearity and reduction of predictor variables without losing

The authors acknowledge the financial and technical support for this project provided by the Ministry of Science, Technology and Innovation and Universiti Putra Malaysia under the Science Fund Project no. 01-01-04-SF0733. The authors wish to thank,the Department of Environment, and Department of Irrigation and Drainage, Ministry of Natural Resources and Environment of Malaysia, Institute for Development and Environment (LESTARI), Universiti Kebangsaan Malaysia, Universiti Malaya Consultancy Unit (UPUM) and Chemistry Department of Universiti Malaya, who have provided us with secondary data

Abdul-Wahab, S.A., Bakheit, C.S. and Al-Alawi, S.M., 2005. Principal component and

Beck, M.B., 1986. Identification, estimation and control of biological waste-water treatment

Benediktsson, J.A., Swain, P.H., and Ersoy, O.K., 1990. Neural network approaches versus

Bishop, C. M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford

Chen, Q. and Mynett, A.E., 2003. Integration of data mining techniques and heuristic

Clarici, E., 1995. Environmental Modelling Using Neural Networks, PhD Thesis, Imperial

Department of Environment Malaysia, DOE 1997. Malaysia environmental quality reports 1999. Kuala Lumpur: Ministry of Science, Technology and Environment. Department of Environment Malaysia, DOE 1999. Malaysia environmental quality reports 1999. Kuala Lumpur: Ministry of Science, Technology and Environment. DeSilets, L., Golden, B., Wang, Q., and Kumar, R., 1992. Predicting salinity in the

multiple regression analysis in modeling of ground-level ozone and factors affecting its concentrations. *Environmental Modelling & Software* 20, p.1263-1271. Aiken, R.S., Leigh, C.H., Leinbach, T.R., and Moss, M.R., 1982. Development and

Environment in Peninsular Malaysia. McGraw-Hill International Book Company:

statistical methods in classification of multisource remote sensing data. *I.E.E.E.* 

Bowden, G.J., Dandy, G.C. and Maier, H.R., 2005. Input determination for neural network models in water resources applications. Part 1-background and

knowledge in fuzzy logic modeling of eutrophication in Taihu Lake. *Ecological* 

Chesapeake Bay using backpropagation. *Computer and Operations Research*, 19,

(a)

Fig. 5. Predicted WQI within the 95% confidence interval during validation phase using (a) six rotated PCs, and (b) 23 original PCs.

#### **4. Conclusion**

In this work, a combination of PCA and ANN is used to predict WQI based on 23 historical water quality parameters. The original predictors were selected based on the available Malaysian DOE data. To obtain the latent variables as inputs into the ANN, two different approaches were used; one based on un-rotated original PCs and the other based on varimax rotated PCs.

Using six PCs, significant loadings are observed for Cond, Sal, DS, TS, Cl, Ca, K, Mg and Na in PC1, DO, BOD and AN in PC2, SS and Tur in PC3, NO3- and PO43- in PC4, pH in PC5 and Zn in PC6. ANN models based on these 6 PC scores can predict WQI with acceptable

### **5. Acknowledgment**

284 Chemometrics in Practical Applications

LL Predicted UL

LL Predicted UL

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 **No. of observations**

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

**No of observations**

(b) Fig. 5. Predicted WQI within the 95% confidence interval during validation phase using (a)

In this work, a combination of PCA and ANN is used to predict WQI based on 23 historical water quality parameters. The original predictors were selected based on the available Malaysian DOE data. To obtain the latent variables as inputs into the ANN, two different approaches were used; one based on un-rotated original PCs and the other based on

Using six PCs, significant loadings are observed for Cond, Sal, DS, TS, Cl, Ca, K, Mg and Na in PC1, DO, BOD and AN in PC2, SS and Tur in PC3, NO3- and PO43- in PC4, pH in PC5 and Zn in PC6. ANN models based on these 6 PC scores can predict WQI with acceptable

(a)

0

**4. Conclusion** 

varimax rotated PCs.

six rotated PCs, and (b) 23 original PCs.

**WQI**

20

40

60

**WQI**

80

100

120

The authors acknowledge the financial and technical support for this project provided by the Ministry of Science, Technology and Innovation and Universiti Putra Malaysia under the Science Fund Project no. 01-01-04-SF0733. The authors wish to thank,the Department of Environment, and Department of Irrigation and Drainage, Ministry of Natural Resources and Environment of Malaysia, Institute for Development and Environment (LESTARI), Universiti Kebangsaan Malaysia, Universiti Malaya Consultancy Unit (UPUM) and Chemistry Department of Universiti Malaya, who have provided us with secondary data and valuable advice.



## **PARAFAC Analysis for Temperature-Dependent NMR Spectra of Poly(Lactic Acid) Nanocomposite**

Hideyuki Shinzawa1, Masakazu Nishida1, Toshiyuki Tanaka2, Kenzi Suzuki3 and Wataru Kanematsu1 *1Research Institute of Instrumentation Frontier, Advanced Industrial Science and Technology (AIST) 2Mikawa Textile Research Center, Aichi Industrial Technology Institute (AITEC) 3Department of Chemical Engineering, Graduate School of Engineering, Nagoya University Japan* 

#### **1. Introduction**


This chapter provides a tutorial on the fundamental concept of Parallel factor (PARAFAC) analysis and a practical example of its application. PARAFAC, which attains clarity and simplicity in sorting out convoluted information of highly complex chemical systems, is a powerful and versatile tool for the detailed analysis of multi-way data, which is a dataset represented as a multidimensional array. Its intriguing idea to condense the essence of the information present in the multi-way data into a very compact matrix representation referred to as scores and loadings has gained considerable popularity among scientists in many different areas of research activities.

The basic idea of PARAFAC is so flexible and general that its application is not limited to a particular field of spectroscopy confined to a specific electromagnetic probe. Examples of the application include fluorescence (Christensen *et al.,* 2005; Rinnan *et al.,* 2005), IR (Wu *et al.,* 2003), NMR (Bro *et al.,* 2010), UV (Ebrahimi et al., 2008; Van Benthem et al., 2011) and mass spectroscopy (Amigo *et al.,* 2008). The first part of this chapter covers the theoretical background of trilinear decomposition of three-way data by PARAFAC with comparison to bilinear decomposition of two-way data by Principal component analysis (PCA).

In the second part of this chapter, an illustrative example of PARAFAC analysis for three-way data obtained in an actual laboratory experiment is presented to show how a PARAFAC trilinear model can be constructed and analyzed to derive in-depth understanding of the system from the data. Thermal deformation of several types of poly(lactic acid) (PLA) nanocomposites undergoing glass-to-rubber transition is probed by cross-polarization magic-angle spinning (CP-MAS) NMR spectroscopy. Namely, sets of temperature-dependent NMR spectra are measured under varying clay content in the PLA nanocomposite samples. While temperature strongly affects the molecular dynamics of PLA, the clay content in the samples also influences the molecular mobility. Thus, the NMR spectra in this study become a three-way dataset described as a function of both temperature and clay content. Details of the effects of the temperature and clay content on the physical state of the nanocomposite are elucidated by using the PARAFAC trilinear model.


#### **2. PARAFAC**

#### **2.1 Multi-way data**

So, what does a multi-way data look like? It is insightful first to note the data structures of two-way and three-way data. Schematic descriptions of two-way and three-way data based on external perturbation(s) are shown in Fig. 1. In a general spectroscopic measurement, external perturbations are applied to the system of interest to induce the response to the stimuli. Characteristic response of the system is presented in the form of spectrum. For example, when the thermal behaviour of a sample is studied by a spectroscopic method, such as IR, Raman and NMR, the sample is heated up to undergo thermal deformation and its molecular level variation induced by the stimulus is captured at each spectral variable, e.g. wavenumber. The spectral dataset thus obtained will be represented as a two-way array with the *(i,j)*th element denoting the spectral intensity value at the *i*th temperature and the *j*th wavenumber.

Fig. 1. Schematic illustration of two-way and three-way data.

Now, let us consider another experiment with one more perturbation. As described above, stimulation of a single sample ends up with a two-way data array. But what if we have more samples, whose properties (e.g. concentration) are different? We then repeat a similar experiment for every single sample, generating multiple two-way data arrays. Thus, the entire dataset eventually becomes a stack of multiple two-way data arrays, like a cube, which contains two dimensions corresponding to the two applied perturbations. Such a spectral dataset is

described as a three-way array with the *(i,j,k)*th element denoting the spectral intensity value at the *i*th concentration, the *j*th temperature, and the *k*th wavenumber. For example, the samples will show the variation of their molecular structure depending on the temperature. This may be also influenced by the change in the concentration. Thus the spectral intensities of the samples are potentially influenced by the temperature as well as concentration.
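To make the "stack of two-way slices" picture concrete, the sketch below builds such a three-way array in Python/NumPy; the sizes (4 clay contents, 6 temperatures, 300 spectral points) are arbitrary assumptions chosen only for illustration:

```python
import numpy as np

n_conc, n_temp, n_wn = 4, 6, 300   # hypothetical sizes

# Each sample (one concentration level) yields one two-way array:
# temperatures x wavenumbers. Real spectra would replace the zeros.
slices = [np.zeros((n_temp, n_wn)) for _ in range(n_conc)]

# Stacking the slices gives the three-way array X, whose (i, j, k)th
# element is the intensity at concentration i, temperature j, wavenumber k.
X = np.stack(slices, axis=0)
print(X.shape)   # (4, 6, 300)
```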

#### **2.2 PARAFAC model**


It is possible to condense the essence of the information present in multi-way data into a very compact matrix representation referred to as scores and loadings. The basic hypothesis of factor analysis techniques is that an improved proxy of the original data matrix can be reconstructed from only a limited number of significant factors. Thus, while the score and loading matrices contain only a small number of factors, they effectively carry all the necessary information about the spectral features, and eventually it becomes possible to sort out the convoluted information content of highly complex chemical systems. The detailed analysis of such matrices potentially brings useful insight for building a mechanistic model for understanding complex phenomena studied by spectroscopic methods.

Principal component analysis (PCA) is a mathematical decomposition of two-way data in terms of an orthogonal set of dominant factors, i.e., eigenvectors (Smilde *et al.,* 2004; Shinzawa *et al.,* 2010). Two-way data decomposition by PCA yields two matrices, called scores and loadings, which complementarily represent the entire features broadly distributed in the two-way data as follows,

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathbf{t}} + \mathbf{E}_{\text{PCA}} \tag{1}$$

where **T** and **P** are the PCA score and loading matrices, each consisting of *r* vectors. The rank *r* corresponds to the number of principal components representing the significant portion of the information contained within the data matrix **X**. The selection of *r* is somewhat arbitrary; it is usually set to a number as small as possible but sufficiently large that no obvious spectral features are found in the residual matrix **E**PCA, which is the portion of the original data not accounted for by the first *r* principal components used for the data representation. The two matrices **T** and **P** complementarily represent the entire features broadly distributed in **X**. Namely, **T** holds abstract information concerning the relationship among the samples, and **P** contains a summary of the variables, e.g. wavenumbers, which provides a chemically or physically meaningful interpretation of the pattern observed in **T**. For example, PCA of two-way data based on temperature-dependent spectra provides **T** describing similar or dissimilar thermal behaviour of the samples during the perturbation period, and the corresponding **P** represents information on the key molecular structures associated with such similar or dissimilar thermal behaviour.
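Eq. (1) can be reproduced numerically with a truncated singular value decomposition. The Python/NumPy sketch below (random data and an arbitrary rank *r*, purely illustrative) shows how **T**, **P** and **E**PCA are obtained:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 200))   # e.g. 12 temperatures x 200 wavenumbers (hypothetical)
Xc = X - X.mean(axis=0)          # mean-center each spectral variable

# Truncated SVD: Xc ~ T P^t using the first r principal components
r = 3
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :r] * s[:r]             # scores, 12 x r
P = Vt[:r].T                     # loadings, 200 x r

E = Xc - T @ P.T                 # residual matrix E_PCA
# r is chosen so that E shows no obvious spectral features
print(np.linalg.norm(E)**2 / np.linalg.norm(Xc)**2)  # unexplained variance fraction
```

On real spectra one would inspect the residual for structured (non-noise) features while increasing `r`, mirroring the selection rule described above.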

For even more data, PARAFAC is used to decompose a multi-way data array, and Fig. 2 illustrates a graphical representation of the PARAFAC operation decomposing three-way data into score and loading vectors. PARAFAC decomposes the multi-way data into a linear combination of score and loading matrices (Smilde *et al.,* 2004; Bro, 2004). The information on the behavior induced by the perturbations is effectively described by the score vectors, and the corresponding loading vectors provide a chemically or physically meaningful interpretation of the patterns observed in the scores of the PARAFAC trilinear model. Namely, by the PARAFAC operation, the *I×J×K* array **X** can be expressed in terms of a product of score and loading matrices, **A**, **B**, and **C**, and a residual matrix **E** as follows

$$\mathbf{X}^{(I\times JK)} = \mathbf{A}(\mathbf{C}|\otimes|\mathbf{B})^{\mathbf{t}} + \mathbf{E}^{(I\times JK)} \tag{2}$$

where (*I × JK*) refers to the way that the multi-way array is unfolded. The notation $|\otimes|$ denotes the Khatri-Rao product, which operates the Kronecker product on partitioned matrices, defined as

$$\mathbf{C}|\otimes|\mathbf{B} = \begin{bmatrix} \mathbf{c}_1 \otimes \mathbf{b}_1 & \mathbf{c}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{c}_F \otimes \mathbf{b}_F \end{bmatrix} \tag{3}$$

In PARAFAC analysis, the set of matrices **A**, **B** and **C** is usually obtained by iteratively solving alternating least-squares (ALS) problems: $\min_{\mathbf{A}} \lVert \mathbf{X}^{(I\times JK)} - \mathbf{A}(\mathbf{C}|\otimes|\mathbf{B})^{\mathbf{t}} \rVert$ over **A** for fixed **B** and **C**, as well as the corresponding minimizations over **B** or **C** in a similar matrix operation manner, under appropriate model constraints such as the non-negativity of concentration and spectral intensity (Bro & de Jong, 1997; Bro & Sidiropoulos, 1998). The general procedure of PARAFAC is as follows.

Initialize **B** and **C** to obtain **Z** as

$$\mathbf{Z} = \mathbf{C}|\otimes|\mathbf{B} \tag{4}$$

**A** is given by

$$\mathbf{A} = \mathbf{X}^{(I\times JK)}\mathbf{Z}(\mathbf{Z}^{\mathbf{t}}\mathbf{Z})^{+} \tag{5}$$

where the superscript + denotes the Moore-Penrose inverse. Then update **Z** as

$$\mathbf{Z} = \mathbf{C}|\otimes|\mathbf{A} \tag{6}$$

**B** is obtained as

$$\mathbf{B} = \mathbf{X}^{(J\times IK)}\mathbf{Z}(\mathbf{Z}^{\mathbf{t}}\mathbf{Z})^{+} \tag{7}$$

Update **Z** as

$$\mathbf{Z} = \mathbf{B}|\otimes|\mathbf{A} \tag{8}$$

**C** is given by

$$\mathbf{C} = \mathbf{X}^{(K\times IJ)}\mathbf{Z}(\mathbf{Z}^{\mathbf{t}}\mathbf{Z})^{+} \tag{9}$$

If the residual between the original **X** and the **X** reconstructed by Eq. (2) is greater than the error criterion, Eqs. (4)-(9) are repeated until convergence.

Fig. 2. Schematic illustration of the PARAFAC trilinear model (**A**, **B** and **C**: score and loading matrices).
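The update cycle of Eqs. (4)-(9) translates almost line for line into Python/NumPy. The sketch below is a minimal unconstrained version (random initialization rather than an SVD-based start, and no non-negativity constraints), so it illustrates the algebra rather than a production implementation:

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker (Khatri-Rao) product C|x|B of Eq. (3)."""
    return np.column_stack([np.kron(C[:, f], B[:, f]) for f in range(C.shape[1])])

def parafac_als(X, F, n_iter=2000, tol=1e-12, seed=0):
    """Unconstrained PARAFAC of a three-way array X (I x J x K) with F factors."""
    I, J, K = X.shape
    # Unfoldings chosen to match the Khatri-Rao column ordering of Eq. (2)
    XI = X.transpose(0, 2, 1).reshape(I, K * J)   # X^(I x JK)
    XJ = X.transpose(1, 2, 0).reshape(J, K * I)   # X^(J x IK)
    XK = X.transpose(2, 1, 0).reshape(K, J * I)   # X^(K x IJ)
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(J, F))                   # random initial estimates
    C = rng.normal(size=(K, F))
    prev = np.inf
    for _ in range(n_iter):
        Z = khatri_rao(C, B)                      # Eq. (4)
        A = XI @ Z @ np.linalg.pinv(Z.T @ Z)      # Eq. (5)
        Z = khatri_rao(C, A)                      # Eq. (6)
        B = XJ @ Z @ np.linalg.pinv(Z.T @ Z)      # Eq. (7)
        Z = khatri_rao(B, A)                      # Eq. (8)
        C = XK @ Z @ np.linalg.pinv(Z.T @ Z)      # Eq. (9)
        err = np.linalg.norm(XI - A @ khatri_rao(C, B).T)
        if abs(prev - err) < tol:                 # stop when the residual stops improving
            break
        prev = err
    return A, B, C
```

Applied to a noise-free trilinear array built from known factors, this routine recovers an essentially exact fit, up to the usual scaling and permutation indeterminacy of the PARAFAC factors.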

The initial estimates for **B** and **C** are important to obtain sufficient minimization of the error criterion (Shinzawa *et al.,* 2007, 2008a & 2008b). Although the ALS algorithm usually converges to the optimal solution with a sufficiently large number of iterations, it sometimes reaches a suboptimal local minimum (Jiang *et al.,* 2003 & 2004); instead of reaching the global minimum, it may simply get stuck in a local minimum, producing an unsatisfactory solution. The major cause of such suboptimal local convergence is often a poor initial estimate. One possible remedy is to select a proper initial estimate that is less sensitive to the presence of a local minimum, e.g. via singular value decomposition (Bro & de Jong, 1997; Bro & Sidiropoulos, 1998; Wang *et al.,* 2006; Awa *et al.,* 2008).

#### **3. Example**

#### **3.1 PLA nanocomposite**

A pertinent example of PARAFAC analysis based on NMR spectra of PLA nanocomposites is provided here to show how useful information can be effectively extracted from an actual laboratory experiment.

Fig. 3 shows the molecular structure of PLA. The PLA polymer is made up of many long chains consisting of the repeat unit shown in the figure. PLA is derived from renewable resources, such as corn starch via fermentation, and it is biodegradable under the right conditions, such as the presence of oxygen (Tsuji *et al.,* 2010). Thus, PLA is a possible candidate for a new class of renewable polymers as a substitute for petrochemical polymers. However, the physical properties of PLA are inadequate for the replacement of conventional commodity plastics in many applications.

Nanocomposite formation is a technique to improve the physical strength, thermal resistance and gas-barrier properties of a polymer by the dispersion of nanoclay into it (Katti *et al.,* 2006). The improvement

PARAFAC Analysis for Temperature-Dependent

physical crosslinks created by the crystalline domain.

5

Fig. 5. Physical property of PLA samples proved by TMA.

**3.2 PALAFAC analysis of NMR spectra of PLA nomocomposites** 

10

Elongation, %

approximately 10 °C per an hour.

15

NMR Spectra of Poly(Lactic Acid) Nanocomposite 295

increases with the increase of temperature and it finally reached constant levels at the close of the observation period, indicating that the observed plastic deformation is closely related to glass-to-rubber transition of the amorphous component of PLA. It is also noted that the samples results in the different levels of elongation depending on the clay content. For example, the neat PLA sample shows 14.4 % of elongation. In contrast, the PLAnanocomposite including 15 wt % of clay ends up with 9.1 % of elongation. The leveling off of the elongation indicates the formation of a network structure due to the presence of

Although such observation effectively detects the macroscopic changes in the mechanical properties caused by the presence of clay particles, additional fundamental molecular level understanding of the reinforcement mechanism is also desired. Spectroscopic method

0 wt%

5 wt%

15 wt%

should become an important tool to probe the phenomena at the molecular level.

Tg

<sup>60</sup> <sup>80</sup> <sup>100</sup> <sup>120</sup> <sup>0</sup>

Temperature, ºC

The temperature-dependent NMR spectra of the PLA samples collected under the varying temperature from 20 to 80 °C are shown in Fig. 6. Cross polarization-magic angle spinning (CP-MAS) NMR experiments were carried out on a Varian 400 NMR system spectrometer operated at 100.56 MHz for 13C resonance with a cross polarization contact time of 2 ms (Fawcett, 1996). A zirconium oxide rotor of 4 mm diameter was used to acquire the NMR spectra at a spinning rate of 15 kHz. Each sample was packed into a 4 mm cylinder-type MAS rotor. A set of temperature-dependent NMR spectra were obtained under varying ambient temperature from 20 to 80 °C at every 20 °C step. The heating rate was

Samples of semicrystalline polymers prepared from their melt possess complex supermolecular structure consisting of crystalline lamellae embedded in an amorphous

of such polymer properties by using nanocomposite is one of the primary areas of interest due to its potential applications. The polymer nanocomposites are generally formed by the addition of a small amount of nanoclay dispersion.

Fig. 3. Molecular structure of PLA.

Fig. 4 shows a schematic illustration of a polymer nanocomposite. A typical form is the intercalated nanocomposite, in which the unit cells of the clay structure are expanded by insertion of polymer into the interlayer spacing, while the periodicity of the clay crystal structure is maintained. Most commonly, montmorillonite (MMT) is used as the clay due to its highly expansive character (Suguna Lakshmi *et al.,* 2008; Cervantes-Uc *et al.,* 2009). The MMT unit cell is composed of aluminum octahedra sandwiched between two silica tetrahedra, with a unit cell dimension of about 1 nm in thickness. To facilitate miscibility of the hydrophobic polymer with the clay and to increase the spacing of the interlayer clay gallery, the clay is often treated with organic modifiers, generally long-carbon-chain compounds with alkylammonium or alkylphosphonium cations.

Fig. 4. Schematic illustration of polymer nanocomposite.

PLA nanocomposite samples used in this study were prepared from PLA (Teramac®, Unitika) and organically modified clay (S-BEN W®, Hojun). The samples were melt-blended in a Labo-plastomill consisting of a 30C150 kneader and an R100 mixer (Toyo Seiki Seisaku-sho, Ltd., Tokyo) at 190 °C and 50 rpm for about 10 minutes. The pellets thus obtained were pressed into a 0.2 mm thick sheet, sandwiched between two thick Teflon® films, with a hot press at 190 °C.

Fig. 5 represents the effect of the nanocomposite on PLA as probed by thermomechanical analysis (TMA). TMA monitors the physical deformation of an object under a constant load while the temperature is varied. In this case, the elongation of the PLA nanocomposite samples (clay content = 0, 5 and 15 wt%) was measured under a 9.8 mN load while the temperature was varied from 35 to 140 °C at a rate of 10 °C per minute. The elongation of the samples starts when the temperature reaches the glass transition temperature (*T*g) of PLA, i.e. approximately 60 °C (Zhang et al., 2010). It then gradually increases with temperature and finally reaches a constant level at the close of the observation period, indicating that the observed plastic deformation is closely related to the glass-to-rubber transition of the amorphous component of PLA. The samples also show different levels of elongation depending on the clay content: the neat PLA sample shows 14.4 % elongation, whereas the PLA nanocomposite containing 15 wt% clay ends up with 9.1 %. The leveling off of the elongation indicates the formation of a network structure due to the presence of physical crosslinks created by the crystalline domain.

Although such observations effectively detect the macroscopic changes in mechanical properties caused by the presence of clay particles, a more fundamental molecular-level understanding of the reinforcement mechanism is also desired. Spectroscopic methods are important tools to probe these phenomena at the molecular level.

Fig. 5. Physical properties of PLA samples probed by TMA.

#### **3.2 PARAFAC analysis of NMR spectra of PLA nanocomposites**

The temperature-dependent NMR spectra of the PLA samples collected at temperatures from 20 to 80 °C are shown in Fig. 6. Cross polarization-magic angle spinning (CP-MAS) NMR experiments were carried out on a Varian 400 NMR system spectrometer operated at 100.56 MHz for the 13C resonance, with a cross polarization contact time of 2 ms (Fawcett, 1996). Each sample was packed into a 4 mm cylinder-type MAS rotor of zirconium oxide, and the spectra were acquired at a spinning rate of 15 kHz. The set of temperature-dependent NMR spectra was obtained at ambient temperatures from 20 to 80 °C in 20 °C steps, with a heating rate of approximately 10 °C per hour.
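
To illustrate how such measurements form a three-way array (temperature x chemical shift x clay content), the snippet below assembles a small simulated data set in Python. The spectra are Gaussian bands at the peak positions discussed in this section (70.5, 69.5 and 68.4 ppm); the peak widths, score values and 400-point ppm axis are illustrative assumptions, not the measured data.

```python
import numpy as np

temperatures = np.array([20, 40, 60, 80])    # °C, every 20 °C step
clay_contents = np.array([0, 5, 15])         # wt% clay
ppm = np.linspace(74.0, 66.0, 400)           # chemical-shift axis (illustrative)

def band(center, width):
    """Gaussian band on the ppm axis (stand-in for a measured peak shape)."""
    return np.exp(-0.5 * ((ppm - center) / width) ** 2)

# Pure-component spectral profiles (what the loading matrix B should recover):
amorphous = band(69.5, 0.8)                                  # broad CH peak
crystalline = band(70.5, 0.3) + band(69.5, 0.3) + band(68.4, 0.3)

# Illustrative scores: amorphous CP intensity drops on heating past Tg,
# while clay shifts material from the amorphous to the crystalline component.
temp_scores = np.array([[1.00, 1.00],
                        [0.90, 0.98],
                        [0.60, 0.95],
                        [0.40, 0.93]])       # rows: temperatures, cols: factors
clay_scores = np.array([[1.00, 0.20],
                        [0.80, 0.60],
                        [0.60, 1.00]])       # rows: clay contents

profiles = np.stack([amorphous, crystalline], axis=1)        # (400, 2)

# Trilinear three-way array X[i, j, k]: temperature i, shift j, clay content k
X = np.einsum('ir,jr,kr->ijk', temp_scores, profiles, clay_scores)
print(X.shape)   # (4, 400, 3)
```

With real measurements, each slice `X[:, :, k]` would simply be the stack of baseline-corrected spectra recorded for one clay content at the four temperatures.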

Samples of semicrystalline polymers prepared from the melt possess a complex supermolecular structure consisting of crystalline lamellae embedded in an amorphous matrix (Wunderlich, 1980). PLA undergoes a highly convoluted transition process when its temperature and constitution are altered. These transitions include the melting of ordered molecular segments, as well as the glass-to-rubber transition and other relaxation processes of the amorphous component (Zhang et al., 2005; Meaurio *et al.,* 2006).

Fig. 6. Temperature-dependent CP-MAS NMR spectra of neat PLA and PLA nanocomposite samples.

The CP-MAS technique is ideal for the observation of 13C spectra of solid samples. Since the local environment of a chemical group in a solid is generally rigid, the spectra lend themselves to considerations of crystallography or, more generally, molecular packing (Fawcett, 1996). The CP-MAS NMR study of semicrystalline PLA samples is often complicated by the presence of overlapping contributions from coexisting crystalline and amorphous components. For example, the unimodal peak observed around 69.5 ppm is assignable to the CH structure, which reflects the mobility of the PLA main chain (Tsuji *et al.,* 2010; Kister *et al.,* 1998). The peak intensity gradually decreases with increasing temperature. This may be explained by a decrease in cross polarization efficiency caused by the change in molecular dynamics during heating. Thus, the variation of the spectral intensity reflects the structural alteration of PLA induced by temperature.

More importantly, careful comparison of the samples reveals that the main features of the NMR spectra of the three samples differ. For example, the temperature-dependent NMR spectra of the PLA nanocomposite containing 15 wt% clay show three specific peaks at 70.5, 69.5 and 68.4 ppm, indicating the presence of crystalline structure in the sample (Tsuji *et al.,* 2010; Kister *et al.,* 1998). When the sample contains no clay, these crystalline peaks disappear and are replaced by a seemingly unimodal peak, probably assignable to amorphous PLA (Tsuji *et al.,* 2010; Kister *et al.,* 1998). This indicates that the presence of the clay substantially influences the supermolecular structure of the PLA. Consequently, it is very likely that the change in the spectral features of the three-way data is closely related to the temperature and clay content of the system. In turn, a fully detailed analysis of the data provides an interesting opportunity to probe the nature of the PLA nanocomposite by elucidating, through PARAFAC trilinear decomposition, the variation of the NMR spectral intensity induced by each perturbation.

Fig. 7. Score vectors in score matrix **A** representing thermal behaviours of amorphous and crystalline components in PLA samples.

Figs. 7, 8 and 9 show the results obtained from the **A**, **B** and **C** matrices, respectively, derived from PARAFAC analysis of the three-way NMR spectral data collected under varying temperature and clay content. Two major factors are indicated, reflecting the fact that there are two species present in the system. One of the important benefits of PARAFAC decomposition of multi-way data is the ability to rationally clarify the effect of the applied perturbations. The matrix **A** represents abstract information on the temperature-induced behavior of the PLA under the influence of the clay content. In contrast, the matrix **C** holds essential information on the spectral intensity variation induced by the addition of clay under the influence of temperature. The matrix **B** contains the loading vectors, which provide a chemical or physical interpretation of the patterns observed in the score matrices **A** and **C**.

It is noted that the loading vector of the first component of the matrix **B** (Fig. 8) resembles the spectral features of the amorphous component of PLA. The loading vector of the second component of the matrix **B** shows three characteristic peaks assignable to the crystalline component of the PLA. Thus, it is most likely that the second factor represents the thermal behaviour of the crystalline component in the PLA samples.

Once the assignments for the loading vectors are established, it becomes possible to provide a chemically meaningful interpretation of the score matrices **A** and **C**, which represent the dynamic behaviour of the components induced by the perturbations. For example, the score vector of the first factor in the matrix **A** represents the temperature-induced behaviour of the amorphous component of the PLA, while the score vector of the second factor represents that of the crystalline component. The score vector of the amorphous component exhibits an obvious decrease with temperature, and this decrease becomes significant when the temperature exceeds *T*g. In contrast, the change in the score value of the crystalline component is small, indicating no major variation during the heating process.

Fig. 8. Loading vectors in matrix **B** representing the amorphous and crystalline components in PLA samples.

The predominant variation of the amorphous component in this temperature region is explained by its glass-to-rubber transition. The change is associated with the micro-Brownian motion of the PLA polymer segments. At low temperature the amorphous regions of a polymer are in the glassy state: the molecules are frozen in place and, while they may vibrate slightly, they do not undergo any segmental motion. When the polymer is heated to its *T*g, the molecules start to wiggle and the material becomes rubbery. Such segmental motion predominantly occurs in the amorphous region of PLA, whereas it is strongly restricted in the systematically folded crystalline lamellae. Thus, it is very likely that the observed change of the amorphous component is related to its glass-to-rubber transition.

It is important to point out again that the predominant elongation in the TMA occurred around *T*g. This elongation behaviour agrees well with the thermal behaviour of the amorphous component observed in the score matrix **A**. It thus suggests that the physical elongation of the samples is essentially associated with the glass-to-rubber transition occurring mainly in the amorphous region.

It also becomes possible to provide a detailed interpretation of the pattern observed in the matrix **C**, which represents the clay-induced behaviour of the amorphous and crystalline components in the PLA samples. The gradual decrease of the score of the first factor can be explained as a decrease of the amorphous component, while the change in the score of the second factor corresponds to an increase in the crystalline component upon the addition of clay. It seems that the decrease in the amorphous component is compensated by the development of crystalline structure. In other words, the clay increases the frequency of spontaneous nucleation of PLA crystals.

Fig. 9. Score vectors in score matrix **C** representing clay-induced behaviours of amorphous and crystalline components in PLA samples.

The PARAFAC trilinear model of the three-way NMR data thus reveals that the crystalline and amorphous structures of the PLA nanocomposites undergo different transitions upon heating. Namely, the change in the micro-Brownian motion of the polymer segments mainly occurs in the amorphous region. In addition, the different variations of the crystalline and amorphous components suggest different effects of the clay particles on them, i.e. a nucleating effect of the clay. The decrease in the amorphous portion should result in a reduction of the structure undergoing the glass-to-rubber transition. Such variation of the crystallinity agrees well with the decreased elongation observed in the TMA: in Fig. 5, the level of the elongation starting around *T*g clearly decreases with the inclusion of clay.

This hypothesis is also clearly supported by corresponding transmission electron microscope (TEM) images and differential scanning calorimetry (DSC) results for the PLA nanocomposite samples. Fig. 10 shows TEM images of the PLA nanocomposite sample containing 15 wt% clay. In Fig. 10(a), the clay is seen to be broadly distributed over the PLA matrix. Fig. 10(b) reveals that parts of the interlayer gallery are clearly extended, suggesting insertion of the PLA polymer into the clay layers, i.e. intercalation.

Fig. 10. TEM images of PLA nanocomposite sample.

DSC curves of the PLA samples, presented in Fig. 11, clearly show a glass transition around 60 °C. It is important to point out that this glass-to-rubber transition of the amorphous component agrees well with the change in elongation observed in the TMA. More importantly, the samples also show an obvious crystallization peak around 110 °C. The crystallization peak grows gradually with the clay content, suggesting a quantitative increase in the amount of crystalline structure. Thus, it is very likely that the clay works as a nucleating agent, increasing the frequency of spontaneous nucleation of PLA crystals.

Fig. 11. DSC curves of neat PLA and PLA nanocomposite samples.

Putting all the results together provides an overall picture of the system. When the clay is dispersed in the PLA matrix, the PLA polymer located in the interlayer or around the surface of the clay develops crystalline structure more frequently. The generation of crystalline PLA is compensated by a decrease in the amorphous content. This decreases the structural portion undergoing the glass-to-rubber transition above *T*g and, in turn, restricts the elongation of the samples during heating under a certain level of load. Consequently, PARAFAC analysis of the three-way data of the PLA nanocomposite samples effectively elucidates the mechanism by which the clay improves the mechanical properties. By carrying out detailed band-position-shift analysis of the three-way data of the temperature- and clay-dependent NMR spectra of the PLA samples, it becomes possible to extract chemically meaningful information concerning the variation of the crystalline structure closely associated with the nanocomposite system.

#### **4. Conclusion**

300 Chemometrics in Practical Applications

**(a)**

**(b)**

Fig. 10. TEM images of PLA nanocomposite sample.

spontaneous nucleation of the PLA crystals.

nanocomposite system.

**1.0 μm**

**50 nm**

is very likely that the clay works as the nucleating agent to increase the frequency of the

All the results put together, it provides overall picture of the system. When the clay is dispersed in the PLA matrix, the PLA polymer located at interlayer or around surface layer of the clay develops crystalline structure more frequently. The generation of the crystalline structure of PLA is compensated by the decrease of the amorphous content. This should decrease the structural portion substantially undergoing glass-to-rubber transition above *T*g. Thus, in turn, it restricts the elongation of the samples during the heating process under a certain level of load. Consequently, it is demonstrated that PARAFAC analysis of the threeway data of the PLA nanocomposite samples effectively elucidates the mechanisms of the improvement of the mechanical property by the clay. By carrying out detailed band position shift analysis of the three way data of the temperature- and clay- dependent NMR spectra of the PLA samples, it becomes possible to extract chemically meaningful information concerning the variation of the crystalline structure closely associated with the The basic background of PARAFAC and its practical example based on the temperaturedependent NMR spectra of the PLA nanocomposite samples are presented. The central concept of PARAFA decomposition of multi-way data lies in the fact that it can condense the essence of the information present in the multi-way data into a very compact matrix representation referred to as scores and loadings. Thus, while the score and loading matrices contain only a small number of factors, it effectively carries all the necessary information about spectral features and leads to sorting out the convoluted information content of highly complex chemical systems.

The effect of the clay in the PLA nanocomposites is studied by the PARAFAC analysis of the temperature-dependent NMR spectra of several PLA nanocomposite samples with different clay contents. The PARAFAC analysis of the three-way data of the PLA nanocomposites revealed that the crystalline and amorphous structures of the PLA nanocomposites undergo substantially different transitions upon heating. Namely, the change in the micro-Brownian motion of the polymer segments mainly occurs in the amorphous region when the PLA samples are heated up to their *T*g. It also revealed the nucleating effect of the clay: the clay increases the frequency of the spontaneous nucleation of the PLA crystals. This change in the populations of the rigid crystalline and rubbery amorphous fractions, in turn, accounts for the improvement of the physical property. Consequently, it is possible to derive an in-depth understanding of the PLA nanocomposites.

#### **4. Acknowledgment**

A part of this work was financially supported by NEDO "Technological Development of Ultra-hybrid Materials" Project.



**14** 


## **Application of Chemometrics to the Interpretation of Analytical Separations Data**

James J. Harynuk, A. Paulina de la Mata and Nikolai A. Sinkov
*Department of Chemistry, University of Alberta, Canada*

#### **1. Introduction**


Interesting real-world samples are almost always present as mixtures containing the analyte(s) of interest and a matrix of components that are irrelevant to answering the analytical question at hand. Additionally, the compounds comprising the matrix are usually present in far greater abundance (both number and concentration) than the analytes of interest, making quantification or even detection of these analytes difficult if not impossible.

When tasked with these types of samples, analysts turn to some form of separations technique such as gas or liquid chromatography (GC or LC) or capillary electrophoresis (CE) so that individual components in each sample may be quantified. More recently, more complex analytical questions are being probed, for example profiling blood or urine to identify a disease state or ascertaining the geographic origin of a food/beverage sample. These tasks often go beyond the simple quantification of one or two analytes in a sample. For these and other similar questions, separations scientists are turning more often to chemometric tools as a means of visualizing and interpreting the rich data that they obtain from their separations systems.

Here we present a brief overview of separations approaches, with a focus on the data that are derived from different methods and on phenomena in the separations approach that lead to challenges in data interpretation. This is followed by a discussion of approaches that exist for the chemometric interpretation of separations data, specific challenges that arise in the chemometric treatment of these data, and solutions that have been implemented to deal with these challenges.

#### **1.1 Separations techniques**

Chromatography is widely used for the separation, purification, and analysis of mixtures. In general, analytes contained in either a gaseous or liquid mobile phase are flowed past a stationary phase which is usually confined within a column. Depending on the chemistries of the analytes and the conditions of the separation (mobile/stationary phase compositions, temperature, etc.) different compounds will partition between the two phases to varying degrees. The separation arises due to this differential partitioning, with analytes which associate weakly with the stationary phase passing through the column more quickly than those with a greater affinity for the stationary phase (Miller, 2005; Cazes, 2010).
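The differential-partitioning picture can be made concrete with the textbook retention relation t_R = t_M(1 + k), where the retention factor k is the analyte's stationary/mobile phase partition coefficient K scaled by the column's phase ratio. This relation is standard chromatographic theory rather than anything specific to this chapter, and the numerical values below are assumptions chosen only for illustration.

```python
# Illustrative only: hold-up time and phase ratio are assumed values.
t_M = 60.0          # hold-up (void) time of the column, in seconds
phase_ratio = 0.05  # Vs / Vm, volume ratio of stationary to mobile phase

def retention_time(K, t_hold=t_M, beta=phase_ratio):
    """Retention time from the partition coefficient K via t_R = t_M * (1 + k)."""
    k = K * beta                 # retention factor
    return t_hold * (1.0 + k)

# Analytes that associate more strongly with the stationary phase
# (larger K) spend more time retained and elute later.
for name, K in [("weakly retained", 2.0),
                ("moderately retained", 20.0),
                ("strongly retained", 80.0)]:
    print(f"{name}: K = {K:5.1f} -> t_R = {retention_time(K):6.1f} s")
```

An analyte with no affinity for the stationary phase (K = 0) elutes at the hold-up time t_M itself, which is the limiting case of the weak-association behaviour described above.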


There are many types of chromatography, with the most common being liquid chromatography (LC), where analytes partition between a mobile liquid phase and an immobile stationary phase, and gas chromatography (GC), where the mobile phase is a gas and the stationary phase is a solid or, more often, a viscous, liquid-like polymer. There are numerous modes for LC separations, including reverse-phase (RPLC), normal-phase (NPLC), ion (IC), size exclusion (SEC), and hydrophilic interaction (HILIC) chromatography, to name a few. From the point of view of chemometric data interpretation and the discussion in this chapter, all of these LC separations generate data which are equivalent. In any chromatographic separation, the sample is delivered to the inlet of the column while the outlet is connected to a detector, which records a continuous signal. The detector response rises and then falls to baseline based on the analyte flux passing through it, ideally generating one separate peak with an approximately Gaussian shape for each individual analyte. Assuming that the conditions for repeat analyses are not changed, the peak for a given analyte will appear at the same time in every analysis, with the peak area/height being proportional to the quantity of analyte present in a sample (Poole, 2003; Miller, 2005).

Another separations technique which is popular for some samples is capillary electrophoresis (CE). Here, an electric field applied across a fused silica capillary containing a buffer induces motion of the buffer and analytes in the sample. The CE separation depends on differential mobilities of analytes in the solution in the presence of the electric field. This difference in mobilities is based on the fact that different analytes have different charges and sizes in solution. While the separation mechanism of CE is fundamentally different from the chromatographic mechanism, the data are a series of peaks recorded as a function of time. Consequently, the same tools can be applied to data from a CE separation, and similar concerns exist for the interpretation of these data (Poole, 2003; Miller, 2005). For ease of readability, and because chemometrics are more often applied to chromatographic data than electrophoretic data, we will often refer to a chromatogram in this chapter. This could equally be an electropherogram; when considering the application of chemometric techniques to separations data, whether the origin is electrophoretic or chromatographic is largely irrelevant.

When tasked with incredibly complex samples, analysts are now turning more and more frequently to so-called comprehensive multidimensional separations (e.g. GC×GC, LC×LC, CE×CE) (Liu & Phillips, 1991; Erni & Frei, 1978; Michels et al., 2002). In these techniques, the mixture of compounds is sequentially separated by two different separation mechanisms. In the case of GC×GC, for example, a sample might be separated first on an apolar column, followed by a polar column. The exact workings of comprehensive multidimensional separations are beyond the scope of this work, and are discussed elsewhere (Górecki et al., 2004; Cortes et al., 2009; François et al., 2009; Kivilompolo et al., 2011; Li et al., 2011). However, these techniques are gaining in popularity, and are capable of separating exceedingly complex mixtures comprising thousands of individual compounds. Due to the vastly improved separation power of these techniques, the data are much more information-rich, and without some form of chemometric treatment it is essentially impossible to do more than scratch the surface of the information contained therein.

#### **1.2 Separations data**

The detector signal from a separations experiment, when plotted vs. time, yields a series of (ideally) Gaussian peaks, each representing one compound in the sample. Acquisition speed is one consideration for a chromatographic detector: it must be sufficient to faithfully record the profile of each compound as it passes through the detector. In order to obtain an accurate peak profile, the minimum number of acquisition points required across a peak is 10. Thus, the required speed of the detector is intrinsically linked to the nature of the separation. In separations where the base widths of the peaks are on the order of 5 s, a data rate of 2 Hz would be acceptable, but when peak widths are 100-200 ms, as in GC×GC, detector rates on the order of 50-100 Hz are required for quantitative analysis.
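The 10-points-per-peak rule of thumb translates directly into a minimum acquisition rate; the function name below is just illustrative:

```python
def min_data_rate_hz(peak_base_width_s, points_per_peak=10):
    """Minimum detector acquisition rate (Hz) needed to record the desired
    number of points across a peak of the given base width (seconds)."""
    return points_per_peak / peak_base_width_s

print(min_data_rate_hz(5.0))   # 5 s peaks: a 2 Hz data rate suffices
print(min_data_rate_hz(0.2))   # 200 ms peaks (GC×GC): ~50 Hz needed
print(min_data_rate_hz(0.1))   # 100 ms peaks (GC×GC): ~100 Hz needed
```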

From a point of view of chemometric analysis of separations data, another important consideration is whether the detector is univariate or multivariate. Univariate detectors, such as the flame ionisation detector, or single-wavelength UV-visible spectrometer, record only one variable as a function of time, generating data which take the form of a vector of instrument response. Other detectors, typically mass spectrometers and multi-channel spectroscopic instruments, can be operated such that they record a multivariate response. Data from these instruments comprise an array of signal responses with each row representing a time when a response was recorded, and each column representing a variable that was recorded (e.g.: detector wavelength, ion mass-to-charge ratio). To the chemometrician, it is immediately obvious that there are numerous advantages to collecting multivariate chromatographic data; however, it is worth noting that most of this advantage has been by and large ignored by chromatographers. Typically, only the profile of a single variable vs. time would be used to selectively quantify an analyte, or the detector response across all channels at a given time used to help identify a peak.
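The two conventional ways of slicing such a time × channel array can be sketched as follows; the dimensions mirror the GC-MS example in the text, while the variable names and the random numbers standing in for detector responses are purely illustrative:

```python
import numpy as np

# Hypothetical multivariate detector data: rows are scan times, columns are
# m/z channels (9000 scans x 271 channels, m/z 30-300 inclusive).
rng = np.random.default_rng(0)
data = rng.random((9000, 271))
mz_axis = np.arange(30, 301)

# "Profile of a single variable vs. time": an extracted-ion chromatogram,
# one column of the array, used to selectively quantify an analyte.
eic_57 = data[:, np.where(mz_axis == 57)[0][0]]

# "Detector response across all channels at a given time": one mass
# spectrum, one row of the array, used to help identify a peak.
spectrum_at_scan_100 = data[100, :]

# Summing every channel at each time point gives the total ion chromatogram.
tic = data.sum(axis=1)
```

A chemometric treatment, by contrast, can operate on the full two-way array per sample rather than on one row or one column at a time, which is precisely the advantage noted above.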

One other aspect of raw separations data is the sheer number of variables measured for each sample. When a univariate detector is used for a 15 min separation, operating with an acquisition speed of 10 Hz, the data will be a vector of 9000 individual measurements per sample. If a multivariate detector is employed instead, for example a mass spectrometer operating over a 30-300 m/z mass range, this number increases to 2 439 000 individual variables arranged in a 9000 × 271 array per sample! In the case of GC×GC-MS analyses, which are typically 60 min in length but have a high-speed MS collecting data at rates of ~100 Hz, there are on the order of 100 million data points collected for each sample.
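The variable counts quoted above follow directly from the acquisition parameters:

```python
# Univariate detector: 15 min separation at 10 Hz.
scans = 15 * 60 * 10            # 9000 measurements per sample
# Multivariate detector: MS scanning m/z 30-300 inclusive.
channels = 300 - 30 + 1         # 271 channels
points_per_sample = scans * channels   # 2 439 000 values in a 9000 x 271 array

# GC×GC-MS: 60 min separation with a high-speed MS at ~100 Hz.
gcgc_points = 60 * 60 * 100 * channels  # ~9.8e7, on the order of 100 million
```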

#### **2. Challenges with chromatographic data**

Variations in analytical separations data are, in principle, no different from those derived from any other instrument; being based on both chemical and non-chemical aspects of the analysis. All relevant information will be contained within the chemical variations and any chemometric approach to interpreting chromatographic data must be capable of identifying relevant chemical variation while minimizing the effects of irrelevant chemical and non-chemical variations. Sources of irrelevant chemical variation include matrix peaks, here defined as any chemical source of signal introduced with the sample, but having no bearing on the conclusions drawn from the data. Additionally, there is background signal which can for example derive from changes in mobile phase concentration which influence detector signals in LC or chemical "bleed" signatures from stationary phases as they degrade in GC. Non-chemical variations include, for example, baseline drift (for non-chemical reasons), retention time shifts (due to minor fluctuations in operating conditions), and electronic noise. These may easily interfere with the relevant chemical information, degrading model performance and the validity of results (de la Mata-Espinosa et al., 2011a). Figure 1 presents

an overlay of several LC chromatograms of similar samples exemplifying the challenges of baseline drift and retention time shifts. One of the major challenges in handling chromatographic data using chemometric tools is appropriate pre-processing to remove as many non-chemical and irrelevant chemical variations as possible from the data set.

Fig. 1. LC chromatograms of edible oils showing a high degree of variation in baseline.

Initial efforts into the application of statistical and chemometric tools to chromatographic data were accomplished using data that were processed to provide a list of detected, integrated peak areas or heights (or the calibrated concentrations for known compounds). However, the trend in recent years has turned towards the direct chemometric interpretation of raw chromatographic signals (Watson et al., 2006; Johnson & Synovec, 2002). The reason for this trend is that many errors can occur during integration of raw signals (Asher et al., 2009; de la Mata-Espinosa et al., 2011b). By applying chemometric tools directly to the raw data, many of these errors can be avoided. Of course, when working with the raw data, other issues become more important, most notably retention time shifts and the population of available variables.

#### **2.1 Baseline and noise**

Baseline variations, such as noise and drift, are due to small changes in experimental conditions, for example changes in detector response due to the mobile phase gradient in LC separations or increased levels of stationary phase bleed at higher temperatures in temperature-programmed GC. Other sources of noise and drift could include changes in detector response as its components age, contamination of solvents or gases, and of course electronic noise (which is minimal in modern chromatographic systems).

Chemometric approaches to handling chromatographic data should incorporate baseline correction of some form. When raw chromatographic data are processed, the method of baseline correction and its importance are generally obvious to the analyst. In the case where integrated peak tables are used, this is often done automatically by the chromatographic software with little consideration by the analyst, even though the manner in which the baseline is calculated will significantly influence the determination of peak areas/heights.

#### **2.2 Retention time shifts**

In all separations, retention times of peaks can easily shift by a few seconds from one analysis to the next. This is not much of an issue with simple samples having only a few peaks which are then integrated prior to chemometric analysis. However, retention times of peaks are used for identifying the compounds. With complex separations, unstable retention times may result in unreliable peak identification, making comparisons from one run to the next impossible. When comparing raw data this is even more important, as one must ensure that the peak for a given component is always registered in the exact same position in the data matrix so that the algorithms will recognize the signals correctly.

The causes of retention time shifts depend on the separations technique being used. In GC, peaks may shift due to degradation of the stationary phase, decreasing retention times over time; build-up of heavy matrix components which foul the column, effectively changing the chemistry of the stationary phase; minor gas leaks which alter the flow rate; or even matrix effects on the evaporation rate in the injector, affecting the rate of mass transfer to the column. In LC, peak shifts may be due to small fluctuations in mobile phase chemistry from one run to the next; temperature fluctuations which in turn affect solvent viscosity and solute diffusion coefficients, altering the kinetics as well as the thermodynamics of the separation; or degradation/fouling of the stationary phase of the column. CE is the technique most prone to drastic shifts in migration time, due to the instability of the electroosmotic flow in the capillary (Figure 2). Electroosmotic flow depends on the applied voltage and the buffer concentration and composition, and is incredibly sensitive to the surface chemistry of the capillary. The act of analyzing a sample by CE will often have a minor, possibly irreversible effect on the capillary surface, resulting in a change in the migration time of an analyte.

Shifts in retention times are minimized by proper instrument maintenance, precise control of instrumental conditions, or by using approaches such as retention time locking in GC to account for variations in instrument performance (Etxebarria et al., 2009; Mommers et al., 2011) and relative retention times in CE. Even with these approaches, some retention time shifting will occur and require more advanced alignment techniques for correction prior to chemometric analysis.

#### **2.3 Incomplete separation**

Another challenge with the interpretation of chromatographic data is incomplete separation of peaks. If two or more compounds have similar retention characteristics under a given set of separation conditions, they will not be completely resolved, as evidenced by the peak clusters in Figure 1. In these cases, apportioning the signal between the different compounds

Chemometric approaches to handling chromatographic data should incorporate baseline correction of some form. When raw chromatographic data are processed, the method of baseline correction and its importance are generally obvious to the analyst. In the case where integrated peak tables are used, this is often done automatically by the chromatographic software with little consideration by the analyst, even though the manner in which the baseline is calculated will significantly influence the determination of peak areas/heights.

#### **2.2 Retention time shifts**


In all separations, retention times of peaks can easily shift by a few seconds from one analysis to the next. This is not much of an issue with simple samples having only a few peaks which are then integrated prior to chemometric analysis. However, retention times of peaks are used for identifying the compounds. With complex separations, unstable retention times may result in unreliable peak identification, making comparisons from one run to the next impossible. When comparing raw data this is even more important as one must ensure that the peak for a given component is always registered in the exact same position in the data matrix so that the algorithms will recognize the signals correctly.

The causes of retention time shifts depend on the separations technique being used. In GC, peaks may shift due to degradation of the stationary phase, decreasing retention times over time; build-up of heavy matrix components which foul the column, effectively changing the chemistry of the stationary phase; minor gas leaks which alter the flow rate; or even matrix effects on the evaporation rate in the injector, affecting the rate of mass transfer to the column. In LC, peak shifts may be due to small fluctuations in mobile phase chemistry from one run to the next; temperature fluctuations which in turn affect solvent viscosity and solute diffusion coefficients, altering the kinetics as well as the thermodynamics of the separation; or degradation / fouling of the stationary phase of the column. CE is the technique most prone to drastic shifts in migration time, due to the instability of the electroosmotic flow in the capillary (Figure 2). Electroosmotic flow depends on the applied voltage, the buffer concentration and composition, and is incredibly sensitive to the surface chemistry of the capillary. The act of analyzing a sample by CE will often have a minor, possibly irreversible effect on the capillary surface, resulting in a change in the migration time of an analyte.

Shifts in retention times are minimized by proper instrument maintenance, precise control of instrumental conditions or by using approaches such as retention time locking in GC to account for variations in instrument performance (Etxebarria et al., 2009; Mommers et al., 2011) and relative retention times in CE. Even with these approaches, some retention time shifting will occur and require more advanced alignment techniques for correction prior to chemometric analysis.

Fig. 2. CE of substituted benzenes showing extreme misalignment.

#### **2.3 Incomplete separation**

Another challenge with the interpretation of chromatographic data is incomplete separation of peaks. If two or more compounds have similar retention characteristics under a given set of separation conditions, they will not be completely resolved, as evidenced by the peak clusters in Figure 1. In these cases, apportioning the signal between the different compounds becomes a challenge, especially for univariate signals. The general approach used for these cases is one of deconvolution: decomposing the analytical signal to determine the contribution of each coeluting compound, or to determine the contribution of the compound of interest, disregarding the remaining data.

Application of Chemometrics to the Interpretation of Analytical Separations Data 311

#### **2.4 Data overload**

As shown in Section 1.2, raw chromatographic signals present an overabundance of data to the analyst. This poses several challenges. From a practical point of view, attempts to construct a chemometric model using the entirety of the data set could easily exceed the capabilities of the computer system being used. More fundamentally, if the raw data are considered, the number of variables measured for each sample will vastly outnumber the number of samples available in the data set. Such underdetermined systems, with far more variables than samples, can defeat many chemometric techniques due, for example, to collinear variables. Finally, for most chromatograms, especially multidimensional ones, only a small fraction of the data points actually contain meaningful signal. Most of the signal is due to background noise or irrelevant matrix components. Consequently, the raw data must be reduced in size prior to chemometric analysis. This is typically achieved via a feature selection approach, as discussed in Section 3.3.3.
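The data-reduction idea above can be illustrated with a simple variance-based filter. This is an illustrative sketch, not a method from the chapter: it assumes the chromatograms have already been aligned into a samples-by-variables matrix, and it simply discards variables whose variance across samples falls below a user-chosen threshold (near-constant variables carry mostly baseline and noise).

```python
from statistics import pvariance

def select_informative_variables(X, threshold):
    """X: samples-by-variables data matrix (list of lists).

    Returns the indices of variables whose population variance across
    samples exceeds `threshold` -- a crude proxy for 'contains signal'.
    The threshold is a hypothetical tuning parameter the analyst sets.
    """
    n_vars = len(X[0])
    keep = []
    for j in range(n_vars):
        column = [row[j] for row in X]
        if pvariance(column) > threshold:
            keep.append(j)
    return keep
```

In practice such a filter would be one early step in a pipeline; the surviving columns would then feed the feature selection discussed in Section 3.3.3.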

#### **3. Pre-processing steps for chromatographic data**

#### **3.1 Baseline correction**

The aim of baseline correction is to separate the analyte signal of interest from signal which arises due to changes in mobile phase composition or stationary phase bleed and signal due to electronic noise. Several baseline correction methods have been proposed in the literature, with the two most common approaches being fitting a curve to the data and subtracting it from the signal, and modeling the baseline to exclude it using factor models (Amigo et al., 2010).

Curve fitting is the classical approach used in virtually all commercial software packages provided by vendors of separations equipment. The algorithms used in this approach fit a polynomial function across segments of the chromatogram, using regions where no analyte peaks elute to determine the coefficients of the polynomial and then interpolating the background signal for regions where peaks are eluting. The functions are usually first-order polynomials; however, higher-order polynomials or a series of connected first-order polynomials are also used in some situations. Having determined the equation of the background signal, the fitted line is then subtracted from the signal (Brereton, 2003; Gan et al., 2006; Kaczmarek et al., 2005; Zhang et al., 2010; Persson & Strang, 2003; Eilers, 2003). Correction of the baseline using curve fitting is demonstrated in Figure 3.

Fig. 3. An LC chromatogram before (blue) and after (red) baseline correction.
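The first-order curve-fitting scheme described above can be sketched in a few lines. This is a minimal illustration, not any vendor's algorithm: it assumes the analyst supplies the peak-free windows by hand (commercial packages detect them automatically), fits a single straight line through those windows by least squares, and subtracts it from the whole trace.

```python
import math

def fit_linear_baseline(t, y, baseline_regions):
    """Fit a first-order polynomial through analyst-supplied
    peak-free windows and evaluate it at every retention time."""
    # Gather only the points that fall inside the peak-free windows.
    pts = [(ti, yi) for ti, yi in zip(t, y)
           if any(lo <= ti <= hi for lo, hi in baseline_regions)]
    n = len(pts)
    sx = sum(p[0] for p in pts)
    sy = sum(p[1] for p in pts)
    sxx = sum(p[0] ** 2 for p in pts)
    sxy = sum(p[0] * p[1] for p in pts)
    # Closed-form least-squares solution for slope and intercept.
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return [intercept + slope * ti for ti in t]

def subtract_baseline(t, y, baseline_regions):
    base = fit_linear_baseline(t, y, baseline_regions)
    return [yi - bi for yi, bi in zip(y, base)]
```

Higher-order or piecewise fits, as mentioned above, follow the same pattern with more coefficients per segment.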

The approach of using models such as parallel factor analysis (PARAFAC) for background correction is analogous to the use of these approaches for deconvoluting coeluting peaks. As these models are more often used for this purpose than for simple background correction, they will be discussed in more detail in Section 3.3. These approaches often rely on having a multivariate signal and are applied to the chromatogram or, more typically, small selected regions where a single analyte elutes. The result of applying these deconvolution techniques for background correction is essentially the deconvolution of a single analyte peak, with the background noise making up the error matrix (Amigo et al., 2010). These approaches are generally more powerful and likely result in better quality analytical data, but they are not widely used in separation science. The reason for this is likely historical, as these tools have only recently become available to the separation sciences, while the classical curve fitting approach is well established, works with univariate detectors, and performs well in most practical situations.

#### **3.2 Alignment of separations data**

The retention times of analytes in separations fluctuate from one analytical run to the next and, in order for chemometric techniques to be applied to separations data, these fluctuations must be corrected during pre-processing. This ensures that the signal from each analyte in each analysis is correctly registered within the data matrix to be processed. There are essentially two approaches to this problem: integrated peak tables, or mathematical warping and alignment of the raw signal.

#### **3.2.1 Peak tables**

Integrated peak tables are the simplest way to ensure that analytical separations data are properly aligned for chemometric processing. In order to use this approach, one must be able to reliably assign a unique identifier to each peak in each sample of the data set, and ensure that the same compound is identified with the same identifier in each sample. It should be noted that while the compound name is an obvious identifier, a series of labels such as *Unknown x*, where *x* is a numerical identifier, would also be acceptable in the event that compound names are unknown, so long as compounds are matched correctly. Rather than identifying peaks by retention time, one could use relative retention times or retention indices in order to adjust for slight variations in the retention times of peaks. Algorithms for aligning peak tables exist and perform well, so long as some peaks can be easily and reliably matched across all chromatograms (Lavine et al., 2001).
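Once each peak carries a consistent identifier, assembling the aligned table is mechanical. The sketch below is hypothetical (the chapter does not prescribe a data structure): each sample is a mapping from peak identifier to area, and peaks absent from a sample are filled with zero.

```python
def build_peak_table(samples):
    """samples: {sample_name: {peak_id: area}}.

    Returns (peak_ids, table): a sorted list of all peak identifiers
    seen in any sample, and per-sample rows of areas in that order.
    Peaks missing from a sample are recorded as 0.0.
    """
    peak_ids = sorted({pid for peaks in samples.values() for pid in peaks})
    table = {name: [peaks.get(pid, 0.0) for pid in peak_ids]
             for name, peaks in samples.items()}
    return peak_ids, table
```

The zero-fill choice is itself a pre-processing decision; missing-value imputation schemes are an alternative when a zero area would bias later modeling.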

The challenges with this approach stem from its reliance on integrated peak tables. Thus, any integration errors due to poorly-resolved peaks or peaks that are missed due to falling outside of integration parameters in the software will impact any subsequent analysis.

#### **3.2.2 Raw signal alignment**

Alignment of raw chromatographic signals prior to chemometric processing is more complex than the alignment of peak tables. In addition to the three more popular algorithms that will be presented below, there are several others that have been developed (Yao et al., 2007; Toppo et al., 2008; Eilers, 2004; Van Nederkassel et al., 2006). In deciding which approach to use, one of the first questions to be answered is if the analysis is to be qualitative or quantitative. This is because some alignment methods can distort peaks, affecting their quantification. Some of the more common algorithms include correlation optimized warping (COW) (Nielsen et al., 1998; Tomasi et al., 2004), correlation optimized shifting (coshift) (Van den Berg, 2005), and a piecewise peak-matching algorithm (Johnson et al., 2003).

In instances where there are non-systematic peak shifts, COW is a popular algorithm. COW relies on stretching or compressing segments of a sample signal such that the correlation coefficient between it and a reference signal is maximized for each interval. Care must be taken with the selection of the input parameters to avoid significant changes in peak shapes, as this approach to the warping of the chromatogram has been shown to affect peak areas, leading to poor quantitative conclusions (Nielsen et al., 1998; Tomasi et al., 2004).

A fast and simple alignment algorithm is coshift. This algorithm is useful when data only require a single left-right shift in retention time. The entire data matrix is shifted in one direction or the other by a set amount, maximizing the correlation between a target and the data matrix that required alignment. The single shifting value for the entire data matrix is a weakness, especially for chromatographic data where peaks can shift in different directions and to different extents in a single file. To handle this, an algorithm termed icoshift (interval-correlation-shifting) has been derived from coshift. Icoshift aligns each data matrix to a target by maximizing the cross-correlation between the sample and the target within a series of user-defined intervals (Savorani et al., 2010). The use of multiple intervals permits the alignment of separations data where shifts of different magnitudes and directions occur. These alignment algorithms have been used successfully for both one-dimensional data (de la Mata-Espinosa, 2011a; Liang, 2010; Laursen, 2010) and two-dimensional data, with some modifications (Zhang, 2008). It is important to note that the shifting of chromatograms using coshift or icoshift does not lead to distortions of peak shape, and consequently does not introduce errors into quantitative results.
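A minimal illustration of the coshift idea follows. This is not the published coshift implementation — it is a sketch that tries every whole-point shift up to a user-chosen limit, scores each candidate with a raw dot product as a stand-in for correlation, and zero-fills the vacated points; icoshift would apply the same search independently within each user-defined interval.

```python
def coshift(target, sample, max_shift):
    """Shift `sample` left or right by up to `max_shift` points so as
    to maximize its overlap with `target`.

    Returns (best_shift, aligned_sample).  Scoring uses the dot
    product as a simple correlation proxy; vacated points are
    zero-filled so the vector length is preserved.
    """
    def shifted(x, k):
        n = len(x)
        if k >= 0:
            return [0.0] * k + x[:n - k]
        return x[-k:] + [0.0] * (-k)

    best_shift, best_score = 0, float("-inf")
    for k in range(-max_shift, max_shift + 1):
        s = shifted(sample, k)
        score = sum(a * b for a, b in zip(target, s))
        if score > best_score:
            best_shift, best_score = k, score
    return best_shift, shifted(sample, best_shift)
```

Because the whole trace moves rigidly, peak shapes are untouched — the property noted above that makes coshift and icoshift safe for quantitative work.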

The piecewise peak matching approach (Johnson et al., 2003) provides another avenue for chromatographic alignment. In this approach, peaks are identified in a target signal to which all other signals will be aligned. The algorithm then identifies peaks within the sample signals located within predetermined windows of the peaks in the target. Peaks within windows are deemed to come from the same compound, and matched. The chromatograms are aligned by stretching or compressing the regions between peak apexes. A variant of this algorithm can be used when MS data are available. In this case, the mass spectrum at the apex of each peak in the target signal is compared to the mass spectrum of each peak within a set window on the sample signal and peaks are matched if their spectra have a high enough match quality (Watson et al., 2006). A general scheme for peak alignment using this approach is described in Figure 4. Depending on the number and relative positions of the peaks in chromatograms matched using this approach, peak shapes may be altered, possibly affecting quantitative results.
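The stretching step of the piecewise approach can be sketched as follows. This is a simplified, hypothetical version: it assumes the apexes have already been matched (by window or by spectral similarity, as described above), that the apex lists include the first and last index of the trace, and it resamples each inter-apex region by linear interpolation.

```python
def piecewise_align(signal, sample_apexes, target_apexes):
    """Warp `signal` so that matched peak apexes land at the target
    positions.

    sample_apexes / target_apexes: matched apex indices, sorted and
    including the first and last index of the signal (an assumption
    of this sketch).  Regions between apexes are stretched or
    compressed by linear resampling.
    """
    def interp(x, xp, fp):
        # Linear interpolation of the points (xp, fp) at position x.
        for i in range(len(xp) - 1):
            if xp[i] <= x <= xp[i + 1]:
                if xp[i + 1] == xp[i]:
                    return float(fp[i])
                w = (x - xp[i]) / (xp[i + 1] - xp[i])
                return fp[i] * (1.0 - w) + fp[i + 1] * w
        return float(fp[-1])

    grid = list(range(len(signal)))
    aligned = []
    for i in grid:
        # Map each output index back to a (fractional) input index,
        # then sample the original signal there.
        src = interp(i, target_apexes, sample_apexes)
        aligned.append(interp(src, grid, signal))
    return aligned
```

The resampling is exactly where peak shapes can be altered: a stretched region spreads a peak's points apart, which is why the text cautions that quantitative results may be affected.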

One of the biggest challenges for all alignment algorithms is that they depend on the data to be aligned being reasonably similar in terms of both matrix and analyte peaks. In some instances this will not be the case. In our laboratory, we have observed this when analyzing arson debris where the matrix and analytes form an incredibly complex and variable chromatogram from one sample to the next. A similar situation can be easily imagined when processing samples of biological origin. One solution to this issue is to add markers to every sample prior to the separation step in the analysis. These markers should be easily identifiable within the samples, even under conditions where they coelute with matrix components; should occur in multiple, evenly distributed locations along the chromatogram, and should not occur natively in the samples. One choice is a series of deuterated compounds which, with MS detection, are trivial to identify even in a complex mixture (Sinkov et al., 2011b). One additional benefit is that these compounds can act as internal standards if quantitative results are desired.


Fig. 5. Deconvolution of overlapping peaks. The black, solid trace represents the analytical signal observed at the detector, which is the sum of the four peaks represented by dashed

Multivariate curve resolution is widely applicable to separations data and is one of the most common approaches (Franch-Lage et al., 2011; Marini et al., 2011, de la Mata-Espinosa et al., 2011a). The aim of this technique is to determine the number of components present in a sample and the contribution of each component to the sample. In performing MCR, the concentration and response profiles for each analyte are obtained, providing a qualitative and semi-quantitative overview of the components in an unresolved mixture without *a priori*

When multivariate detectors are used for separations, the additional dimension of information can be exploited to aid in deconvolution. MCR and EFA can also be used with multivariate data. In the case of MCR, the experimental matrix is decomposed into a matrix of concentration vs. time profiles (deconvoluted peaks) and pure spectral profiles of each compound. Knowledge of the number of components contributing to the signal in the region being deconvoluted is useful to guide the process and improve the results (de Juan &

Parallel factor analysis (PARAFAC) (Harshman, 1970; Bro, 1997; Amigo et al., 2010) is a technique that is ideally suited for interpreting multivariate separations data. PARAFAC is a decomposition model for multivariate data which provides three matrices, **A**, **B** and **C** which contain the scores and loadings for each component. The residuals, **E**, and the number of factors, *r*, are also extracted. The PARAFAC decomposition finds the best

lines.

knowledge of the mixture composition.

**3.3.2 Deconvolution of multivariate signals** 

Tauler, 2006), though strictly speaking it is not required.

Fig. 4. Flowchart for target-based chromatographic alignment, adapted from (Johnson et al., 2003).

#### **3.3 Deconvolution of overlapping peaks**

The central issue in deconvolution is depicted in Figure 5. The instrument response is represented as a black solid line which is the sum of the four dashed, coloured peaks. Ideally, the four signals should be individually quantified. This is a common problem for analytical separations, even those of relatively simple mixtures. Some of these issues may be solved by changing the experimental conditions or using characteristic features (wavelengths or ions) of the coeluting analytes and a multivariate detector to selectively detect and quantify them. However, in many cases this is insufficient and more advanced techniques must be used. The strategies used for deconvolution depend heavily on whether the detector signal is univariate or multivariate.

#### **3.3.1 Deconvolution of univariate signals**

In the case of univariate signals, one is typically limited to using univariate curve-fitting analyses where a number of Gaussian or modified Gaussian curves are determined such that the sum of these curves fits the experimentally observed cluster of peaks (Felinger, 1994). In these approaches, only a small window of chromatographic data (one peak cluster) should be processed at a time, and constraints such as fixed peak widths, shapes, unimodality, and non-negativity are often required to ensure the validity of the solution.

To solve a univariate deconvolution problem, approaches such as evolving factor analysis (EFA) (Maeder, 1987) or multivariate curve resolution (MCR) (Tauler & Barceló, 1993), among others (Vivó-Truyols et al., 2002; Sarkar et al., 1998; Kong et al., 2005), can be used. When these approaches are used with univariate data, the variables to be solved for are the number, positions, and abundances of each of the peaks that make up the signal.


Fig. 5. Deconvolution of overlapping peaks. The black, solid trace represents the analytical signal observed at the detector, which is the sum of the four peaks represented by dashed lines.

Multivariate curve resolution is widely applicable to separations data and is one of the most common approaches (Franch-Lage et al., 2011; Marini et al., 2011, de la Mata-Espinosa et al., 2011a). The aim of this technique is to determine the number of components present in a sample and the contribution of each component to the sample. In performing MCR, the concentration and response profiles for each analyte are obtained, providing a qualitative and semi-quantitative overview of the components in an unresolved mixture without *a priori* knowledge of the mixture composition.

#### **3.3.2 Deconvolution of multivariate signals**

When multivariate detectors are used for separations, the additional dimension of information can be exploited to aid in deconvolution. MCR and EFA can also be used with multivariate data. In the case of MCR, the experimental matrix is decomposed into a matrix of concentration vs. time profiles (deconvoluted peaks) and pure spectral profiles of each compound. Knowledge of the number of components contributing to the signal in the region being deconvoluted is useful to guide the process and improve the results (de Juan & Tauler, 2006), though strictly speaking it is not required.
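The alternating least squares at the heart of MCR can be conveyed in a few lines (a minimal sketch of the general idea, not a reproduction of any cited implementation): the data matrix **D** (elution time × detector channel) is decomposed as **D** ≈ **CS**ᵀ, with non-negativity enforced on both the concentration profiles **C** and the spectra **S** via non-negative least squares:

```python
import numpy as np
from scipy.optimize import nnls

def mcr_als(D, n_components, n_iter=50, seed=0):
    """Minimal MCR-ALS: D (times x channels) ~= C @ S.T, alternating
    non-negative least squares updates of C and S."""
    rng = np.random.default_rng(seed)
    S = rng.random((D.shape[1], n_components))   # random initial spectra
    for _ in range(n_iter):
        C = np.array([nnls(S, row)[0] for row in D])
        S = np.array([nnls(C, col)[0] for col in D.T])
    return C, S

# Two co-eluting components with distinct three-channel "spectra"
t = np.linspace(0, 1, 60)
C_true = np.stack([np.exp(-0.5 * ((t - 0.40) / 0.08) ** 2),
                   np.exp(-0.5 * ((t - 0.55) / 0.08) ** 2)], axis=1)
S_true = np.array([[1.0, 0.2, 0.0],
                   [0.0, 0.3, 1.0]]).T
D = C_true @ S_true.T
C_hat, S_hat = mcr_als(D, 2)
rel_residual = np.linalg.norm(D - C_hat @ S_hat.T) / np.linalg.norm(D)
```

A practical implementation would add convergence checks, profile normalization, and a purest-variable initialization of **S** rather than the random start used here.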

Application of Chemometrics to the Interpretation of Analytical Separations Data 317

Parallel factor analysis (PARAFAC) (Harshman, 1970; Bro, 1997; Amigo et al., 2010) is a technique that is ideally suited for interpreting multivariate separations data. PARAFAC is a decomposition model for multivariate data which provides three matrices, **A**, **B** and **C**, which contain the scores and loadings for each component. The residuals, **E**, and the number of factors, *r*, are also extracted. The PARAFAC decomposition finds the best trilinear model that minimizes the sum of squares of the residuals in the model through a procedure of alternating least squares.

The biggest advantage of using PARAFAC over other models is the uniqueness of the solution; PARAFAC is less flexible and uses fewer degrees of freedom, being a more restricted model. However, its unique solution reflects actual pure analyte profiles in both the time dimension and the spectral dimension. Thus, the results of PARAFAC analysis on a cluster of overlapping multivariate peaks provide both qualitative and quantitative data where the deconvoluted signals appear as analyte peaks. One restriction to the use of PARAFAC is that the data must be trilinear (Bro, 1997; Amigo et al., 2010). In the case of chromatographic techniques with a multivariate detector, the dimensions are retention time, detector signal, and samples. In the case of comprehensive multidimensional separations, such as GC×GC, PARAFAC considers retention in the two dimensions and the samples as the three dimensions.
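The alternating least squares procedure behind PARAFAC can be sketched as follows (a toy version under our own naming; published implementations add line search, convergence diagnostics, and constraints):

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Khatri-Rao product: column r equals kron(U[:, r], V[:, r])."""
    r = U.shape[1]
    return np.einsum('kr,jr->kjr', U, V).reshape(-1, r)

def unfold(X, mode):
    """Mode-n unfolding, matched to the Khatri-Rao column ordering above."""
    order = {0: (0, 2, 1), 1: (1, 2, 0), 2: (2, 1, 0)}[mode]
    return X.transpose(order).reshape(X.shape[mode], -1)

def parafac(X, r, n_iter=200, seed=0):
    """Trilinear decomposition X[i,j,k] ~= sum_r A[i,r] B[j,r] C[k,r],
    fitted by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, r)) for n in X.shape)
    for _ in range(n_iter):
        A = unfold(X, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = unfold(X, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = unfold(X, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# Rank-2 synthetic "retention time x spectrum x sample" array
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((20, 2)), rng.random((15, 2)), rng.random((8, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = parafac(X, 2)
rel_err = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
```

Because the synthetic array is exactly trilinear, the ALS iterations recover it essentially perfectly; on real data the residual array **E** absorbs noise and any deviations from trilinearity.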

#### **3.4 Feature selection**

High data acquisition rates combined with the length of time required for many separations results in a large number of data points collected for a given separation (see Section 1.2). In many situations, most of the data are collected when no analytes are eluting from the system, and represent background signal when only mobile phase is reaching the detector. In the case of spectroscopic and especially mass spectral detectors, at a given point in time, many of the recorded data in this dimension will not contain useful information, even when an analyte of interest is eluting. Furthermore, many components in the mixture can be completely irrelevant to analysis (Johnson & Synovec, 2002; Sinkov & Harynuk, 2011a). Consequently, only a small portion of separations data is potentially useful. It is also well known that any model will be heavily influenced by the specific variables that are included in its construction (Kjeldahl & Bro, 2010).

The inclusion of irrelevant data is detrimental to the model because the mathematics attempt to account for variations observed in these irrelevant variables. Consequently, the model is forced to model noise, resulting in a decrease in its predictive ability. Worse yet, the model could fit the data well and provide a seemingly useful prediction, until cross-validation shows otherwise. Finally, the inclusion of extraneous variables increases the demands on the computer system being employed, making model construction slower, or in some cases outright impossible. Thus, prior reduction of separations data to a manageable size is crucial. Figure 6 depicts situations where either too few or too many variables were used to model a system.

One common way to achieve data reduction is to use a table of integrated peaks instead of raw chromatographic data. This has the advantage of reducing the number of variables to those compounds included in the peak list, removing baseline noise and, if the analyst knows which exact peaks to use, removing signal from irrelevant compounds. Problems with this approach include the restriction to identified compounds, which may or may not include all of the information required for modeling, and integration errors that skew results. Finally, even with an error-free comprehensive peak table, the analyst must still perform feature selection, since many peaks will undoubtedly be irrelevant to the analysis.
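A rudimentary peak table can be generated from a raw trace with standard peak-picking tools; the following toy illustration (simulated, noise-free data; the height threshold and integration window are arbitrary choices of ours) detects apexes and integrates a fixed window around each:

```python
import numpy as np
from scipy.signal import find_peaks

# Simulated chromatogram: three baseline-resolved Gaussian peaks
t = np.linspace(0.0, 20.0, 2000)
sigma = 0.15
chrom = sum(a * np.exp(-0.5 * ((t - c) / sigma) ** 2)
            for a, c in [(1.0, 5.0), (0.6, 9.0), (0.3, 14.0)])

# Peak picking above an (arbitrary) height threshold
locs, _ = find_peaks(chrom, height=0.1)

# Integrate a fixed window around each apex to build the peak table
dt = t[1] - t[0]
peak_table = [(t[p], chrom[np.abs(t - t[p]) < 6 * sigma].sum() * dt)
              for p in locs]   # (retention time, area) pairs
```

Real data would require baseline correction, noise-aware thresholds, and proper handling of overlapping peaks before such a table could be trusted, which is exactly where the integration errors discussed above creep in.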


In the case of multivariate detection, it can be advantageous to monitor only one or a few channels (wavelengths, ions, etc.) as this will selectively detect only a portion of the analytes, allowing the analyst to avoid many interfering species while greatly reducing the size of the data. However, in these cases the analyst must know exactly what signals to use and runs the risk of missing important features of the data encoded in the channels that were ignored. Further, using this approach destroys much of the multivariate advantage that can be realized through using these more complex (and expensive) detection strategies.

Objective feature selection techniques generally have two steps: variable ranking, and variable selection. Objective variable ranking techniques such as analysis of variance (ANOVA) (Johnson & Synovec, 2002), the discriminating variable test (DIVA) (Rajalahti et al., 2009a, 2009b), and informative vectors (Teofilo et al., 2009) have the distinct advantage that variables are ranked based on a mathematically calculable "perceived utility" and not on subjective analyst perception. In essence, the data are given the chance to inform the user of what is relevant and what is likely noise, providing an approach that can be generalized to any set of analytical data.

ANOVA is an effective method when the goal is to discriminate between classes of samples. ANOVA calculates the F ratio for each variable: the ratio of between-class variance to within-class variance. If the F ratio for a given variable is high, it is deemed to be more valuable for describing the difference between classes. Once the F ratio has been calculated for every data point in the chromatogram, the variables can be ranked in order of decreasing F ratio. A chemometric model is then constructed using a fraction of variables having the highest F ratio. One significant advantage of ANOVA is that the algorithm can be written with memory conservation in mind and thus is easily applied to data sets with very large numbers of samples and variables (hundreds or thousands of samples, each containing millions of variables). Consequently, it can be easily applied to a set of GC-MS chromatograms across the entire chromatogram, something that is difficult for other feature ranking approaches.
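The F-ratio ranking itself takes only a few lines (our sketch of the general idea, not the memory-optimized algorithm of Johnson & Synovec, 2002):

```python
import numpy as np

def f_ratios(X, y):
    """F ratio (between-class / within-class variance) for each variable.
    X: samples x variables; y: integer class labels."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        between += len(Xk) * (Xk.mean(axis=0) - grand_mean) ** 2
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    between /= len(classes) - 1
    within /= len(y) - len(classes)
    return between / within

# Two classes of samples; only variable 0 truly differs between them
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = np.repeat([0, 1], 20)
X[y == 1, 0] += 3.0
ranking = np.argsort(f_ratios(X, y))[::-1]   # variable indices, best first
```

Because the statistic is computed one variable at a time, the class means and sums of squares can be accumulated in a single pass over the chromatograms, which is what makes the approach tractable for millions of variables.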

DIVA is a feature ranking technique that aids feature selection prior to chemometric analysis (Rajalahti et al., 2009a, 2009b). This approach involves the creation of a PLS-DA model using all candidate variables. Projecting this PLS-DA model onto a new single LV yields what is termed a target projected (TP) model (Rajalahti et al., 2009a). From this, the ratio of explained variance to residual variance for each variable in the TP model provides its selectivity ratio, upon which variables are ranked (Rajalahti et al., 2009a, 2009b; Kvalheim, 1990; Kvalheim & Karstang, 1989). DIVA produces a ranking that is slightly different than that produced by ANOVA, though a direct comparison on chromatographic data has not yet been performed to our knowledge.

Once variables have been ranked, those to be included in the model must be selected. This is generally achieved by constructing a model using a forward-selection or backward-elimination approach, in an attempt to maximize some metric of model quality. Model quality can be assessed based on several metrics, such as mean correct classification rates (Rajalahti et al., 2009b) or the degree of separation between classes of samples in principal component (or latent variable) space, for example using either a Euclidean distance-based metric (Pierce et al., 2005) or a metric that accounts for the size and shape of clusters (Sinkov & Harynuk, 2011a).
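A minimal forward-selection loop over a ranked variable list might look as follows (a nearest-centroid classifier and leave-one-out accuracy stand in for the quality metrics cited above; all names are ours):

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        centroids = {k: X[keep & (y == k)].mean(axis=0) for k in np.unique(y)}
        pred = min(centroids, key=lambda k: np.linalg.norm(X[i] - centroids[k]))
        hits += int(pred == y[i])
    return hits / len(y)

def forward_select(X, y, ranking, max_vars=10):
    """Grow the model one ranked variable at a time; keep the subset size
    giving the best cross-validated accuracy."""
    best_n, best_acc = 1, -1.0
    for n in range(1, min(max_vars, len(ranking)) + 1):
        acc = loo_accuracy(X[:, ranking[:n]], y)
        if acc > best_acc:
            best_n, best_acc = n, acc
    return ranking[:best_n], best_acc

# Synthetic example: variables 0 and 1 carry the class difference
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
y = np.repeat([0, 1], 15)
X[y == 1, :2] += 2.5
ranking = np.argsort(np.abs(X[y == 0].mean(0) - X[y == 1].mean(0)))[::-1]
selected, acc = forward_select(X, y, ranking)
```

The stopping criterion here (keep the best size seen) is deliberately naive; in practice one would also guard against overfitting with an independent optimization set, as discussed below.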


The one exception to the rank-and-select approach is genetic algorithms (Yoshida et al., 2001), though due to the sheer number of variables present in a typical separation, these are not often used on raw separations data, as arriving at the optimal number and combination of variables is computationally inefficient and uncertain.
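The flavor of such a search can be conveyed by a toy genetic algorithm (entirely our own construction; real applications would use more sophisticated fitness functions and external validation):

```python
import numpy as np

def ga_select(X, y, n_pop=30, n_gen=40, p_mut=0.05, seed=0):
    """Toy genetic algorithm for variable selection. Individuals are binary
    masks over variables; fitness is nearest-centroid training accuracy
    minus a small penalty on the number of variables retained."""
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]

    def fitness(mask):
        if not mask.any():
            return -1.0
        Xs = X[:, mask]
        c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
        pred = (np.linalg.norm(Xs - c1, axis=1)
                < np.linalg.norm(Xs - c0, axis=1)).astype(int)
        return (pred == y).mean() - 0.01 * mask.sum() / n_var

    pop = rng.random((n_pop, n_var)) < 0.5
    for _ in range(n_gen):
        scores = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(scores)[::-1][:n_pop // 2]]  # truncation selection
        children = []
        for _ in range(n_pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_var)                      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_var) < p_mut                # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m) for m in pop])
    best = int(np.argmax(scores))
    return pop[best], scores[best]

# Small synthetic problem: variables 0 and 1 separate the classes
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 8))
y = np.repeat([0, 1], 15)
X[y == 1, :2] += 2.5
mask, score = ga_select(X, y)
```

Even this tiny example evaluates the fitness function over a thousand times; scaling the search space from eight variables to the millions found in raw separations data illustrates why GAs are rarely applied there directly.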

Sometimes, several feature selection methods are used for a given analysis. For example, an analyst might reduce a chromatogram to a peak table, selecting a series of candidate variables of interest, and then perform further variable ranking and optimization on the integrated peak table, especially in the case of multidimensional separations where hundreds, if not thousands, of compounds can be resolved (Felkel et al., 2010).

Finally, cross-validation is extremely important, especially when processing raw separations data and using a feature ranking approach such as ANOVA. As discussed previously, raw separations data contain on the order of 10⁵ to 10⁶ data points for each sample. In these cases of overdetermined systems it is entirely possible that some combinations of variables containing only noise will, by random chance, indicate a difference between samples. When handling raw separations data, a good approach to avoid this problem is to break the data set into three separate sets: a training set to construct the model, an optimization set to optimize data processing parameters (such as alignment and feature selection), and finally a test set to determine if the optimized model has any meaning (Brereton, 2007). Of course, this does require that one collect data for a large number of samples so that a representative population of samples exists for each of the three subsets of data.
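The three-way split described above is straightforward to implement; the fractions used here are arbitrary and should reflect the available sample population:

```python
import numpy as np

def three_way_split(n_samples, fractions=(0.5, 0.25), seed=0):
    """Randomly partition sample indices into training, optimization,
    and test sets; the test set receives the remainder."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_opt = int(fractions[1] * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_opt],
            idx[n_train + n_opt:])

train_idx, opt_idx, test_idx = three_way_split(100)
```

The essential point is that the test set is touched exactly once, after all alignment, feature-selection, and modeling parameters have been frozen on the training and optimization sets.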

Fig. 6. Models constructed from the same data set using different numbers of top-ranked variables. (A) Too few variables; (B) Too many variables; (C) Optimal number of variables.

### **4. Applications and examples**


The one exception to the rank-and-select approach are genetic algorithms (Yoshida et al., 2001), though due to the sheer number of variables present in a typical separation, these are not often used on the raw separations data as arriving at the optimal number and

Sometimes, several feature selection methods are used for a given analysis. For example, an analyst might reduce chromatogram to a peak table, selecting a series of candidate variables of interest and then perform further variable ranking and optimization on the integrated peak table, especially in the case of multidimensional separations where hundreds, if not

Finally, cross-validation is extremely important, especially when processing raw separations data and using a feature ranking approach such as ANOVA. As discussed previously, raw separations data contain on the order of 105 to 106 data points for each sample. In these cases of overdetermined systems it is entirely possible that some combinations of variables containing only noise will, by random chance, indicate a difference between samples. When handling raw separations data, a good approach to avoid this problem is to break the data set into three separate sets: a training set to construct the model, an optimization set to optimize data processing parameters (such as alignment and feature selection), and finally a test set to determine if the optimized model has any meaning (Brereton, 2007). Of course this does require that one collect data for a large number of samples so that a representative population

Fig. 6. Models constructed from the same data set using different numbers of top-ranked variables. (A) Too few variables; (B) Too many variables; (C) Optimal number of variables.

combination of variables is computationally inefficient and uncertain.

thousands of compounds can be resolved (Felkel et al., 2010).

of samples exists for each of the three subsets of data.

After applying the appropriate pre-processing, different chemometric techniques can be applied according to the aim of the study. Pattern recognition is one of the chemometric methods most used in analytical chemistry, and this holds for separations data as well. Pattern recognition can generally be divided into exploratory data analysis and unsupervised and supervised pattern recognition (Otto, 2007; Brereton, 2007).

Exploratory data analysis aims to extract important information, detect outliers, and identify relationships between samples; its use is recommended prior to the application of other chemometric techniques. Examples of exploratory data analysis tools applied to separations data include principal component analysis (PCA) (de la Mata-Espinosa et al., 2011a; Ruiz-Samblas et al., 2011) and factor analysis (Stanimirova et al., 2011).

Unsupervised pattern recognition techniques uncover patterns within a data set without *a priori* class assignment of samples. Here, the objective is to find patterns that allow grouping of similar samples using, for example, cluster analysis, which has been applied to separations data by Reid et al. (2007). When supervised pattern recognition is used, the classes of samples in a training set are known and used to calibrate a model, which is then used to predict class assignments of unknown samples. Examples include linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA) (de la Mata-Espinosa et al., 2011b; Zorzetti et al., 2011; Sinkov et al., 2011b). In a study performed by Sinkov et al., two alignment techniques for chromatographic data were compared. The data comprised raw GC-MS chromatograms of simulated arson debris, where some samples contained different types of gasoline, weathered to different extents, spiked into debris samples that themselves exhibited a high degree of variability in their chemical composition. The goal was to build a PLS-DA model that could correctly classify debris samples based on whether or not they contained gasoline (Figure 7). As can be seen, the alignment algorithm used has a direct impact on the quality of the predictions. In Figure 7A, there are multiple false positives, false negatives, and ambiguous samples. In Figure 7B, all samples are classified correctly and there are no ambiguous samples.

Another example of applying chemometrics to separations data is depicted in Figures 8 and 9. Here, interval PLS (iPLS) was applied to blends of oils in order to quantify the relative concentration of olive oil in the samples (de la Mata-Espinosa et al., 2011b). iPLS divides the data into a number of intervals and then calculates a PLS model for each interval. In this example, the two peak segments which presented the lowest root mean square error of cross-validation (RMSECV) were used for building the final PLS model.
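The iPLS idea can be sketched as follows: a small PLS1 (NIPALS) regression is fitted to each variable interval in turn, and the interval with the lowest leave-one-out RMSECV is retained (a simplified, single-interval version of the idea; all function names are ours):

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """Regression coefficients of a PLS1 (NIPALS) model on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)
        t = Xc @ w
        tt = t @ t
        p = Xc.T @ t / tt
        q.append(t @ yc / tt)
        Xc = Xc - np.outer(t, p)       # deflate X
        yc = yc - t * q[-1]            # deflate y
        W.append(w)
        P.append(p)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def ipls_best_interval(X, y, n_intervals, n_comp=2):
    """Leave-one-out RMSECV of a PLS1 model per interval; lowest wins."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1).astype(int)
    rmsecv = []
    for a, b in zip(edges[:-1], edges[1:]):
        Xi, errs = X[:, a:b], []
        for i in range(len(y)):
            m = np.arange(len(y)) != i
            coef = pls1_coef(Xi[m], y[m], n_comp)
            pred = y[m].mean() + (Xi[i] - Xi[m].mean(axis=0)) @ coef
            errs.append((pred - y[i]) ** 2)
        rmsecv.append(float(np.sqrt(np.mean(errs))))
    return int(np.argmin(rmsecv)), rmsecv

# Synthetic data: only variables 10-19 (interval 1 of 4) predict y
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 40))
y = X[:, 12:18].sum(axis=1) + rng.normal(0, 0.1, 25)
best, rmsecv = ipls_best_interval(X, y, n_intervals=4)
```

In the published application, the best-performing segments are combined into a final PLS model; this sketch stops at ranking the intervals by RMSECV.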

As mentioned in Section 3.3.2, PARAFAC is a chemometric tool for multidimensional data treatment. The scores and loadings obtained with PARAFAC can be used in two-way models for data exploration and quantitative analysis (Vosough et al., 2010). When small deviations in trilinearity exist within the data, usually due to relatively small shifts in retention time in the case of separations data, a modified version of PARAFAC called PARAFAC2 is recommended for use (Bro et al., 1999).

Like PARAFAC, PARAFAC2 decomposes raw data into loading and score matrices but without the imposition of trilinearity as in PARAFAC. Even without this constraint, the PARAFAC2 model preserves the property of uniqueness that is so advantageous with PARAFAC. Thus, analyte profiles and concentrations can be estimated by PARAFAC2 even if chromatographic alignment is not perfect (Amigo et al., 2008; Skov et al., 2009).

Application of Chemometrics to the Interpretation of Analytical Separations Data 321

Fig. 7. PLS-DA models for identifying gasoline in simulated arson debris derived from the same raw data, but aligned with different techniques. (A) Feature-based alignment; (B) Deuterated alkane ladder-based alignment. All other treatment and model construction algorithms were the same in both cases. Hollow markers indicate data in the training set while filled markers indicate data in the validation set. Circles represent debris containing gasoline while triangles represent gasoline-free debris. Reprinted from Sinkov et al., 2011b, with permission.

Fig. 8. Feature selection using iPLS. Segments in green showed lower RMSECV and were thus used to construct the final model. Reprinted from de la Mata-Espinosa et al., 2011b, with permission.

Fig. 9. Predicted vs. actual % olive oil using PLS model constructed based on results in Figure 8. Reprinted from de la Mata-Espinosa et al., 2011b, with permission.

#### **5. Conclusions**

The analyst must choose from a plethora of methods for processing separations data, a potentially daunting task. It is our hope that this review will help chromatographers entertaining thoughts of applying chemometrics to their data understand what they must consider when choosing how to prepare their data. Likewise, it is hoped that we have informed chemometricians of some of the specific challenges associated with the processing of chromatographic data and the origins of those limitations. In the development of a chemometric model for the interpretation of separations data, there are numerous opportunities for missteps that will exclude key information from the model and/or generate meaningless results. However, when due care is taken there are also many opportunities to apply chemometric techniques to transform the rich data generated by these powerful analytical tools into valuable information effectively and efficiently.

#### **6. References**

Amigo, J.M.; Skov, T.; Bro, R.; Coello, J. & Maspoch, S. (2008). Solving GC-MS problems with PARAFAC2. *Trends in Analytical Chemistry,* Vol.27, No.8, (September 2008), pp. 714-725, ISSN 0165-9936
Amigo, J.M.; Skov, T. & Bro, R. (2010). ChroMATHography: solving chromatographic issues with mathematical models and intuitive graphics. *Chemical Reviews*, Vol.110, No.8, (May 2010), pp. 4582-4605, ISSN 1520-6890
Asher, B.J.; D'Agostino, L.A.; Way, J.D.; Wong, C.S. & Harynuk, J.J. (2009). Comparison of peak integration methods for the determination of enantiomeric fraction in environmental samples. *Chemosphere*, Vol.75, No.8, (May 2009), pp. 1042-1048, ISSN 0045-6535
Brereton, R.G. (2003). *Chemometrics: Data Analysis for the Laboratory and Chemical Plant,* Wiley, ISBN 0-471-48977-8, UK
Brereton, R.G. (2007). *Applied Chemometrics for Scientists*, John Wiley & Sons Inc., ISBN 978-0-470-01686-2, Toronto, Canada
Bro, R. (1997). PARAFAC. Tutorial and applications. *Chemometrics and Intelligent Laboratory Systems,* Vol.38, No.2, (October 1997), pp. 149-171, ISSN 0169-7439
Bro, R.; Andersson, C.A. & Kiers, H.A.L. (1999). PARAFAC2-Part II. Modeling chromatographic data with retention time shifts. *Journal of Chemometrics,* Vol.13, No.3-4, (May-August 1999), pp. 295-309, ISSN 0886-9383
Cazes, J. (2010). *Encyclopaedia of Chromatography*, (3rd ed.), CRC Press, ISBN 1-4200-8483, Florida, USA
Cortes, H.J.; Winniford, B.; Luong, J. & Pursch, M. (2009). Comprehensive two dimensional gas chromatography review. *Journal of Separation Science,* Vol.32, No.5-6, (March 2009), pp. 883-904, ISSN 1615-9306
de Juan, A. & Tauler, R. (2006). Multivariate Curve Resolution (MCR) from 2000: Progress in concepts and applications. *Critical Reviews in Analytical Chemistry,* Vol.36, No.3-4, (2006), pp. 163-176, ISSN 1040-8347
de la Mata-Espinosa, P.; Bosque-Sendra, J.M.; Bro, R. & Cuadros-Rodríguez, L. (2011a). Discriminating olive and non-olive oils using HPLC-CAD and chemometrics. *Analytical and Bioanalytical Chemistry,* Vol.399, No.6, (February 2011), pp. 2083-2092, ISSN 1618-2650
de la Mata-Espinosa, P.; Bosque-Sendra, J.M.; Bro, R. & Cuadros-Rodríguez, L. (2011b). Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. *Talanta,* Vol.85, No.1, (July 2011), pp. 183-196, ISSN 0039-9140
Eilers, P.H.C. (2003). A Perfect Smoother. *Analytical Chemistry,* Vol.75, No.14, (July 2003), pp. 3631-3636, ISSN 0003-2700
Eilers, P.H.C. (2004). Parametric Time Warping. *Analytical Chemistry,* Vol.76, No.2, (January 2004), pp. 404-411, ISSN 0003-2700
Erni, F. & Frei, R.W. (1978). 2-Dimensional column liquid-chromatographic technique for resolution of complex mixtures. *Journal of Chromatography,* Vol.149, (February 1978), pp. 561-569, ISSN 0021-9673
Etxebarria, N.; Zuloaga, O.; Olivares, M.; Bartolomé, L.J. & Navarro, P. (2009). Retention-time locked methods in gas chromatography. *Journal of Chromatography A*, Vol.1216, No.10, (March 2009), pp. 1624-1629, ISSN 0021-9673
Felinger, A. (1994). Deconvolution of overlapping skewed peaks. *Analytical Chemistry,* Vol.66, No.19, (October 1994), pp. 3066-3072, ISSN 0003-2700
Felkel, Y.; Dorr, N.; Glatz, F. & Varmuza, K. (2010). Determination of the total acid number (TAN) of used gas engine oils by IR and chemometrics applying a combined strategy for variable selection. *Chemometrics and Intelligent Laboratory Systems,* Vol.101, No.1, (March 2010), pp. 14-22, ISSN 0169-7439
Franch-Lage, F.; Amigo, J.M.; Skibsted, E.; Maspoch, S. & Coello, J. (2011). Fast assessment of the surface distribution of API and excipients in tablets using NIR-hyperspectral imaging. *International Journal of Pharmaceutics,* Vol.411, No.1-2, (June 2011), pp. 27-35, ISSN 0378-5173
François, I.; Sandra, K. & Sandra, P. (2009). Comprehensive liquid chromatography: Fundamental aspects and practical considerations—A review. *Analytica Chimica Acta,* Vol.641, No.1-2, (May 2009), pp. 14-31, ISSN 0003-2670
Gan, F.; Ruan, G. & Mo, J. (2006). Baseline correction by improved iterative polynomial fitting with automatic threshold. *Chemometrics and Intelligent Laboratory Systems,* Vol.82, No.1, (May 2006), pp. 59-65, ISSN 0169-7439
Górecki, T.; Harynuk, J. & Panić, O. (2004). The evolution of comprehensive two-dimensional gas chromatography. *Journal of Separation Science,* Vol.27, (2004), pp. 359-379, ISSN 1615-9306
Harshman, R.A. (1970). Foundations of the PARAFAC procedure: models and conditions for an 'exploratory' multimodal factor analysis. *UCLA Working Papers in Phonetics,* Vol.16, (1970), pp. 1-84
Johnson, K.J. & Synovec, R.E. (2002). Pattern recognition of jet fuels: comprehensive GC×GC with ANOVA-based feature selection and principal component analysis. *Chemometrics and Intelligent Laboratory Systems*, Vol.60, No.1-2, (January 2002), pp. 225-237, ISSN 0169-7439
Johnson, K.J.; Wright, B.W.; Jarman, K.H. & Synovec, R.E. (2003). High-speed peak matching algorithm for retention time alignment of gas chromatographic data for chemometric analysis. *Journal of Chromatography A*, Vol.996, No.1-2, (May 2003), pp. 141-155, ISSN 0021-9673
Kaczmarek, K.; Walczak, B.; de Jong, S. & Vandeginste, B.G.M. (2005). Baseline reduction in two dimensional gel electrophoresis images. *Acta Chromatographica,* Vol.15, (2005), pp. 82-96, ISSN 1233-2356
Kivilompolo, M.; Pol, J. & Hyotylainen, T. (2011). Comprehensive two-dimensional liquid chromatography (LC×LC): A review. *LC GC Europe*, Vol.24, No.5, (May 2011), pp. 232+, ISSN 1471-6577
Kjeldahl, K. & Bro, R. (2010). Some common misunderstandings in chemometrics. *Journal of Chemometrics*, Vol.24, No.7-8, (July-August 2010), pp. 558-564, ISSN 0886-9383
Kong, K.; Ye, F.; Guo, L.; Tian, J. & Xu, G. (2005). Deconvolution of overlapped peaks based on the exponentially modified Gaussian model in comprehensive two-dimensional gas chromatography. *Journal of Chromatography A,* Vol.1086, No.1-2, (September 2005), pp. 160-164, ISSN 0021-9673
Kvalheim, O.M. & Karstang, T.V. (1989). Interpretation of latent-variable regression models. *Chemometrics and Intelligent Laboratory Systems*, Vol.7, No.1-2, (December 1989), pp. 39-51, ISSN 0169-7439
Kvalheim, O.M. (1990). Latent-variable regression models with higher-order terms: An extension of response modelling by orthogonal design and multiple linear regression. *Chemometrics and Intelligent Laboratory Systems*, Vol.8, No.1, (May 1990), pp. 59-67, ISSN 0169-7439
Laursen, K.; Frederiksen, S.S.; Leuenhagen, C. & Bro, R. (2010). Chemometric quality control of chromatographic purity. *Journal of Chromatography A,* Vol.1217, No.42, (October 2010), pp. 6503-6510, ISSN 0021-9673
Lavine, B.K.; Brzozowski, D.; Moores, A.J.; Davidson, C.E. & Mayfield, H.T. (2001). Genetic algorithm for fuel spill identification. *Analytica Chimica Acta*, Vol.437, No.2, (June 2001), pp. 233-246, ISSN 0003-2670
Li, Y.H.; Wojcik, R. & Dovichi, N.J. (2011). A replaceable microreactor for on-line protein digestion in a two dimensional capillary electrophoresis system with tandem mass spectrometry detection. *Journal of Chromatography A*, Vol.1218, No.15, (April 2011), pp. 2007-2011, ISSN 0021-9673
Liang, Y.; Xie, P. & Chau, F. (2010). Chromatographic fingerprinting and related chemometric techniques for quality control of traditional Chinese medicines. *Journal of Separation Science,* Vol.33, No.3, (February 2010), pp. 410-421, ISSN 1615-9314
Liu, Z. & Phillips, J.B. (1991). Comprehensive 2-dimensional gas-chromatography using a modulator interface. *Journal of Chromatographic Science*, Vol.29, No.6, (June 1991), pp. 227-231, ISSN 0021-9665
Maeder, M. (1987). Evolving factor analysis for the resolution of overlapping chromatographic peaks. *Analytical Chemistry,* Vol.59, No.3, (February 1987), pp. 527-530, ISSN 0003-2700
Marini, F.; D'Aloise, A.; Bucci, R.; Buiarelli, F.; Magri, A.L. & Magri, D. (2011). Fast analysis of 4 phenolic acids in olive oil by HPLC-DAD and chemometrics. *Chemometrics and Intelligent Laboratory Systems,* Vol.106, No.1, (March 2011), pp. 142-149, ISSN 0169-7439
Michels, D.A.; Hu, S.; Schoenherr, R.M.; Eggertson, M.J. & Dovichi, N.J. (2002). Fully automated two-dimensional capillary electrophoresis for high sensitivity protein analysis. *Molecular & Cellular Proteomics,* Vol.1, No.1, (January 2002), pp. 69-74, ISSN 1535-9476
Miller, J.M. (2005). *Chromatography: concepts and contrasts*, (2nd ed.), Wiley, ISBN 0471472077, Hoboken, USA
Mommers, J.; Knooren, J.; Mengerink, Y.; Wilbers, A.; Vreuls, R. & van der Wal, S. (2011). Retention time locking procedure for comprehensive two-dimensional gas chromatography. *Journal of Chromatography A*, Vol.1218, No.21, (May 2011), pp. 3159-3165, ISSN 0021-9673
Nielsen, N-P.; Carstensen, J.M. & Smedsgaard, J. (1998). Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. *Journal of Chromatography A,* Vol.805, No.1-2, (May 1998), pp. 17-35, ISSN 0021-9673
Otto, M. (2007). *Chemometrics*, Wiley-VCH, ISBN 978-3-527-31418-8, Weinheim, Germany
Persson, P.O. & Strang, G. (2003). Smoothing by Savitzky-Golay and Legendre filters, In: *Mathematical Systems Theory in Biology, Communications, Computation and Finance,* Rosenthal, J. & Gilliam, D.S. (Eds.), pp. 301-315, IMA Vol. Math. Appl., 134, Springer, ISBN 978-0387-40319-9, New York, USA
Pierce, K.M.; Hope, J.L.; Johnson, K.J.; Wright, B.W. & Synovec, R.E. (2005). Classification of gasoline data obtained by gas chromatography using a piecewise alignment algorithm combined with feature selection and principal component analysis. *Journal of Chromatography A*, Vol.1096, No.1-2, (November 2005), pp. 101-110, ISSN 0021-9673
Poole, C.F. (2003). *The Essence of Chromatography*, (1st ed.), Elsevier, ISBN 0444501983, Amsterdam, The Netherlands
Rajalahti, T.; Arneberg, R.; Berven, F.S.; Myhr, K.M.; Ulvik, R.J. & Kvalheim, O.M. (2009a). Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. *Chemometrics and Intelligent Laboratory Systems*, Vol.95, No.1, (January 2009), pp. 35-48, ISSN 0169-7439
Rajalahti, T.; Arneberg, R.; Kroksveen, A.C.; Berle, M.; Myhr, K.M. & Kvalheim, O.M. (2009b). Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. *Analytical Chemistry*, Vol.81, No.7, (April 2009), pp. 2581-2590, ISSN 0003-2700
Reid, R.G.; Durham, D.G.; Boyle, S.P.; Low, A.S. & Wangboonskul, J. (2007). Differentiation of opium and poppy straw using capillary electrophoresis and pattern recognition techniques. *Analytica Chimica Acta*, Vol.605, No.1, (December 2007), pp. 20-27, ISSN 0003-2670
Ruiz-Samblas, C.; Cuadros-Rodriguez, L.; Gonzalez-Casado, A.; Rodriguez Garcia, F.D.P.; de la Mata-Espinosa, P. & Bosque-Sendra, J.M. (2011). Multivariate analysis of HT/GC-(IT)MS chromatographic profiles of triacylglycerols for classification of olive oil varieties. *Analytical and Bioanalytical Chemistry*, Vol.399, No.6, (February 2011), pp. 2093-2103, ISSN 1618-2642
Sarkar, S.; Dutta, P.K. & Roy, N.C. (1998). A blind-deconvolution approach for chromatographic and spectroscopic peak restoration. *IEEE Transactions on Instrumentation and Measurement,* Vol.47, No.4, (August 1998), pp. 941-947, ISSN 0018-9456
Savorani, F.; Tomasi, G. & Engelsen, S.B. (2010). icoshift: A versatile tool for the rapid alignment of 1D NMR spectra. *Journal of Magnetic Resonance,* Vol.202, No.2, (February 2010), pp. 190-202, ISSN 1090-7807
Sinkov, N.A. & Harynuk, J.J. (2011a). Cluster resolution: A metric for automated, objective and optimized feature selection in chemometric modeling. *Talanta*, Vol.83, No.4, (January 2011), pp. 1079-1087, ISSN 0039-9140
Sinkov, N.A.; Johnston, B.M.; Sandercock, P.M.L. & Harynuk, J.J. (2011b). Automated optimization and construction of chemometric models based on highly variable raw chromatographic data. *Analytica Chimica Acta*, Vol.697, No.1-2, (July 2011), pp. 8-15, ISSN 1873-4324
Skov, T.; Hoggard, J.C.; Bro, R. & Synovec, R.E. (2009). Handling within run retention time shifts in two-dimensional chromatography data using shift correction and modeling. *Journal of Chromatography A,* Vol.1216, No.18, (May 2009), pp. 4020-4029, ISSN 0021-9673
Stanimirova, I.; Boucon, C. & Walczak, B. (2011). Relating gas chromatographic profiles to sensory measurements describing the end products of the Maillard reaction. *Talanta,* Vol.83, No.4, (January 2011), pp. 1239-1246, ISSN 0039-9140
Tauler, R. & Barceló, D. (1993). Multivariate curve resolution applied to liquid chromatography-diode array detection. *Trends in Analytical Chemistry,* Vol.12, No.8, (1993), pp. 319-327, ISSN 0165-9936
Teofilo, R.F.; Martins, J.P.A. & Ferreira, M.M.C. (2009). Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression. *Journal of Chemometrics*, Vol.23, No.1-2, (January-February 2009), pp. 32-48, ISSN 0886-9383
Tomasi, G.; Van den Berg, F. & Andersson, C. (2004). Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. *Journal of Chemometrics,* Vol.18, No.5, (May 2004), pp. 231-241, ISSN 0886-9383
Toppo, S.; Roveri, A.; Vitale, M.P.; Zaccarin, M.; Serain, E.; Apostolidis, E.; Gion, M.; Maiorino, M. & Ursini, F. (2008). MPA: A multiple peak alignment algorithm to perform multiple comparisons of liquid-phase proteomic profiles. *Proteomics,* Vol.8, No.2, (January 2008), pp. 250-253, ISSN 1615-9861
Van den Berg, F.; Tomasi, G. & Viereck, N. (2005). Warping: investigation of NMR pre-processing and correction, In: *Magnetic Resonance in Food Science: The Multivariate Challenge,* Engelsen, S.B.; Belton, P.S. & Jakobsen, H.J. (Eds.), pp. 131-138, Royal Society of Chemistry, ISBN 0854046488, Cambridge, UK
Van Nederkassel, A.M.; Daszykowski, M.; Eilers, P.H.C. & Vander Heyden, Y. (2006). A comparison of three algorithms for chromatograms alignment. *Journal of Chromatography A,* Vol.1118, No.2, (June 2006), pp. 199-210, ISSN 0021-9673
Vivó-Truyols, G.; Torres-Lapasió, J.R.; Caballero, R.D. & García-Alvarez-Coque, M.C. (2002). Peak deconvolution in one-dimensional chromatography using a two-way data approach. *Journal of Chromatography A,* Vol.958, No.1-2, (June 2002), pp. 35-49, ISSN 0021-9673
Vosough, M.; Bayat, M. & Salemi, A. (2010). Matrix-free analysis of aflatoxins in pistachio nuts using parallel factor modeling of liquid chromatography diode-array detection data. *Analytica Chimica Acta*, Vol.663, No.1, (March 2010), pp. 11-18, ISSN 0003-2670
Watson, N.E.; VanWingerden, M.M.; Pierce, K.M.; Wright, B.W. & Synovec, R.E. (2006). Classification of high-speed gas chromatography-mass spectrometry data by principal component analysis coupled with piecewise alignment and feature selection. *Journal of Chromatography A*, Vol.1129, No.1, (September 2006), pp. 111-118, ISSN 0021-9673
Yao, W.; Yin, X. & Hu, Y. (2007). A new algorithm of piecewise automated beam search for peak alignment of chromatographic fingerprints. *Journal of Chromatography A,* Vol.1160, No.1-2, (August 2007), pp. 254-262, ISSN 0021-9673
Yoshida, H.; Leardi, R.; Funatsu, K. & Varmuza, K. (2001). Feature selection by genetic algorithms for mass spectral classifiers. *Analytica Chimica Acta*, Vol.446, No.1-2, (November 2001), pp. 485-494, ISSN 0003-2670
Zhang, D.; Huang, X.; Regnier, F.E. & Zhang, M. (2008). Two-dimensional correlation optimized warping algorithm for aligning GC×GC-MS data. *Analytical Chemistry,* Vol.80, No.8, (April 2008), pp. 2664-2671, ISSN 0003-2700
Zhang, Z.M.; Chen, S. & Liang, Y.Z. (2010). Baseline correction using adaptive iteratively reweighted penalized least squares. *Analyst,* Vol.135, (February 2010), pp. 1138-1146, ISSN 0003-2654
Zorzetti, B.M.; Shaver, J.M. & Harynuk, J.J. (2011). Estimation of the age of a weathered mixture of volatile organic compounds. *Analytica Chimica Acta*, Vol.694, No.1-2, (May 2011), pp. 31-37, ISSN 0003-2670

### *Edited by Kurt Varmuza*

In the book "Chemometrics in practical applications", various practical applications of chemometric methods in chemistry, biochemistry and chemical technology are presented, and selected chemometric methods are described in tutorial style. The book contains 14 independent chapters and is devoted to filling the gap between textbooks on multivariate data analysis and research journals on chemometrics and chemoinformatics.

