**5. PARAFAC for parallel factor analysis, a generalization of PCA to 3-way data**

#### **5.1. Introduction**

Usually, physicochemical analysis, or more generally the monitoring of a number of physicochemical properties of a set of samples, results in a data matrix with samples in rows and physicochemical properties in columns. This data matrix is a mathematical representation of the characteristics of the sample set at time *t*, under the preparative and analytical conditions in force at that moment. Because a matrix is a two-dimensional array, we speak of two modes or 2-way data. Now, imagine that you repeat these measurements on several dates *t*1, *t*2, ..., *t*N. You no longer have one matrix *X* (*n*, *p*) but **N** matrices *X*i (*n*, *p*) of the same size, where *i* is the index of the matrix corresponding to time *t*i. This is known as 3-mode data, three-way data or a "data cube". Figure 8 below illustrates this arrangement.
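As a minimal sketch of this arrangement, the following NumPy snippet stacks several samples-by-properties matrices into a three-way array. The sizes (4 samples, 3 properties, 5 dates) and the random values are placeholders chosen for illustration only; they do not come from any data set discussed here.

```python
import numpy as np

# Hypothetical sizes: 4 samples, 3 properties, 5 measurement dates.
n_samples, n_properties, n_dates = 4, 3, 5

rng = np.random.default_rng(0)

# One (n, p) matrix X_i per measurement date t_i (random placeholder values).
matrices = [rng.random((n_samples, n_properties)) for _ in range(n_dates)]

# Stack the matrices along a new third axis to obtain the three-way array.
cube = np.stack(matrices, axis=2)

print(cube.shape)  # (4, 3, 5): samples x properties x dates

# Slicing along the third mode recovers the matrix measured at date t_2.
second_date = cube[:, :, 1]
```

Each slice `cube[:, :, i]` is one of the original two-way matrices, so no information is lost by the stacking.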


**Figure 8.** Representation of a matrix and a cube of data from experimental measurements.

At this stage, a key question arises: how should one analyse all these matrices so as to extract the relevant information while taking the time factor into account? Several options are available. The first would be, for example, to average each parameter per sample over the various dates/times. This immediately discards the time-based information: the effect of time on the measured parameters is no longer observable, and we obtain only an average matrix of the measured parameters over the study period. This is of no use for answering the question. Alternatively, we could apply the chemometric techniques discussed previously in this book, such as principal component analysis, to each stored matrix (or others not presented here, such as hierarchical classification). But then it becomes difficult to visualize the changes between successive matrices, which may correspond to an evolution with time (as in the case of 3D fluorescence spectra, commonly described as second-order data) or to a geographical evolution of the measured parameters (e.g., when studying the physicochemical properties of a river, port or lagoon in which samples are taken at various places to monitor the level of chemical pollution [*32*]). The overall interpretation is certainly more difficult.
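The loss of information caused by averaging can be illustrated with a toy example. The numbers below are invented for the purpose: one sample whose measured property increases over three dates and another whose property decreases.

```python
import numpy as np

# Hypothetical cube: 2 samples x 1 property x 3 dates.
# Sample 0 increases with time, sample 1 decreases.
cube = np.array([
    [[1.0, 2.0, 3.0]],
    [[3.0, 2.0, 1.0]],
])

# Averaging over the third (time) mode gives an ordinary
# samples x properties matrix.
mean_matrix = cube.mean(axis=2)

print(mean_matrix)  # both rows equal 2.0
```

The two samples become indistinguishable after averaging, although their time profiles are opposite.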

The solution must be sought in tools capable of taking into account a third or Nth dimension in the data while retaining the ability to account for interactions between factors. The following pages describe two of these tools. Probably the most famous is the PARAllel FACtor model (PARAFAC), introduced independently by Harshman in 1970 [*33*] and by Carroll & Chang [*34*], who named the model CANDECOMP (canonical decomposition). The oldest paper we found relating to the mathematical idea of PARAFAC was published by F.L. Hitchcock in 1927; it presents a method to decompose tensors or polyadics as a sum of products [*35*]. In this class of tools, the dimensions of the dataset are called "modes", hence the name multi-mode techniques. The most widely used term is "multiway" techniques.

#### **5.2. The PARAFAC model**


As discussed above, many chemical problems involve 3-way data tables. Let us come back to the hypothetical example of a liquid chromatograph coupled to a fluorescence spectrometer, which produces a data set consisting of "layers", i.e. 2-dimensional matrix sheets, each an excitation-emission fluorescence matrix (Excitation-Emission Matrix: EEM), which vary as a function of elution time.

The fluorescence intensities depend on the excitation wavelength, the emission wavelength and the elution time; these variables represent the three modes. The PARAFAC model is an effective way to extract useful information from such multiway data. It is an important tool for the qualitative and quantitative analysis of a set of samples characterized by bilinear data such as EEMs in fluorescence spectroscopy. One of the most spectacular examples of its capability is the analysis of a mixture of several pure chemical components characterized by bilinear responses. In this case, PARAFAC is able to identify the right pairs of profiles (e.g. emission and excitation profiles) of the pure components as well as their relative concentrations in the mixture. PARAFAC yields unique component solutions. The algorithm is based on least-squares minimization: the initial data table is decomposed by a procedure known as "trilinear decomposition", which gives a unique solution. The trilinear structure comes from the model and, in some cases, from the data themselves, which decompose naturally along 3 modes.

The PARAFAC model is a generalization of PCA, itself bilinear, to arrays of higher order (i.e., three or more dimensions). PCA decomposes the data into a two-mode product of a score matrix *T* and a loading matrix *P* describing the systematic variations in the data, plus a matrix of residuals *E* representing the discrepancies between the actual data and the model. As illustrated by figure 8, the EEMs are arranged in a 3-way table (*X*) of size *I* x *J* x *K*, where *I* is the number of samples, *J* the number of emission wavelengths, and *K* the number of excitation wavelengths. Similarly, PARAFAC decomposes *X* into three matrices (see figure 9): *A* (scores), *B* and *C* (loadings), with elements *a*if, *b*jf, and *c*kf. In other words, *F* sets of triads are produced, and the trilinear model is usually presented as in equation 3 [*36*]:

$$\begin{aligned} x_{ijk} &= \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \\ &\quad (i = 1 \dots I;\ j = 1 \dots J;\ k = 1 \dots K) \end{aligned} \tag{3}$$

where *x*ijk is the fluorescence intensity of the *i*th sample at emission wavelength *j* and excitation wavelength *k*. The number of columns *F* in the loading matrices is the number of PARAFAC factors, and *e*ijk are the residuals, which account for the variability not represented by the model.
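The least-squares fit of the trilinear model can be sketched with a basic alternating least squares (ALS) loop. This is an illustrative sketch under stated assumptions, not the implementation used by the authors: the function name `parafac_als`, the stopping rule, the random initialization and the synthetic two-component data (Gaussian-shaped emission and excitation profiles, invented concentrations) are all hypothetical choices. Dedicated multiway toolboxes add constraints, line search and diagnostics that are omitted here.

```python
import numpy as np

def parafac_als(X, n_factors, n_iter=1000, tol=1e-12, seed=0):
    """Fit x_ijk ~ sum_f a_if * b_jf * c_kf by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    # Random starting values (an assumption; other initializations exist).
    B = rng.random((J, n_factors))
    C = rng.random((K, n_factors))
    total = float(np.sum(X ** 2))
    previous_sse = np.inf
    for _ in range(n_iter):
        # Each update is the least-squares solution for one matrix
        # while the other two are held fixed.
        A = np.einsum("ijk,jf,kf->if", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,if,kf->jf", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,if,jf->kf", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        # Residuals e_ijk between the data and the current model.
        residuals = X - np.einsum("if,jf,kf->ijk", A, B, C)
        sse = float(np.sum(residuals ** 2))
        if previous_sse - sse <= tol * total:
            break
        previous_sse = sse
    return A, B, C, residuals

def gaussian(axis, centre, width=1.0):
    """Gaussian-shaped profile used as a stand-in for a spectrum."""
    return np.exp(-0.5 * ((axis - centre) / width) ** 2)

# Synthetic, noise-free example: 4 samples, 8 emission and 6 excitation channels.
A_true = np.array([[1.0, 0.2], [0.5, 1.0], [0.1, 0.7], [0.8, 0.4]])
emission = np.arange(8.0)
excitation = np.arange(6.0)
B_true = np.column_stack([gaussian(emission, 2.0), gaussian(emission, 5.0)])
C_true = np.column_stack([gaussian(excitation, 1.0), gaussian(excitation, 4.0)])
X = np.einsum("if,jf,kf->ijk", A_true, B_true, C_true)

A, B, C, residuals = parafac_als(X, n_factors=2)
rel_error = np.linalg.norm(residuals) / np.linalg.norm(X)
print(A.shape, B.shape, C.shape)
print(rel_error)
```

The recovered columns of `A`, `B` and `C` are determined only up to scaling and ordering of the factors, so they should be normalized before being compared with reference profiles.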


Note the similarities between the PARAFAC model and the PCA model in schema 2, § "II.A *Some theoretical aspects*". The PARAFAC model is a specific case of the Tucker3 model introduced by Tucker in 1966. The following section presents the essentials of the Tucker3 model and points to some important papers on the theory and applications of this multiway tool. The subsequent section gives a more complete bibliography related to the PARAFAC and Tucker3 models in various application areas.
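To make the analogy explicit, the two models can be written element-wise side by side. This is a sketch that assumes the usual PCA notation, with scores *t*if and loadings *p*jf (symbols not written out in this section), and the same number of factors *F* in both models:

$$\begin{aligned} \text{PCA:} \quad x_{ij} &= \sum_{f=1}^{F} t_{if}\, p_{jf} + e_{ij} \\ \text{PARAFAC:} \quad x_{ijk} &= \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \end{aligned}$$

PARAFAC thus adds, for each factor, one loading vector describing the third mode.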

#### **5.3. Tucker3: a generalization of PCA and PARAFAC to higher order**

Conceptually, the Tucker3 model [*37*] is a generalization of two-way data decomposition methods such as PCA or singular value decomposition (SVD) to higher-order arrays or tensors [8] and [9]. In such multiway methods, scores and loadings are not distinguished and are commonly treated as numerically equivalent. As a generalization of principal component analysis and PARAFAC to multiway data arrays, the Tucker3 model aims to represent the measured data as a linear combination of a small number of optimal, orthogonal factors. For a 3-way data array, the Tucker3 model takes the following form:

$$x_{ijk} = \sum_{u=1}^{r} \sum_{v=1}^{s} \sum_{w=1}^{t} g_{iu}\, h_{jv}\, e_{kw}\, c_{uvw} + \varepsilon_{ijk}$$

**Figure 9.** Principle of the decomposition of a 3-way data cube according to the PARAFAC model.

Here, **xijk** are the measured data; **giu**, **hjv** and **ekw** are the elements of the loading matrices for each of the three ways (with **r**, **s** and **t** factors, respectively); **cuvw** are the elements of the core array (of size **r** × **s** × **t**); and **εijk** are the elements of the array of residuals. A tutorial on chemical applications of Tucker3 was proposed by Henrion [33], while Kroonenberg [34] gives a detailed mathematical description of the model and discusses advanced issues such as data preparation/scaling and core rotation. For a complete and very pedagogical comparison of Tucker3 with PARAFAC, another multiway procedure, see Jiang [35]. Nevertheless, some aspects of the Tucker3 model that distinguish it from PARAFAC have to be discussed here. First, the Tucker3 model does not impose the extraction of the same number of factors for each mode. Second, the existence of a core array, *C*, governing the interactions between factors allows the modelling of two or more factors that might have the same chromatographic profile but different spectral and/or concentration profiles. Third, the presence of the core cube makes the Tucker3 model appear as a non-linear model, which is not always appropriate for problems having a trilinear structure. This limitation can be overcome by applying constraints to the core cube *C*. In particular, when *C* is constrained so that its only nonzero elements lie on the superdiagonal of the cube and the number of factors is the same in each mode, the resulting solution is equivalent to the PARAFAC model [*38*].
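The relation between the two models can be checked numerically. The sketch below rebuilds a residual-free array from Tucker3 factors and then verifies that a superdiagonal core with the same number of factors in each mode gives the PARAFAC reconstruction. All sizes and random values are arbitrary, and the superdiagonal is set to ones as a simplifying assumption (other nonzero values can be absorbed into the loadings).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 6

# Tucker3 allows a different number of factors in each mode.
r, s, t = 2, 3, 2
G = rng.random((I, r))
H = rng.random((J, s))
E = rng.random((K, t))
core = rng.random((r, s, t))

# x_ijk = sum_u sum_v sum_w g_iu * h_jv * e_kw * c_uvw (residuals omitted)
X_tucker = np.einsum("iu,jv,kw,uvw->ijk", G, H, E, core)

# Special case: F factors in every mode and a superdiagonal core.
F = 2
A = rng.random((I, F))
B = rng.random((J, F))
C = rng.random((K, F))
super_core = np.zeros((F, F, F))
diag = np.arange(F)
super_core[diag, diag, diag] = 1.0

X_from_tucker = np.einsum("iu,jv,kw,uvw->ijk", A, B, C, super_core)
X_parafac = np.einsum("if,jf,kf->ijk", A, B, C)

print(np.allclose(X_from_tucker, X_parafac))  # True
```

With a full core, every combination of factors (*u*, *v*, *w*) can contribute to the data; the superdiagonal constraint keeps only the terms with *u* = *v* = *w*, which is exactly the sum over *f* of the PARAFAC model.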

#### **5.4. Applications: Brief review**


Although PARAFAC and Tucker3 as factorial decomposition techniques date from the last century, their routine use in analytical chemistry became popular with Rasmus Bro's thesis in 1998. Of particular note are the remarkable PhD theses published and made available on the website of the Department of Food Science, Faculty of Life Sciences, University of Copenhagen, which apply the PARAFAC model and many other multivariate and multiway methods in many industrial food sectors<sup>6</sup>. Before the work of Bro, and more generally of the Danish team, a number of publications applied PARAFAC or Tucker3 in various sciences, but the literature does not appear to show a significant, focused and systematic use of this type of model in chemistry. Since the 2000s, multiway techniques have spread widely in analytical chemistry, in fields as diverse as food science and food safety, environment, sensory analysis and process chemistry. This explosion of applications in the academic and industrial sectors is linked to the popularization of analytical instruments that directly produce multiway data, particularly fluorescence spectrometers acquiring fluorescence matrices, which are naturally two-mode data, inherently respect the notion of trilinearity, and are therefore suitable for processing by models such as PARAFAC or Tucker3. Applications of three-way techniques are now too numerous to be cited in their entirety. Therefore, some of the more interesting applications are listed in table 8 below, and the reader is encouraged to refer to the reviews included in this table.

<sup>6</sup> Department of Food Science, Faculty of Life Sciences, University of Copenhagen, http://www.models.kvl.dk/theses, last visited April 2012.



#### **5.5. A research example: combined utilization of PCA and PARAFAC on 3D fluorescence spectra to study botanical origin of honey**

Our group previously demonstrated the value of chemometric methods for processing chromatograms in order to better discriminate between authentic and adulterated honeys by linear discriminant analysis [*61*]. An extension of this work was to quantify adulteration levels by partial least squares analysis [*62*]. This approach was investigated using honey samples adulterated from 10 to 40% with various industrial bee-feeding sugar syrups. Good results were obtained in the characterization of authentic and adulterated samples (96.5% correct classification) using linear discriminant analysis followed by a canonical analysis. This procedure works well, but data acquisition is rather long because of the chromatographic time scale. A new approach to honey analysis was recently investigated: Front-Face Fluorescence Spectroscopy (FFFS). The autofluorescence (intrinsic fluorescence) of intact biological samples is widely used in the biological sciences because of its high sensitivity and specificity. Such an approach considerably increases the speed of analysis and allows non-destructive analyses. The non-destructive mode of analysis is of fundamental scientific importance because it extends the exploratory capabilities of the measurements, allowing more complex relationships to be assessed, such as the effects of the sample matrix or the chemical equilibria occurring in natural matrices. For recent and complete reviews on the use of fluorescence spectroscopy applied to intact food systems, see [*63, 64*]. Concerning honey, FFFS was applied directly to honey samples for the authentication of 11 unifloral and polyfloral honey types [*65*] previously classified using traditional methods such as chemical, pollen, and sensory analysis.

Although the proposed method requires significant further work to confirm the chemometric model, the authors' conclusions are positive about the use of FFFS as a means of characterizing the botanical origin of honey samples. To the best of our knowledge, that paper is the first work to have investigated the potential of 2D front-face fluorescence spectroscopy to determine the botanical origin of honey at specific excitation wavelengths. We complete this work by adopting a 3D approach to the measurements. We present below the first characterization of three clear honey varieties (Acacia, Lavender and Chestnut) by 3D front-face fluorescence spectroscopy.

*5.5.1. Application of PCA*

*5.5.1.1. Samples*

This work was carried out on 3 monofloral honeys (Acacia: *Robinia pseudo-acacia*, Lavender: *Lavandula hybrida* and Chestnut: *Castanea sativa*). Honeys were obtained from French beekeepers. The botanical origin of the samples was certified by quantitative pollen analysis according to the procedure of Louveaux et al. [*66*]. An aliquot of 10 g of each honey sample was stirred for 10 min at low rotation speed (50-80 rpm) after slight warming (40 °C, for 1 h), allowing the analysis of honeys at room temperature while reducing potential difficulties due to the different crystallization states of the samples. Honey samples were pipetted into a 3 mL quartz cuvette and spectra were recorded at 20 °C.

**Table 8.** Bibliography related to PARAFAC and/or Tucker3 models. Theory and applications in areas such as chemicals and food science, medicine and process chemistry. Entries are grouped as: books and theses; scientific papers; reviews.
