**2.3 Principal component analysis**

Among multivariate analyses, Principal Component Analysis (PCA) is the most common method for extracting information from the large datasets generated by analytical pyrolysis [3, 12]. PCA serves several purposes: it is mainly used to reduce the dimensionality of a dataset by extracting its most important information, but it is also useful for simplifying the description of the data series and for analyzing the structure of the observations and variables [63–65]. PCA generates principal components (PCs) that are linear combinations of the original variables (e.g., the identified compounds). The number of these new variables can

**Figure 5.**

*Comparison of different methods for calculating the optimal number of k clusters. A) Optimal number of k clusters suggested by the majority rule by analysing all indexes. B) Elbow method. C) Silhouette method. D) Gap Statistic method.*

be arbitrarily defined. By convention, the first component explains the largest possible variance of the dataset, and the second, being orthogonal to the first, is calculated to explain the largest possible remaining variance. The factor scores correspond to the values of these new variables for the original observations (e.g., relative abundances of the compounds). The eigenvalue associated with each component corresponds to the sum of the squared factor scores for that component. Thus, the contribution of each observation to a component (i.e., the importance of the observation) is the ratio of the observation's squared factor score to the eigenvalue associated with that component. Contributions for a given component take values between zero and one, and the sum of all contributions for that component equals one [65]. Alternatively, the correlation of the two new variables generated by the PCA can be represented by a biplot [66], which reveals the compounds that contribute the most to the groupings obtained in the PCA (**Figure 6**). As stated, the first two components extracted by the PCA represent the largest variances in the data series. However, to determine the optimal number of components to retain, it is suggested to perform the "scree" test, plotting the eigenvalues as a decreasing function of their size [64]. In this plot, an "elbow" is observed at the point where the slope of the curve flattens; the optimal number of components includes all the components before that point (**Figure 7A**).
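The quantities described above (factor scores, eigenvalues, and observation contributions) can be sketched with scikit-learn and NumPy. This is a minimal illustration on a synthetic abundance matrix, not the original study's workflow; the data and dimensions are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical relative-abundance matrix: 30 samples x 12 compounds
X = rng.random((30, 12))

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X))  # factor scores

# The eigenvalue of each component equals the sum of its squared factor
# scores (here divided by n - 1, the covariance-estimate convention
# scikit-learn uses for explained_variance_).
eigenvalues = (scores ** 2).sum(axis=0) / (X.shape[0] - 1)

# Contribution of each observation to a component: its squared factor
# score divided by the component's total squared scores; the
# contributions for any one component sum to one.
contributions = scores ** 2 / (scores ** 2).sum(axis=0)

# Scree data: eigenvalues in decreasing order; look for the "elbow".
print(np.round(pca.explained_variance_[:5], 3))
```

Plotting `pca.explained_variance_` against the component index gives the scree plot used for the "elbow" test described above.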

### **2.4 Classification of samples using only the most informative compounds**

Multivariate analyses are very useful when working with large amounts of data. If lignocellulosic samples are analyzed by Py-GC/MS and the deconvolution method is applied, hundreds of derived compounds can be expected for each sample [21]. PCA and clustering analysis each help to reduce the dimensionality of the datasets, to identify relationships between the variables, and to quantify the significance of the variables that explain the resulting clusters [67].

*Cheminformatics Applied to Analytical Pyrolysis of Lignocellulosic Materials DOI: http://dx.doi.org/10.5772/intechopen.100147*

**Figure 6.**

*PCA results: the correlation between the variables generated by the PCA for lignin derivatives is shown. A) Compounds clustered according to their origin: C, catechols; H, phenols; G, guaiacols. B) Biplot that represents the correlation between variables. C) Confidence intervals for the correlation between variables; ellipses represent a significance level of 99%.*

The dimensionality of the data directly influences the results: the higher the dimensionality, the more reliable the classifications obtained [68, 69]. For the analysis of chemical compounds in materials, the optimal ratio of data points to variables is 6:1 or higher, with an absolute minimum of 3:1 [69–71]. However, achieving these high ratios requires increasing the number of experiments. An alternative to achieve the optimal relationship

**Figure 7.**

*HCPC analysis for minimizing noise resulting from Py-GC/MS analysis. A) Scree plot, to determine the number of components that explain most of the variance. Number of components used = 5. B) Optimal number of k clusters. Optimal k clusters suggested by the majority rule = 4. C) Factorization of the data series using the PCA. D) Initial hierarchical clustering on the reduced matrix generated by the PCA. E) Clustering obtained using the number of k clusters suggested by the majority rule (the same suggested by the "elbow" method). F) Clusters obtained using a non-optimal number of k clusters.*

when it is not possible to increase the number of experiments is to reduce the number of variables [68]. In that sense, HCPC analysis is a very powerful tool (**Figure 7A**–**C**). Compared with PCA and CA alone, HCPC analysis increases the objectivity and robustness of the results, because the classifications are restricted to the dimensions that contain the most significant information [67, 72]. In this way, the statistical noise caused by the many uninformative pyrolysis derivatives is minimized [21]. In addition, HCPC improves the visualization of the data and provides information on the variables (i.e., compounds) that contribute predominantly to the resulting clusters [21, 67]. HCPC is an exploratory statistical analysis whose computational algorithm can be summarized in three steps. First, the dimensions are reduced by a factorial method: PCA for quantitative variables, multiple correspondence analysis for categorical data, or multiple factor analysis to jointly integrate different data blocks [72, 73]. This step allows the relationships between the concentrations of the most abundant compounds and the trace compounds to be determined. In addition, it simplifies the dataset by reducing the number of variables to only two principal components that explain most of the variance [74] (**Figure 7C**). Second, hierarchical cluster analysis (HCA), using the Euclidean distance, forms clusters of samples according to the similarities in their chemical composition [73, 74] (**Figure 7D**). Each object is treated as a single cluster


and pairs of groups are successively merged until all clusters merge into one large group [48]. The algorithm uses Ward's method to minimize the total intragroup variance [47, 72, 75]. Finally, partitioning with *k*-means stabilizes the groupings obtained by the HCA [67, 73] (**Figure 7E**). In this way, HCPC applied to Py-GC/MS data from lignocellulosic materials allows the samples to be classified based on the abundance patterns of the most informative compounds; the statistical noise generated by uninformative, ambiguous, or noisy compounds is suppressed [21].
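The three HCPC steps (factorial reduction, Ward hierarchical clustering, *k*-means consolidation) can be sketched with scikit-learn and SciPy. This is a simplified illustration on synthetic data, not the implementation used in the cited work (which is typically the FactoMineR `HCPC` function in R); all sizes and group structure below are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical pyrogram matrix: three groups of 10 samples x 20 compounds
X = np.vstack([rng.normal(loc, 0.3, (10, 20)) for loc in (0.0, 2.0, 4.0)])

# Step 1: factorial method (PCA here) keeps only the informative dimensions
scores = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# Step 2: hierarchical clustering on the scores (Ward linkage minimizes
# the total within-cluster variance, using Euclidean distances)
tree = linkage(scores, method="ward")
k = 3
labels_hca = fcluster(tree, t=k, criterion="maxclust")

# Step 3: consolidate the partition with k-means seeded at the HCA centroids
centroids = np.array([scores[labels_hca == c].mean(axis=0)
                      for c in range(1, k + 1)])
labels = KMeans(n_clusters=k, init=centroids, n_init=1).fit_predict(scores)
print(np.bincount(labels))
```

Because clustering is done on the retained components rather than the full compound table, the noisy trailing dimensions never enter the distance computations.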

### **2.5 Simplified visualization of abundance and similarity patterns from Py-GC/MS data**

The heat map method is a simple but highly efficient tool for the graphical representation of large datasets (**Figure 3**). It is very useful in studies that require the interpretation of large amounts of quantitative data, e.g., metabolomics, proteomics, lipidomics, and genomics [76–78]. The quantitative data (i.e., relative abundances of the ions detected by the MS) are represented on a color scale in the format of a two-dimensional matrix [79, 80]. The basic structure of the matrix is given by columns and rows: each column represents a sample and each row represents a compound [76]. The quantitative values correspond to the relative abundance of each compound in each sample. Each range of values is assigned a particular color; the highest relative abundances are represented by one end of the color scale and the lowest by the opposite end [77]. Additionally, the columns and rows of the matrix are rearranged to reveal significant patterns in the heat map. To do this, rows and columns with similar profiles are placed closer to each other, making these profiles easily visible to the eye [79, 80]. The permutation of rows and columns is based on the result of the CA on the correlation matrix of each set of variables [77]. Alternatively, the dendrograms resulting from the CA can be drawn along the edges of the matrix, both for the samples and for the compounds [77, 79, 81, 82]. This form of representing the relative abundances is so efficient that, after rearranging the rows and columns of the matrix, the abundance patterns of the compounds become obvious [76, 83].

The standardization (e.g., Z-transformation) of the variables in each data series strongly influences the correct representation of the similarity patterns obtained [77, 80]. If raw, non-standardized data are used, the low relative abundances will be obscured by the higher relative abundances (**Figure 4A**–**C**). When transformed data are used, it is possible to infer that compounds with similar abundance patterns share a common origin [21, 79].
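The Z-transformation and the clustering-based reordering of rows and columns can be sketched with SciPy. This is a minimal illustration on a hypothetical abundance matrix; plotting libraries such as seaborn (`clustermap`) wrap the same steps, but only the reordering logic is shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.stats import zscore

rng = np.random.default_rng(2)
# Hypothetical abundance matrix: rows = compounds, columns = samples
abund = rng.random((8, 6))

# Z-transform each compound (row) so that low-abundance patterns are
# not obscured by the dominant peaks
Z = zscore(abund, axis=1)

# Reorder rows and columns by hierarchical clustering so that similar
# profiles sit next to each other in the heat map
row_order = leaves_list(linkage(Z, method="average"))
col_order = leaves_list(linkage(Z.T, method="average"))
heat = Z[np.ix_(row_order, col_order)]
print(heat.shape)
```

The reordered matrix `heat` is what would be rendered on a color scale, with the two `linkage` trees optionally drawn as dendrograms along its edges.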

An interactive variant of the heat map method has been reported by several authors in the field of metabolomics [76, 84, 85], and it can of course also be applied to Py-GC/MS data. This online variant allows important information from the mass spectra to be visualized on the matrix: metadata such as the mass spectrum, retention time, extracted ion chromatograms (EICs), box-and-whisker plots, and matches for each compound can be displayed in real time for each observation [76, 86].

On the other hand, alternative methods for interpreting Py-GC/MS data have emerged recently. Van Krevelen (VK) diagrams have been successfully applied to the interpretation of high-resolution GC/MS data [3, 87]. These diagrams visualize the chemical composition of complex mixtures by plotting the H:C ratio against the O:C ratio for every compound in the mixture [6]. Thus, VK diagrams provide information about the classes of compounds present and allow an accurate evaluation of the number of compounds in

a sample [88]. Furthermore, VK diagrams play an important role in the deconvolution of high-resolution MS spectra for complex lignin samples [6].
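Computing the coordinates of a VK diagram amounts to deriving H:C and O:C atomic ratios from molecular formulas. The sketch below uses a deliberately simple formula parser (sufficient for C/H/O formulas, not a general-purpose one); the `hc_oc` helper and the example compounds are illustrative choices, not part of the cited workflow.

```python
import re

def hc_oc(formula):
    """Return (H:C, O:C) atomic ratios from a molecular formula string."""
    # Simple parser: element symbol followed by an optional count.
    # Duplicate elements are not summed, which is fine for plain
    # C/H/O formulas like those of common pyrolysis products.
    counts = {el: int(n or 1)
              for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}
    c = counts.get("C", 0)
    return counts.get("H", 0) / c, counts.get("O", 0) / c

# Hypothetical pyrolysis products and their formulas
for name, f in [("guaiacol", "C7H8O2"),
                ("levoglucosan", "C6H10O5"),
                ("catechol", "C6H6O2")]:
    h_c, o_c = hc_oc(f)
    print(f"{name}: H/C={h_c:.2f}, O/C={o_c:.2f}")
```

Scattering the resulting (O:C, H:C) pairs places carbohydrate-derived and lignin-derived products in distinct regions of the plot, which is what makes the diagram useful for classifying compound families.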
