**2. Cheminformatics applied to Py-GC/MS**

Increased computational capacity, the development of powerful deconvolution algorithms and technological advances in analytical equipment have enabled the design of specialized software for chemical analysis. Areas such as the omics sciences have particularly benefited from the rise of cheminformatics [26]. However, the application of untargeted analysis is becoming broader and is no longer restricted to the discovery and characterization of compounds in metabolomics. In this sense, spectral deconvolution software can be used to process the data resulting from Py-GC/MS [21].

Open-source software follows the same principle as native GC/MS software for spectral deconvolution and compound identification. However, it accepts different input formats for the raw datasets, regardless of the type, resolution and brand of the GC/MS equipment [26, 28]. In addition, different parameters can be adjusted to improve the informative quality of the results, e.g., the deconvolution parameters, the use of quality controls and normalization of relative abundances for a batch of samples, alignment parameters, and compound identification settings such as the use of different reference libraries for mass spectra, retention indices (RI) and retention times (RT).

Because Py-GC/MS produces a large number of derived compounds, a large amount of information is generated (i.e., the mass spectra recorded by the MS detector). Omics tools allow deconvolution of all acquired mass spectra for a batch of samples in independent experiments. Basically, peaks are detected by deconvolution of the mass spectra, smoothing the data points by the least-squares method or by a linear-weighted moving average [28, 39]. Afterwards, the first and second derivatives are considered together with the amplitude of the ions to identify the noise threshold. Based on the noise levels, the initial retention times are calculated for each peak.
For the final detection of the peaks, the unsmoothed raw chromatogram is used as a control [28]. The deconvoluted spectra for the batch of samples are aligned based on the similarity of their mass spectra and their RTs. Finally, they are compared with the spectra in the reference MS libraries, and compounds are identified based on the best fit of their RT, RI and mass spectra [26]. Additionally, the deconvoluted datasets for a batch of samples can be normalized and exported in table format. The information contained in the output file is important for comparative analysis, i.e., the EI fragmentation pattern, quant mass (m/z of the main ion), averaged RT, InChIKey, total similarity with the reference spectrum, and the relative abundance of each compound normalized over the entire batch of samples [28]. This information can be used for comparative analysis by multivariate methods. Alternatively, it can be compared with databases such as the Chemical Entities of Biological Interest (ChEBI) ontology [25] to infer biological characteristics of the original samples based on their pyrolysis derivatives [21].
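The smoothing- and derivative-based peak picking described above can be sketched in a few lines of Python. This is a simplified, illustrative implementation: the function name, window settings and the MAD-based noise estimate are assumptions for the sketch, not the actual algorithm of any of the cited software packages.

```python
import numpy as np
from scipy.signal import savgol_filter

def detect_peaks(intensity, window=11, polyorder=3, noise_factor=3.0):
    """Toy peak picker: least-squares (Savitzky-Golay) smoothing,
    then first/second derivatives plus a noise-based amplitude cut."""
    smooth = savgol_filter(intensity, window, polyorder)
    d1 = np.gradient(smooth)   # first derivative
    d2 = np.gradient(d1)       # second derivative
    # Estimate baseline noise via the median absolute deviation
    noise = 1.4826 * np.median(np.abs(intensity - np.median(intensity)))
    threshold = noise_factor * noise
    peaks = []
    for i in range(1, len(smooth)):
        # Apex: first derivative crosses zero from positive to negative,
        # curvature is negative, and amplitude clears the noise threshold
        if d1[i - 1] > 0 >= d1[i] and d2[i] < 0 and smooth[i] > threshold:
            peaks.append(i)  # scan index, convertible to a retention time
    return peaks
```

In real software, the retention time of each detected apex would then be read from the scan index, and the unsmoothed raw chromatogram consulted as a control for the final detection, as described above [28].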

The comparative analysis of lignocellulosic samples benefits greatly from normalizing the data obtained for a batch of samples [21]. Normalization of the drift in MS signal intensities is carried out by including a series of quality control (QC) samples. A QC sample is obtained by combining all samples in the batch. For lignocellulosic materials it is suitable to inject one QC sample after every five samples analyzed [21]. The data obtained from the measurement of the QC samples are smoothed by LOWESS (locally weighted least-squares regression of first degree). The smoothing coefficients generated from the QC samples are interpolated with a cubic spline, and finally all datasets are corrected based on the spline interpolation result [28].
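The QC-based drift correction can be sketched with numpy/scipy. The minimal LOWESS helper and the division by the median QC intensity are illustrative assumptions for this sketch, not the exact procedure of the cited software.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def lowess_1d(x, y, frac=0.7):
    """Minimal first-degree LOWESS: a tricube-weighted local linear
    fit around each point (stand-in for a full implementation)."""
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]              # k nearest neighbours
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
        sw = np.sqrt(w)
        A = np.vstack([np.ones(k), x[idx]]).T * sw[:, None]
        beta, *_ = np.linalg.lstsq(A, y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

def qc_drift_correction(run_order, intensity, qc_mask):
    """Fit the QC intensity trend with LOWESS, interpolate it over the
    whole run order with a cubic spline, and divide each intensity by
    the interpolated correction factor."""
    xq = run_order[qc_mask].astype(float)
    trend = lowess_1d(xq, intensity[qc_mask])
    spline = CubicSpline(xq, trend)
    factor = spline(run_order) / np.median(intensity[qc_mask])
    return intensity / factor
```

With a linear instrumental drift, this scheme recovers essentially constant corrected intensities for a compound whose true abundance does not change across the batch.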

Additionally, unknown compounds can be annotated using their elemental formulas and *in silico* mass spectral fragmentation based on public spectral databases such as MassBank, LipidBlast and GNPS [27, 28]. Currently, most open-access MS reference libraries focus on compounds of interest in metabolomics and lipidomics. Several of them nevertheless include precursors or derivatives of lignocellulosic biomass, such as anhydro sugars, furans, pyrans and phenols, and their derivatives. As the areas of application of omics tools for spectral deconvolution and compound annotation diversify, the diversity and number of compounds incorporated in open-access databases can be expected to increase.

### **2.1 Multivariate analysis on exported Py-GC/MS data**

Interpretation of the results obtained by Py-GC/MS is a complex process, owing to the large number of compounds generated by pyrolysis and to the limited information provided by the often very numerous compounds of ambiguous origin (as described above). Multivariate analysis applied to Py-GC/MS data from various materials makes data management easier, reduces the dimensionality of the information obtained and facilitates interpretation of the results. It has been used to characterize lignocellulosic samples and other biological samples [40–43].

A common application of Py-GC/MS in materials analysis is the classification of samples based on the similarity of the compounds they produce, for example, to evaluate different experimental systems [44, 45] or to optimize two different methods [46]. Recently, it was used to characterize and classify lignocellulosic samples by applying cheminformatics from a chemotaxonomic approach [21].

Classification of the observations into groups requires the calculation of the distance between each pair of observations. The result is a distance matrix, also called a dissimilarity matrix. The distance most commonly used by computational algorithms is the Euclidean distance [47], i.e., the square root of the sum of squared differences between a pair of vectors [48]. As a result, observations with high feature values will be grouped together; likewise, observations with low feature values will be grouped together.
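As a toy illustration of the dissimilarity matrix, the Euclidean distances between three hypothetical samples can be computed as follows; the abundance values are made up purely for this sketch.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows = observations (samples), columns = features
# (e.g., normalized abundances of three pyrolysis products).
X = np.array([
    [10.0, 8.0, 0.5],   # sample A
    [9.5,  7.5, 0.6],   # sample B: profile similar to A
    [1.0,  0.8, 6.0],   # sample C: very different profile
])

# Euclidean distance: root of the sum of squared differences
D = squareform(pdist(X, metric="euclidean"))
```

Samples A and B, with similar abundance profiles, end up at a small mutual distance and would be grouped together, while sample C is far from both.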

Apart from the normalization performed by the spectral deconvolution software on the output datasets, it is highly recommended to standardize the variables before measuring the dissimilarities between observations [49]. This step is considered necessary, as it can have a great impact on the results of the analysis of biological data [49, 50]. **Figure 4** represents the differences between non-standardized and standardized data. In standardization, the values of each variable are weighted by a scale factor in order to give more weight to small but potentially significant changes in signal intensity [51]. Thus, the standard deviation and the mean usually take values of one and zero, respectively. In addition, standardization helps to obtain equivalent similarities regardless of the distance method used (e.g., Euclidean, Manhattan, Correlation or Eisen). For example, when using standardized data there is a functional relationship between Pearson's correlation coefficient and the standardized Euclidean distance, so that both results are comparable [48].
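To make this relationship concrete, a short numpy sketch (with synthetic data) shows that for z-scored vectors of length n, the squared Euclidean distance equals 2n(1 − r), where r is Pearson's correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)

def zscore(v):
    # Standardize to mean 0 and standard deviation 1
    # (population std, ddof=0, which gives the factor n below)
    return (v - v.mean()) / v.std()

zx, zy = zscore(x), zscore(y)
n = len(x)
r = np.corrcoef(x, y)[0, 1]      # Pearson's correlation coefficient
d = np.linalg.norm(zx - zy)      # Euclidean distance on z-scores
# Functional relationship: d**2 == 2 * n * (1 - r)
```

Because the distance is a monotonic function of the correlation, rankings of pairwise similarities obtained with either measure on standardized data are directly comparable.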
