#### *1.4.5 Partial Least Squares*

Partial Least Squares (PLS) regression and principal component regression (PCR) are quantitative regression algorithms commonly used for linear data; both are factor-based models. Instead of using a few selected wavelengths, PLS and PCR use information from all wavelengths in the entire NIR spectrum to predict sample composition. PLS is similar to PCR but more sensitive to variations in sample concentration. Studies by Wehling showed that PLS and PCR, as data-reduction approaches, condense a huge number of variables into a much smaller number of new variables that account for most of the variability among the samples [66]. The amount of a constituent in a sample can then be predicted from these new variables. PLS is the most widely used supervised multivariate data analysis method for estimating and quantifying components in a sample. Each training example is defined as a pair (*x*, *f(x)*), where *x* represents the input and *f(x)* is the output of the underlying unknown function; the objective of supervised learning is, given a set of examples of *f*, to return a function *h* that best approximates *f*. Osborne et al. indicated that PLS tends to generate solutions that need fewer factors than calibrations of comparable performance produced by PCR [53]. PLS is a regression algorithm that uses the concentration data during the decomposition process and packs as much information as possible into the first few loading vectors [67]. It performs a simultaneous decomposition of the spectral and concentration data: a small number of factors are extracted as linear combinations of the data, and regression on the scores of these factors is used to derive a prediction equation. To remove irrelevant spectral variables and improve model performance, several methods have been studied to select the optimal variables for multivariate calibration. Multivariate calibration builds a predictive model relating variables (wavenumbers) to properties of interest (concentration data). To address this common problem, a variety of linear regression methods based on latent variables (LVs) have been developed, such as PLS; however, drawbacks such as noise in the spectral data can inflate the calibration and prediction errors and degrade the model [68]. Regardless of the regression method, the initial stage of the process follows the typical development, optimization, and refinement steps. The main objective of any multivariate regression is to predict unknown samples with a known degree of certainty and high accuracy, using a process known in multivariate analysis as "validation". The established regression models must be sufficiently validated, usually with independent validation samples of known concentrations. The root-mean-square error of prediction (RMSEP) and the root-mean-square percent relative error (RMSRE) are used to assess the reliability and performance of the regression model for the accurate determination of analyte concentrations in validation or future samples.
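As an illustration of how such a PLS calibration and validation could be set up, the following is a minimal sketch using Python and scikit-learn; the data arrays, their sizes, and the choice of five latent variables are hypothetical and serve only to demonstrate the workflow.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data: rows are samples, columns are NIR wavelengths;
# y holds the reference concentrations of the constituent of interest.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 700))          # 100 spectra x 700 wavelengths
y = X[:, 40:60].mean(axis=1) + rng.normal(scale=0.05, size=100)

# Hold out an independent validation set, as required for "validation".
X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

pls = PLSRegression(n_components=5)      # number of latent variables (factors)
pls.fit(X_cal, y_cal)

y_pred = pls.predict(X_val).ravel()
rmsep = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSEP on validation samples: {rmsep:.4f}")
```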

The matrix *X*, containing the data provided by the NIR spectra, and the vector *Y*, containing the parameters to be determined, are used to build the regression model. The performance of the final PLS model is evaluated by the RMSEP and the correlation coefficient (R). RMSEP is defined as:

$$RMSEP = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{n}} \tag{3}$$

$$R = \sqrt{1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}} \tag{4}$$

where *n* represents the number of samples in the test set, *y*<sub>i</sub> is the reference measurement for test set sample *i*, and *ŷ*<sub>i</sub> is the result estimated by the model for test sample *i* (Eq. (3)). The correlation coefficient (R) between predicted and reference values is determined for both the calibration and the test set according to Eq. (4), where ȳ represents the mean of the reference measurements for all samples in the calibration and test sets. The best combination of spectral regions and pre-processing techniques is selected by picking the PLS model with a small RMSEP, a high R, and a low number of latent variables covering enough of the data variance. Model construction is based on test set validation with samples chosen randomly from the entire dataset and not used for model calibration. For PLS models, procedures based on spectral region selection can considerably improve the performance of full-spectrum calibration techniques, avoiding non-modeled interferences and yielding a well-fitted model [69–71]. Subsequent studies showed that selecting the spectral regions responsible for the property of interest is fundamental to increasing prediction performance [72, 73]. These procedures can be categorized into two classes: single wavelength selection and wavelength interval selection. Different strategies have been suggested for selecting the optimal set of spectral regions, such as interval PLS (iPLS), synergy interval PLS (siPLS), and moving window PLS (mwPLS) [69, 74, 75]. The principle of iPLS consists of splitting the spectrum into equal-width intervals and developing a sub-PLS model for each one; the sub-interval whose model gives the lowest RMSEP is chosen as the best. Several methods based on iPLS have been developed to optimize the combination of the selected intervals, such as synergy iPLS (siPLS) [74]. These methods have the significant advantage of providing a graphical presentation that focuses the selection on the better sub-intervals and allows comparison between the prediction performance of the local models and that of the full-spectrum model. Testing only a series of adjacent, non-overlapping intervals may miss more informative ones; mwPLS was proposed to overcome this drawback. This strategy builds a series of models in a window that moves through the complete spectrum and then selects the informative intervals with low model complexity and a low sum of residuals. Because it considers all possible continuous intervals, it can select all the possible informative intervals, though not necessarily the optimized ones [76].
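To make the iPLS procedure concrete, here is a minimal sketch that splits the spectrum into equal-width intervals, fits a sub-PLS model on each, and keeps the interval with the lowest RMSEP; it reuses the hypothetical calibration/validation arrays from the earlier example, and the interval and component counts are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

def ipls_best_interval(X_cal, y_cal, X_val, y_val, n_intervals=10, n_components=3):
    """Basic iPLS: split the spectrum into equal-width intervals, build a
    sub-PLS model for each, and return (best interval, its RMSEP)."""
    edges = np.linspace(0, X_cal.shape[1], n_intervals + 1, dtype=int)
    best_interval, best_rmsep = None, np.inf
    for lo, hi in zip(edges[:-1], edges[1:]):
        sub = PLSRegression(n_components=min(n_components, hi - lo))
        sub.fit(X_cal[:, lo:hi], y_cal)
        pred = sub.predict(X_val[:, lo:hi]).ravel()
        rmsep = np.sqrt(mean_squared_error(y_val, pred))  # Eq. (3)
        if rmsep < best_rmsep:                            # keep the lowest-RMSEP interval
            best_interval, best_rmsep = (lo, hi), rmsep
    return best_interval, best_rmsep

# Usage with the arrays from the PLS sketch above:
# interval, rmsep = ipls_best_interval(X_cal, y_cal, X_val, y_val)
```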

#### *1.4.6 Soft Independent Modeling of Class Analogy*

Soft Independent Modeling of Class Analogy (SIMCA) is a supervised discriminant analysis method based on PCA [77]. It is a class-modeling approach, meaning that, in defining the class boundaries, the method focuses on the similarities among samples from the same category [61, 78]. For each class, a PCA model is created, and the residual variance of an unknown sample is then compared with the residual variance of the modeled class to determine which category the sample belongs to. The number of PCs used in each class should be selected to achieve the best classification results. SIMCA results are presented in terms of "sensitivity" and "specificity", where the former specifies the percentage of samples truly belonging to the category that are correctly accepted by the class model, while
the latter expresses the percentage of objects from other classes that are correctly rejected. SIMCA starts from a principal component analysis (PCA) of only the training objects belonging to the category to be modeled, in order to "capture" the regular variability due to the similarities among samples of the same class [79, 80]. Once the PCA model is calculated, objects are accepted or rejected by the class model based on their reduced distance from the class space, referred to as *d*. For a generic *i*th sample, the *d* value is calculated by Eq. (5):

$$d_i = \sqrt{\left(\frac{T_i^2}{T_{0.95}^2}\right)^2 + \left(\frac{Q_i}{Q_{0.95}}\right)^2} = \sqrt{\left(T_{i,red}^2\right)^2 + Q_{i,red}^2} \tag{5}$$

where T<sup>2</sup><sub>i</sub> is the Mahalanobis distance of the sample from the center of the class space and Q<sub>i</sub> is its orthogonal distance from the PC subspace. These values are divided by T<sup>2</sup><sub>0.95</sub> and Q<sub>0.95</sub>, the 95th percentiles of the T<sup>2</sup> and Q distributions, giving the reduced T<sup>2</sup> (T<sup>2</sup><sub>red</sub>) and the reduced Q (Q<sub>red</sub>), respectively [79]. Due to this normalization, the limit values of T<sup>2</sup><sub>red</sub> and Q<sub>red</sub> are both equal to 1; a sample is then accepted by the class model if *d* < √2, and otherwise rejected.
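The acceptance rule of Eq. (5) can be sketched as follows. This is a deliberately simplified SIMCA fragment (assuming numpy and scikit-learn) that uses the empirical 95th percentiles of the training T<sup>2</sup> and Q values as the normalizing limits; other implementations derive these limits from parametric distributions instead.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_simca_class(X_class, n_components=2):
    """Fit a one-class PCA model and derive empirical 95th-percentile limits
    for T2 (Mahalanobis distance in the PC space) and Q (orthogonal residual
    distance from the PC subspace)."""
    pca = PCA(n_components=n_components).fit(X_class)
    scores = pca.transform(X_class)
    t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
    resid = (X_class - pca.mean_) - scores @ pca.components_
    q = np.sum(resid**2, axis=1)
    return pca, np.percentile(t2, 95), np.percentile(q, 95)

def simca_accepts(x, pca, t2_lim, q_lim):
    """Accept the sample if its reduced distance d (Eq. (5)) is below sqrt(2)."""
    s = pca.transform(x.reshape(1, -1))
    t2 = float(np.sum(s**2 / pca.explained_variance_))
    resid = (x - pca.mean_) - (s @ pca.components_).ravel()
    q = float(np.sum(resid**2))
    d = np.sqrt((t2 / t2_lim) ** 2 + (q / q_lim) ** 2)
    return d < np.sqrt(2), d
```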

#### *1.4.7 k-Nearest Neighbor*

*k*-Nearest Neighbor (k-NN) is a classification method based on the closest training examples in the feature space. If the majority of an unknown sample's *k* nearest neighbors in the training set belong to a specific class, the unknown sample is assigned to that class. The parameter *k* affects the performance of the k-NN model, and the Euclidean distance is the metric most commonly used in k-NN [81].
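A minimal k-NN classification sketch with scikit-learn; the spectra, the three class labels, and the choice *k* = 5 are all hypothetical placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Hypothetical spectra (rows) with integer class labels, e.g. rice varieties.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 700))
y = rng.integers(0, 3, size=120)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Classify each test spectrum by majority vote among its k = 5 nearest
# training spectra, measured with the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
```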

#### *1.4.8 Random Forest*

Random Forest (RF) is an ensemble machine learning algorithm that builds many decision trees, each grown from a bootstrap sample of the training data. At each node of a tree, the optimal split is chosen from a random subset of the variables, and the tree is extended to its maximum depth without pruning. Predictions for new data are obtained by combining the outputs of all trees. RF is fast and well suited to large amounts of data, with the advantage of reducing variance while achieving comparable classification accuracy [82, 83].
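A corresponding RF sketch, reusing the hypothetical `X_train`/`y_train`/`X_test` arrays from the k-NN example; the forest size of 500 trees is an illustrative choice.

```python
from sklearn.ensemble import RandomForestClassifier

# Each of the 500 trees is grown on a bootstrap sample of the training data;
# at each node the split is searched over a random subset of the variables
# (sqrt(n_features) here), and trees are grown to full depth without pruning.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)   # combines the votes of all the trees
```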

#### *1.4.9 Artificial Neural Networks*

Artificial Neural Networks (ANNs) are non-parametric regression models that can capture almost any phenomenon, to any degree of accuracy (depending on the adequacy of the data and the power of the predictors), without prior knowledge of the phenomenon. ANNs are applied to classification and function-mapping problems that are tolerant of some inaccuracy and have plenty of training data available, but to which hard-and-fast rules cannot easily be applied [84]. In an ANN, the input layer is linked to an output layer, either directly or through one or more hidden layers of interconnected neurons. The number of hidden layers defines the depth of an ANN, and the width depends on the number of neurons in each layer. Fast optimization algorithms iteratively perform forward and backward passes to minimize a loss function and learn the weights and biases of each layer. In the forward pass, the activation functions are applied to the current values of the weights at each layer, and the final result is a set of new predicted outputs. The backward pass computes the error derivatives between the predicted outputs and the actual outputs; these errors are then propagated backwards, updating the weights and calculating new error terms for each layer. Iterative repetition of this process is known as back-propagation [85]. A neural network is an adaptive system that learns relationships from the input and output data sets and can then make predictions for a previously unseen data set with characteristics similar to the input set [86, 87].

Multilayer perceptron (MLP) and radial basis function (RBF) networks are the neural network architectures most widely used in the literature for regression problems [88–90]. MLPs are usually used for prediction and classification, with suitable training algorithms for the network weights; the MLP is trained using the back-propagation learning algorithm. **Figure 5a** represents the three-layer MLP, the most basic ANN in its minimum configuration, consisting of three layers of nodes: (1) an input layer, (2) a hidden layer, and (3) an output layer. The input layer accepts the data, the hidden layer processes them, and the output layer produces the resulting outputs of the model [91, 92]. Each node, except those in the input layer, is a neuron with a non-linear activation function. The MLP can be regarded as a hierarchical mathematical function mapping a set of input values to output values via many simpler functions. Normally, the nodes are fully connected between layers, so the number of parameters quickly grows very large, with a considerable risk of overfitting [93]. The RBF network is a widely used ANN architecture that is simpler than the MLP (**Figure 5b**); it likewise has an input, a hidden, and an output layer. There are different types of radial basis functions, but the most widely used is the Gaussian function.
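As a minimal sketch of a three-layer MLP of the kind shown in **Figure 5a**, trained by gradient-based forward/backward passes, the following reuses the hypothetical `X_cal`/`y_cal`/`X_val` arrays from the PLS example; the hidden-layer width, activation, and solver are illustrative choices rather than recommended settings.

```python
from sklearn.neural_network import MLPRegressor

# A single hidden layer of 32 neurons gives the three-layer structure of
# Figure 5a; weights and biases are learned by gradient updates (adam) that
# minimize a squared-error loss over repeated forward and backward passes.
mlp = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   solver="adam", max_iter=2000, random_state=0)
mlp.fit(X_cal, y_cal)
y_hat = mlp.predict(X_val)
```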
