**1. Introduction**

### **1.1 Rice (*Oryza sativa* L.): biochemical and physical characteristics**

Rice (*Oryza sativa* L.), the principal staple food for half of the world's population, has been consumed since ancient times and is one of the most important sources of dietary proteins, carbohydrates, vitamins, minerals and fiber [1]. Rice belongs to the family of cereal grasses, along with wheat, corn, millet, oats, barley, rye, and numerous others. It is normally an annual plant, with rounded, hollow, jointed stalks (stems), flat leaves and a terminal panicle. Rice is considered the only cereal adapted to grow in either flooded or non-flooded soil, and it is cultivated under a wide range of climatic and geographic conditions. The diversity of rice grains and their quality are important factors for producers and consumers and depend on genetic characteristics and growing conditions. The grain is the seed of rice which, once the egg is fertilized, contains an embryo that is able to germinate and give rise to a new plant. It consists of the mature ovary, the lemma and palea (shell), the rachilla, the sterile lemmas and the wing (not always present). The embryo, located on the ventral side of the spikelet, has an embryonic root. The rest of the grain structure consists mainly of the endosperm (the edible portion), which contains starch, proteins, carbohydrates, fat, crude fiber and inorganic matter. The rough rice kernel includes the husks or hulls and pedicel, as well as the caryopsis (**Figure 1**). The weight distribution of the rice caryopsis throughout the maturation phase is as follows: pericarp (1–2%), tegument and aleurone (5%), starchy endosperm (89–91%) and embryo (2–3%) [3]. A rice caryopsis (rice seed or whole rice grain) tends to accumulate rapidly during the developmental phase, over 5 to 15 days after fertilization under ideal conditions for development.
Starch accumulates at its highest concentration in the starchy endosperm. Small amounts of starch are found in the subaleurone layer and very small amounts are present in the embryo and aleurone layer [4]. Functional proteins are present in different tissues of the embryo during development, and storage proteins also accumulate in these tissues [5]. Storage proteins are found in high amounts in the starchy endosperm; however, the protein concentration is higher in the aleurone layer than in the subaleurone layer and the starchy endosperm [3]. Lipids, in the form of lipid bodies, begin to accumulate about five days after anthesis and increase in content in conjunction with starch and protein, although they can continue to accumulate for a longer period [6]. The biological activity of the pericarp and seed coat during development is important for cereals, including rice, but the synthetic activity of these maternal tissues covering the seed begins to decline before the endosperm and embryo reach maturity [7].

#### **Figure 1.**

*Parts of rough rice grain. 1-Scutellum (Cotyledon); 2-Coleoptile; 3-Epicotyl (Plumule); 4-Apical meristem; 5-Radicle; 6-Coleorhiza; 7-Pericarp; 8-Tegmen (Seed coat); 9-Aleurone layer; 10-Subaleurone layer; 11-Starchy endosperm; 12-Lemma; 13-Palea; 14-Sterile lemmas; 15-Rachilla; 16-Part of pedicel. Adapted from: [2].*

#### *Near-Infrared Spectroscopy and Machine Learning: Analysis and Classification Methods of Rice DOI: http://dx.doi.org/10.5772/intechopen.99017*

Many characteristics of grain quality, such as milling behaviour, appearance, nutritional properties, and cooking qualities, have been routinely evaluated [8]. The evaluation methods for rice varieties are based on their chemical composition (protein, moisture, fat, and ash), apparent amylose concentration, gelatinization temperature, gel consistency and dough viscosity. These procedures rely on standardized methods, which are often considered slow and expensive [8]. The classification and characterization of different types of rice depend on several physicochemical parameters, namely biometric data and protein, fat, ash, moisture, starch and amylose content, among others.

Starch is one of the main components of rice grain and its essential carbohydrate reserve, hence its impact on the evaluated physicochemical parameters. Starch is a complex polysaccharide built exclusively from α-D-glucose units and is composed of two classes of glucose polymers: amylose and amylopectin. In amylose (20–30%), the D-glucose units are joined by a sequence of α-D-(1,4)-glucosidic linkages, giving rise to a linear or helical chain. The much less frequent α-(1,6)-glucosidic linkages form the branch points between the chains, thereby creating the highly branched domains of amylopectin (70–80%) [9]. Amylose is considered the most important determinant of the eating quality of rice and, based on its content, rice varieties can be classified as: waxy (0–2%); very low (3–12%); low (13–20%); intermediate (21–25%) and high (>26%) [10]. The classical and still commonly used method for amylose and amylopectin determination is the iodine reaction coupled with potentiometric or amperometric titration. Other methods, such as differential scanning calorimetry [11] and potentiometric [12], spectrophotometric [13], and chromatographic [14, 15] methods, can also be used for classification and detailed analysis. The fine structure of amylose, in both molecular size and chain-length distribution, is also a significant factor in the hardness of cooked rice [16]. Amylose content is correlated with retrogradation behavior, influencing the textural properties of cooked rice and the dynamic viscoelasticity of rice starch gel [17]. Grain elongation, volume expansion and water absorption characteristics also contribute to cooked rice quality [18].
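The amylose classes quoted above can be expressed as a small lookup. The sketch below is only illustrative: the function name `classify_amylose` is hypothetical, and because the published ranges leave gaps (for example between 2% and 3%), it assumes that upper bounds are inclusive and that values falling in a gap are assigned to the next class up.

```python
def classify_amylose(amylose_percent: float) -> str:
    """Return the amylose class for an apparent amylose content given in %."""
    if amylose_percent < 0:
        raise ValueError("amylose content cannot be negative")
    # (inclusive upper bound, class label); gap handling is an assumption,
    # the source ranges are 0-2, 3-12, 13-20, 21-25 and >26.
    classes = [
        (2.0, "waxy"),
        (12.0, "very low"),
        (20.0, "low"),
        (25.0, "intermediate"),
    ]
    for upper_bound, label in classes:
        if amylose_percent <= upper_bound:
            return label
    return "high"


for value in (1.0, 18.0, 23.0, 28.0):
    print(value, classify_amylose(value))
```

A different convention for the gaps (for instance rounding to the nearest integer first) would be equally defensible; the source does not specify one.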

Protein and lipid contents are also characteristics currently accepted to define rice quality [19]. After starch, protein is the second main component of rice and comprises four fractions: albumin (soluble in water), globulin (soluble in salt), glutelin (soluble in alkali), which is the dominant protein in brown rice and white rice, and prolamin (soluble in alcohol), a secondary protein in all rice mill fractions [20, 21]. Lipids are the third major component of brown rice, after carbohydrates and protein, and play a major role in the quality of rice during processing and storage. Fats or lipids are mainly concentrated in the outer bran layer of brown rice, up to 20% by mass; therefore, the lipid content of brown rice is greater than that of milled rice [19, 22].

Appearance quality is how the rice appears after milling and it is associated with grain length, width, length-width ratio (shape) and translucency/chalkiness of the endosperm. Generally, most markets prefer translucent rice to chalky rice. Appearance quality has a direct influence on the marketability and success of commercial varieties. The physical properties of rice grain include all of its external or integral characteristics, such as its appearance (size, shape, smoothness, colour), weight, hardness, volume, flow properties and so on (**Figure 2**).

**Figure 2.** *Aspects of rice grains.*

Rice classification and the consequent analysis provide a comprehensive quality indicator, not only in terms of appearance but also of cooking and processing qualities. The physical properties of rice are fundamental in all activities related to the production, preservation and utilisation of rice [23]. Parameters such as dimensions, density, hardness, friction and mechanical properties are affected by the moisture content of the grain and its degree of milling, and also, to a small extent, by temperature. Cereal research, as well as the grading and evaluation of food products, has encouraged the development of non-destructive, rapid and accurate analytical techniques to evaluate grain quality and safety; these techniques generate a huge amount of experimental data that must be accurately analysed [24]. Different types of rice vary in size, shape, color and constitution, and cannot be accurately identified by visual inspection. High-quality rice seed cultivars are often counterfeited with low-quality cultivars or confused with other cultivars, which compromises rice quality, yield and value. For this reason, the identification of rice seed cultivars is extremely important.

Grain appearance is characterized by biometric parameters (length, width, length/width ratio), total whiteness, vitreous whiteness, and chalkiness, and is considered a crucial factor affecting market acceptability. Grain shape can be described by biometric parameters, which are closely associated with grain weight [25, 26]. The ratio of length to width is used internationally to describe the shape and class of a variety. Grain weight provides information about the size and density of the grain. Grains of different density mill differently, and are likely to retain moisture differently and cook differently. Uniform grain weight is important for consistent grain quality [27]. Chalkiness, an opaque white discoloration of the endosperm, reduces the value of head rice kernels and decreases the ratio of head to broken rice produced during the milling process [28]. Viscosity indicates some of the cooking properties of rice and is evaluated by Rapid Visco Analysis (RVA), which mimics the process of cooking and monitors the changes in a slurry of rice flour and water during the test. Starch viscosity curves are useful for breeding because the shape of the curve is unique to each class of rice [29]. The primary RVA parameters include peak viscosity, PV (first peak viscosity after gelatinization); trough or hot paste viscosity, HPV (paste viscosity at the end of the 95 °C holding period) and final or cool paste viscosity, CPV (paste viscosity at the end of the test) [30]. The breakdown (BD = PV − HPV), setback (SB = CPV − PV), consistency (CS = CPV − HPV), setback ratio (SBR = CPV/HPV) and stability (ST = HPV/PV) are considered secondary parameters, since they are derived from the primary ones [30–32]. Other factors include peak time (time required to reach peak viscosity) and pasting temperature (temperature of initial viscosity increase) [33].
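Since the secondary RVA parameters are simple arithmetic on the three primary ones, they are easy to compute programmatically. The following minimal sketch applies the definitions given above; the function name `rva_secondary` and the viscosity readings are hypothetical and serve only as an illustration.

```python
def rva_secondary(pv: float, hpv: float, cpv: float) -> dict:
    """Derive the secondary RVA parameters from PV, HPV and CPV."""
    return {
        "BD": pv - hpv,    # breakdown
        "SB": cpv - pv,    # setback
        "CS": cpv - hpv,   # consistency
        "SBR": cpv / hpv,  # setback ratio
        "ST": hpv / pv,    # stability
    }


# Hypothetical readings, all in the same viscosity unit (e.g. cP)
params = rva_secondary(pv=3000.0, hpv=1800.0, cpv=3300.0)
for name, value in params.items():
    print(f"{name}: {value:.3f}")
```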

Industrial processing parameters, such as husked milling yield, milled milling yield and industrial milling yield, can positively or negatively influence the acceptability of rice to the industry and can also affect its commercial value. Rice yield and milling quality determine the economic value of rice from the field to the mill and in the industrial market. The commercial quality of rice depends on several parameters that are evaluated separately and often involve several time-consuming experimental procedures. Some parameters are related to biochemical or biological properties that allow them to be determined or predicted more easily. Milling quality aspects affected by temperature during rice ripening include chalkiness, immature kernels, kernel dimensions, fissuring, protein content, amylose content, and amylopectin chain length [10]. The rice milling process consists of dehusking the paddy, which yields brown rice, and removing the bran from the kernel by polishing the brown rice to yield white rice. The milling quality of rice determines the yield and appearance of the rice after the milling process.

### **1.2 Near-infrared spectroscopy**

Beer's law is generally applied in analytical spectroscopy to correlate the concentrations of standard samples with the corresponding analyte absorbances, typically at the wavelength of maximum absorbance (λmax), in order to build a calibration curve that is later used to determine the analyte concentration in unknown samples. Variation in other wavelength/wavenumber regions is often not considered, but it contains significant information that may be selected to represent analyte absorption fingerprints and spectral profiles for pattern recognition and/or quantification of analytes in unknown samples.
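The univariate calibration described above amounts to fitting a straight line to absorbance versus concentration and inverting it for unknown samples. The sketch below illustrates this with NumPy; the standard concentrations, absorbance readings and the helper name `predict_concentration` are hypothetical values chosen for the example, not data from this chapter.

```python
import numpy as np

# Hypothetical standards: concentration (mg/mL) and absorbance at lambda-max
conc = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
absorbance = np.array([0.00, 0.11, 0.20, 0.31, 0.40])

# Least-squares calibration line: absorbance = slope * conc + intercept
slope, intercept = np.polyfit(conc, absorbance, deg=1)


def predict_concentration(a: float) -> float:
    """Invert the calibration line to estimate concentration from absorbance."""
    return (a - intercept) / slope


unknown = predict_concentration(0.25)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, unknown={unknown:.3f} mg/mL")
```

This uses a single wavelength only, which is exactly the limitation that the multivariate methods discussed later are meant to overcome.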

Analytical infrared spectroscopy focuses on the absorption or reflection of electromagnetic radiation, and the IR range can be divided into three regions: near IR (NIR) in the 12,000–4000 cm−1 region, mid IR (MIR) in the 4000–400 cm−1 region, and far IR (FIR) below 400 cm−1 (**Figure 3**). MIR spectroscopy (4000–400 cm−1) is a well-recognized and reliable method through which different compounds can be identified and quantified. It is used for biological applications and includes the so-called fingerprint regions representative of lipids, proteins, amide I/II, carbohydrates, and nucleic acids (**Figure 3**). FIR spectroscopy (400–20 cm−1) provides information on highly ordered structures, such as fibrillar formation, and on protein dynamics [35], since it is more sensitive than MIR to the vibrations of peptide skeletons and hydrogen bonds [36]. NIR, also known as "far-visible spectroscopy" or "overtone vibrational spectroscopy", can measure the chemical composition of biological materials using the diffuse reflectance or transmittance of the sample at several wavelengths [37]. The NIR spectrum, from 12,000 to 4000 cm−1, lies between the visible and mid-infrared regions of the electromagnetic spectrum and is characterized by a number of absorption bands that vary in intensity due to energy absorption by specific functional groups in a sample [38].
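Because instruments report either wavenumbers (cm−1) or wavelengths (nm), it can help to convert between them using λ(nm) = 10⁷ / ν(cm−1). The sketch below does this and assigns a wavenumber to one of the three IR regions listed above; the function names are hypothetical, and the handling of the exact boundary values (4000 and 400 cm−1) is an assumption.

```python
def wavenumber_to_nm(wavenumber_cm1: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm1


def ir_region(wavenumber_cm1: float) -> str:
    """Assign a wavenumber to the NIR, MIR or FIR region as defined in the text."""
    if 4000 <= wavenumber_cm1 <= 12000:
        return "NIR"
    if 400 <= wavenumber_cm1 < 4000:
        return "MIR"
    if 20 <= wavenumber_cm1 < 400:
        return "FIR"
    return "outside IR ranges considered here"


for nu in (12000, 4000, 1650, 100):
    print(nu, "cm-1 =", round(wavenumber_to_nm(nu), 1), "nm,", ir_region(nu))
```

With this conversion, the NIR limits of 12,000 and 4000 cm−1 correspond to roughly 833 nm and 2500 nm.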

NIR is a spectroscopic technique used to study hydrogen bonding because it probes the overtones and combinations of the molecule's vibrational modes, principally those involving hydrogen. NIR spectroscopy can measure the concentration of components with different molecular compositions, such as protein, water, or starch [39]. The chemical bonds present in food and crop components such as fats, water, and carbohydrates are easily detected by NIR spectroscopy because the radiation is specific to the groups of interest, such as N-H, C-H, and O-H bonds. Due to the macromolecular complexity of the rice sample, it is normal for these bands to overlap one another.

**Figure 3.**

*Infrared spectral region (adapted from Balan et al. [34]).*

Transmission and reflection are the two major modes of NIR spectroscopy, and the choice between them depends on the physical state of the sample. Transmission modes are more suitable for liquids, thin solids, and thick solids, for example when inspecting a food item for ripeness, pests or defects. Reflectance mode, on the other hand, is applied to measure the content of constituents such as lipids, starch, amylose, protein, moisture, and oil in whole grains. Low reflectivity means that energy diffuses readily beneath the surface of most samples, including visually opaque samples. Low absorptivity means that NIR light energy easily penetrates the samples without fast attenuation [40]. This technique is extensively used in breeding procedures for the quality improvement of cereals, as well as in crop management, receivable testing, and on-line process control [41, 42].

The NIR methodology presents several advantages: it requires no sample preparation or pre-treatment, no dangerous reagents or solvents, and creates no disposal problem. These advantages can eliminate sampling errors caused by manual sample handling and reagent contamination. The samples can also be used in additional studies, and the analysis can be carried out by technically untrained personnel. Moreover, NIR analysis yields a set of spectra, acquired simultaneously over a certain range of wavelengths, which may serve as the basis for developing specific calibration curves for each analyte. In the calibration process, the spectra are transformed during modelling by chemometric techniques, which use a representative training set to teach the program to discriminate the slight differences that exist in the specific spectra of the samples [43]. A single spectrum can be subjected to many different calibration models to measure any number of constituents.

Different techniques, such as machine vision and Visible/Near-Infrared spectroscopy, have been developed and applied to identify and characterize rice varieties and evaluate their biochemical characteristics. Traditional techniques used for rice variety evaluation, such as High-pressure Liquid Chromatography (HPLC) or Gas chromatography-mass spectrometry (GC-MS), are time-consuming and hard to apply [44]. Compared to the traditional analysis methods, NIR spectroscopy offers many advantages: it is easy to use, fast and accurate, works in real time, gives highly reproducible results, requires no sample preparation, is non-destructive, and allows multiple components to be analysed with a single measurement. It is therefore widely used in the measurement of agricultural and food products [45, 46].

### **1.3 Spectral pre-processing techniques**

Over the years, several multivariate regression methods have been developed to extract significant information from spectral data, due in part to the limitations of univariate spectral analysis. The processing of spectral data for chemical analysis usually draws on statistics and advanced mathematics in the form of multivariate regression. Several wavelengths or wavenumbers can be investigated simultaneously for biochemical analysis through multivariate regression techniques, as these allow different sample components to be analysed without the need for spectral resolution or spectral deconvolution. Pre-processing methods reduce the noise in spectral data and remove the non-informative variability present in the spectra. Data pre-processing techniques such as standard normal variate (SNV), multiplicative scatter correction (MSC) and smoothing derivatives are required for raw NIR spectra to enable proper qualitative classification and the development of quantitative calibration models. MSC is used to compensate for particle size effects, as it rotates the spectra to remove part of that effect, bringing each spectrum as close to the average spectrum as possible [47]. The first and second derivatives are calculated according to the Savitzky–Golay approach using a 19-point window and a 2nd or 3rd order polynomial, which helps to remove effects such as baseline drift [48–50] (**Figure 4**).

#### **Figure 4.**

*Rice NIR spectra: (a) without treatment; and after pre-processing by (b) baseline correction and (c) first derivative. (Adapted from Sampaio et al. [51]).*
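The three pre-processing steps named in this section (SNV, MSC and Savitzky–Golay derivatives) can be sketched in a few lines. The code below is a minimal illustration rather than the procedure used in the cited studies: it assumes spectra are stored row-wise in a NumPy array, uses the mean spectrum as the MSC reference, relies on `scipy.signal.savgol_filter` for the derivative with the 19-point window and 2nd order polynomial mentioned above, and runs on synthetic spectra that differ only by a scale and an offset.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        # Regress each spectrum on the reference: s ~ intercept + slope * ref
        slope, intercept = np.polyfit(ref, s, deg=1)
        corrected[i] = (s - intercept) / slope
    return corrected


def sg_derivative(spectra, deriv=1, window=19, polyorder=2):
    """Savitzky-Golay smoothed derivative along the wavelength axis."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)


# Synthetic example: one band shape with different multiplicative and additive effects
x = np.linspace(0.0, 1.0, 200)
band = np.exp(-((x - 0.5) ** 2) / 0.02)
raw = np.array([a * band + b for a, b in [(1.0, 0.0), (1.3, 0.2), (0.8, -0.1)]])

raw_snv = snv(raw)
raw_msc = msc(raw)
raw_d1 = sg_derivative(raw, deriv=1)
print(raw_snv.shape, raw_msc.shape, raw_d1.shape)
```

On this idealized data, both SNV and MSC collapse the three spectra onto a single curve, which is the behaviour these corrections aim for when scatter is the only source of difference.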

### **1.4 Machine learning methods**

Machine learning is one of the most promising technologies in the field of artificial intelligence. It involves the use of algorithms that allow machines to learn by imitating the way humans learn. Machine learning based on experimental data makes it possible to optimize grouping or classification and to develop models that predict the behavior or properties of systems. There are two main types of machine learning: supervised and unsupervised. Supervised machine learning uses algorithms that "learn" from data that have been labeled beforehand by a person. The algorithm generates the expected output provided that the input has previously been labeled. Two types of task can be addressed: (a) classification, which assigns an object to one of several classes, for example determining the type of rice according to its physical characteristics; (b) regression, which predicts a numerical value, such as the concentration of a biochemical parameter (protein, lipids, carbohydrates, etc.). Supervised learning consists of learning a function from training examples, based on their attributes (inputs) and labels (outputs). In unsupervised machine learning, unlike the previous case, there is no human intervention: the algorithms learn from data with unlabeled elements, looking for patterns among them. In this case two types of algorithm have been developed: (a) clustering, which sorts the data into groups according to their similarity; (b) association, in which the algorithm discovers rules within the data set. In semi-supervised learning, both labeled and unlabeled data are used for training, usually with only a small amount of labeled data but a large amount of unlabeled data. In reinforcement learning, no labeled examples are provided; instead, the learning system receives some sort of reward after each action, and the goal is to maximize the cumulative reward for the whole process.
The most widely recognized machine learning methods are: Principal Component Analysis (PCA), the most basic unsupervised feature extraction technique, based on the analysis of the variance of features within the full spectrum; unsupervised clustering methods, used to identify biological subtypes within a sample, such as Hierarchical Cluster Analysis (HCA); and supervised methods such as k-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), discriminant analysis (DA), Partial Least-Squares-Discriminant Analysis (PLS-DA), Partial Least-Squares (PLS), and Support Vector Machines (SVM).

#### *1.4.1 Principal Component Analysis*

Principal Component Analysis (PCA) is an unsupervised technique that reduces the dimensionality of multivariate data to *n* principal components that preserve as much of the variance of the initial data as possible in the lower-dimensional output [52]. The large number of original variables is transformed into a reduced number of uncorrelated variables called principal components (PCs), where each component is a linear combination of the original variables and the total number of PCs can be as large as the number of original variables. The first PCs explain most of the variance in the sample data, which allows the data size to be reduced. PCA can reveal the variables that determine some inherent structure of the data, which can be interpreted in chemical or physicochemical terms. The scatter plot of the PC1 and PC2 scores shows the directions that account for most of the variability between samples and contain information from the entire spectrum. PCA has been coupled with Mahalanobis distances to reduce dimensionality before carrying out discriminant analysis [53]. Plots of PCs *versus* each other show how the variables that they account for are related. To see how samples cluster together, it is important to determine a set of scaling coefficients, the scores. The scores for each factor can be evaluated for every spectrum in the training set. The original spectra are reconstructed when the scores are multiplied by the loading vectors and the results summed. In this way, given the set of loading vectors, the scores represent the spectra with close to the precision of the original responses at all wavelengths. PCA avoids the overfitting that results from selecting too many wavelengths. This pattern recognition method is used to determine the Mahalanobis distances, which are expressed in units of standard deviations from the center (mean) of the training set cluster. Cross-validation is one method employed to evaluate the suitable number of factors. To perform this evaluation, each sample in the calibration set is removed in turn and the remaining samples are used to build a Mahalanobis matrix for one, two, three factors, and so on. The excluded sample is then predicted using the models developed for Mahalanobis grouping. The excluded sample is then put back into the calibration set, and a new sample is removed. The process continues until every sample has been removed from the calibration set and predicted. This represents an advantage of cross-validation over other methods, since the predicted samples are not the same as those used to define the model.
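The relationship between scores, loading vectors and the reconstructed spectra can be made concrete with a short NumPy sketch. It computes PCA through the singular value decomposition of the mean-centred data; the function name `pca` and the synthetic data set (30 samples, 50 variables, generated from two latent factors plus a little noise) are assumptions made for this illustration.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD of the mean-centred data matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]          # one loading vector per row
    scores = Xc @ loadings.T              # coordinates of each sample on the PCs
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per PC
    return scores, loadings, mean, explained[:n_components]


# Synthetic "spectra" driven by two underlying factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(30, 50))

scores, loadings, mean, explained = pca(X, n_components=2)

# Reconstruction: scores multiplied by the loading vectors, plus the mean spectrum
X_hat = scores @ loadings + mean
print("explained variance (PC1, PC2):", np.round(explained, 4))
print("max reconstruction error:", np.abs(X - X_hat).max())
```

Here two components are enough because the data were built from two factors; with real spectra the number of components would typically be chosen by cross-validation, as described above.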

#### *1.4.2 Discriminant Analysis*

Discriminant analysis is a strategy that has been used successfully for qualitative analysis and is also called pattern recognition. This methodology aims to classify samples into well-defined groups according to their similarity to a "training set", despite limited knowledge of the composition of the members of each group. Johnson and Wichern [54] described how discriminant analysis uses several variables together to solve the grouping problem. The development of calibration models in discriminant analysis is based on two methods: Mahalanobis distances, considered the unit distance vector in multidimensional space, and PCA coupled with Mahalanobis distances [54, 55]. The Mahalanobis distance can be described by an ellipsoid in multidimensional space that circumscribes the data. This method is based on a matrix that is the inverse of the matrix formed by pooling the within-group covariance matrices of all groups, which combines information from all the different materials of interest in a single matrix. Dunmire and Williams considered the Mahalanobis distance as the mathematical quantity that defines the position, size and shape of the ellipsoid for all clusters [38]. From a statistical perspective, the Mahalanobis distance takes the sample variability into account, whereas the Euclidean distance does not consider the variability of values in all dimensions. The Mahalanobis distance looks not only at the variation between the responses at the same wavelengths, but also at the inter-wavelength variations. Instead of treating all values equally when calculating the distance from the mean point, it weights the differences by the range of variability in the direction of the sample point. The location of each cluster in multidimensional space is defined by the mean value of the absorbances (the group mean) at each wavelength.
Dunmire and Williams indicated that a sample can be classified clearly if it falls within three Mahalanobis distances of its own group centroid and at least six Mahalanobis distances from the ellipses of other groups [38]. The Mahalanobis distance is a multidimensional distance *D* defined by the following matrix equation (Eq. (1)) [55]:

$$D^2 = \left(\mathbf{x} - \mathbf{x}'\right)^{\mathsf{T}}\mathbf{M}\left(\mathbf{x} - \mathbf{x}'\right) \tag{1}$$

where *x* represents a vector related to optical readings at several wavelengths which describes the position in multidimensional space corresponding to the spectrum of a given sample, *x'* is a vector that represents the position of a reference point in space, while *M* is the pooled inverse covariance matrix describing distance measures in the multidimensional space.
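As an illustration of Eq. (1), the sketch below builds the pooled inverse covariance matrix *M* from two groups and assigns a new sample to the group whose centroid is nearest in Mahalanobis terms. The function names and the two-variable synthetic groups are hypothetical; with real spectra the number of wavelengths usually exceeds the number of samples, so the covariance matrix would first need a dimensionality reduction such as PCA to be invertible.

```python
import numpy as np


def pooled_inverse_covariance(groups):
    """Inverse of the pooled within-group covariance; groups is a list of (n_i, p) arrays."""
    p = groups[0].shape[1]
    scatter = np.zeros((p, p))
    dof = 0
    for g in groups:
        centred = g - g.mean(axis=0)
        scatter += centred.T @ centred
        dof += g.shape[0] - 1
    return np.linalg.inv(scatter / dof)


def mahalanobis_sq(x, centre, M):
    """Squared Mahalanobis distance D^2 = (x - x')^T M (x - x')."""
    d = np.asarray(x, dtype=float) - np.asarray(centre, dtype=float)
    return float(d @ M @ d)


# Two synthetic groups described by two variables each
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(40, 2))
M = pooled_inverse_covariance([group_a, group_b])

sample = np.array([0.2, -0.1])
d2_a = mahalanobis_sq(sample, group_a.mean(axis=0), M)
d2_b = mahalanobis_sq(sample, group_b.mean(axis=0), M)
predicted = "A" if d2_a < d2_b else "B"
print(f"D^2 to A = {d2_a:.2f}, D^2 to B = {d2_b:.2f}, assigned to group {predicted}")
```

When *M* is the identity matrix, the expression reduces to the squared Euclidean distance, which makes the weighting role of *M* explicit.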

#### *1.4.3 Partial Least Squares-Discriminant Analysis*

Partial Least Squares-Discriminant Analysis (PLS-DA) is a linear classification method that builds predictive models with the partial least squares regression algorithm, which searches for latent variables with maximum covariance with the response. These latent variables represent the significant sources of data variability as linear combinations of the original variables. PLS-DA is an example of a machine learning tool applied to conduct a global cellular analysis of bioprocesses as an exploratory technique, and it is gaining increasing attention as a useful feature selector and classifier [56–60]. Multivariate classification methods aim to find mathematical models able to recognize the membership of each sample in its appropriate class from a set of measurements. PLS-DA has shown promising results in the detection of food adulteration without identifying specific compounds [61]. PLS-DA is a discriminant classifier that is particularly suitable for handling correlated features (e.g., spectroscopic variables). The predicted value is a real number rather than a dummy integer. Thus, a cut-off value needs to be set to determine which class the sample belongs to. PLS-DA is computed on the basis of full cross-validation methods. More specifically, a predictor block is used to estimate (by PLS) a binary response called dummy Y (a binary response matrix encoding class membership). Mathematically, the regression relation between the data matrix X and the dummy vector y for a two-class case is given by the model in Eq. (2):

$$\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{2}$$

where *ŷ*, *b*, and *e* represent, respectively, the vectors of predicted responses, regression coefficients, and residuals. When new samples (test set) need to be classified, their predicted responses, *ŷnew*, are calculated from the measurements, *Xnew*, and the regression coefficients, *b*, estimated on the training set, and the classification rule is then applied to assign each individual to one of the categories under study.
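The two-class procedure around Eq. (2) can be sketched as follows: regress a 0/1 dummy response on the spectra with PLS, then apply a cut-off to the continuous prediction. This is a minimal sketch under stated assumptions, not the implementation used in the cited works: it uses a basic single-response NIPALS algorithm written in NumPy, a cut-off of 0.5, two latent variables, and synthetic spectra in which the second class carries one extra band. All function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components=2):
    """Fit single-response PLS (NIPALS); returns coefficients b and an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centred data and response
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)          # weight vector (max covariance with f)
        t = E @ w                          # scores of the latent variable
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        c = (f @ t) / tt                   # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # regression coefficients in Eq. (2)
    return b, y_mean - x_mean @ b


def plsda_predict(X_new, b, intercept, cutoff=0.5):
    """Continuous prediction y_hat and class label obtained with the cut-off."""
    y_hat = np.asarray(X_new, dtype=float) @ b + intercept
    return y_hat, (y_hat >= cutoff).astype(int)


def make_spectra(n_per_class, rng):
    """Synthetic two-class spectra; class 1 has an additional band."""
    x = np.linspace(0.0, 1.0, 30)
    base = np.sin(2 * np.pi * x)
    peak = np.exp(-((x - 0.3) ** 2) / 0.01)
    X0 = base + 0.05 * rng.normal(size=(n_per_class, 30))
    X1 = base + 0.5 * peak + 0.05 * rng.normal(size=(n_per_class, 30))
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)


rng = np.random.default_rng(0)
X_train, y_train = make_spectra(20, rng)
X_test, y_test = make_spectra(10, rng)

b, intercept = pls1_fit(X_train, y_train, n_components=2)
y_hat, y_pred = plsda_predict(X_test, b, intercept)
accuracy = float(np.mean(y_pred == y_test))
print("test accuracy:", accuracy)
```

In practice the number of latent variables and the cut-off would be tuned by cross-validation rather than fixed in advance.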

#### *1.4.4 Support Vector Machine*

Support Vector Machine (SVM) is a widely used supervised statistical learning algorithm for classification and regression analysis. It is considered a nonlinear classification technique because it produces linear boundaries between groups of objects in a transformed space of the *x*-variables [62–64]. SVM has previously been used to detect and quantify milk adulteration by mid-infrared spectrometry [64] and to identify rice seed cultivars [65]. SVM shows advantages in dealing with small-sample, non-linear and high-dimensional data. The model performance depends on the selection of the kernel function, and the Radial Basis Function (RBF) is the one most commonly used. The regularization parameter *c* controls the trade-off between the minimum training error and the minimum model complexity, and therefore reflects the degree of generalization, while the kernel parameter *g* represents the width of the kernel function. Both parameters are determined by a grid-search procedure.
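The grid search over the two SVM parameters can be sketched with scikit-learn, where *c* and *g* correspond to the `C` and `gamma` arguments of `SVC`. The example below is only an illustration under stated assumptions: it uses a synthetic two-class data set (`make_moons`) instead of rice spectra, an arbitrary grid of candidate values, and 5-fold cross-validation to score each combination.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, non-linearly separable two-class data (stand-in for real spectra)
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the variables, then fit an SVM with an RBF kernel
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Candidate values for c (C) and g (gamma); the grid itself is an assumption
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("test accuracy:", round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means it is refitted on each cross-validation fold, so the held-out fold never influences the scaling.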
