
Figure 1. Raw (a) and derivative (b) NIR spectra of the 70 samples of biodiesel mixtures.

4.1. Database acquisition and preprocessing

192 Advanced Applications for Artificial Neural Networks

Biodiesels from soybean, corn, palm and babassu (an oleaginous plant abundant in the Northeast of Brazil) were synthesized via transesterification by the methylic route with homogeneous alkaline catalysis and used to prepare 70 binary, ternary and quaternary mixtures (volumetric fractions) designed by simplex-lattice and simplex-centroid designs.

The oxidative stabilities of the samples were determined by the method EN 14112:2003 [33] using a Rancimat equipment Metrohm model 873. The average of two measurements for each sample was taken. The oxidative stabilities of the mixtures ranged from 4.81 to 25.47 h.

The spectra were acquired using a Fourier transform NIR spectrometer PerkinElmer model Frontier™ with a near infrared reflectance accessory (NIRA), equipped with a fast recovery deuterated triglycine sulfate (FR-DTGS) detector. All spectra were recorded with an average of 16 scans and a spectral resolution of 2 cm<sup>-1</sup>. The measured wavenumber range was 4000–12,000 cm<sup>-1</sup>, but the working range was restricted to 4000–6100 cm<sup>-1</sup> because of the non-informative signal (close to the baseline) and the increase of noise as the wavenumber approaches 12,000 cm<sup>-1</sup>.

The raw spectra (Figure 1a) showed bands characteristic of the first overtone of C–H stretching (5550–6100 cm<sup>-1</sup>) and of the combination of C–H and C=O stretching modes (4640–4700 cm<sup>-1</sup>) [34]. The bands around 4262 and 4334 cm<sup>-1</sup> can be associated with the second overtone of C–H bending and with the combination of C–H and C=C stretching modes, respectively [35].

To correct baseline deviations of the spectra caused by systematic variations, the first derivative was calculated with the Savitzky-Golay filter [36] using a 15-point quadratic smoothing function. The window size used to fit the polynomial of the Savitzky-Golay filter depends on how noisy the spectra are; in this case, a 15-point window was enough to smooth the spectral noise. The derivative NIR spectra of the full database can be seen in Figure 1b.

After applying the Savitzky-Golay filter, the spectra were mean-centered and then used as input data (X-matrix) consisting of 1051 variables, and the raw oxidative stabilities (h) were used as the output variable (response, y-vector). From this point on, only the preprocessed data were used.
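The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming NumPy and SciPy are available; the spectra here are random placeholders standing in for the real 70 × 1051 NIR matrix, and the derivative scale (`delta`) is left at its default.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
# Placeholder spectra: 70 samples x 1051 wavenumber variables
# (stand-ins for the real NIR measurements, which are not reproduced here).
X_raw = rng.normal(size=(70, 1051))

# First derivative via Savitzky-Golay: 15-point window, quadratic polynomial,
# applied along the wavenumber axis of each spectrum.
X_deriv = savgol_filter(X_raw, window_length=15, polyorder=2, deriv=1, axis=1)

# Mean-centering: subtract the column-wise (per-variable) mean,
# so every variable has zero mean across the 70 samples.
X = X_deriv - X_deriv.mean(axis=0)
```

The filtered matrix keeps the original shape, so the 1051 variables can feed the dimensionality reduction step directly.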

4.2. Steps for construction of the ANN model and selection of sample sets

The construction and validation of models for multivariate classification or calibration go through three basic steps, and each of them requires its own sample set. The first step, training (or calibration), consists of adjusting the model with pairs of inputs and outputs (X, y) provided in the database. The coefficients or weights of the model are adjusted so that the response calculated from the X data is as close as possible to the real (experimental) response. In training, it is important to have samples representative of all the possible X and y variations that real samples can present.

The validation (or internal validation) step helps assess the progress of the optimization and indicates when the model adjustment should be stopped, so it occurs simultaneously with the training step. At the beginning of training, the coefficients and weights are underfit and the errors are large. In the course of training, the errors decrease as the coefficients are adjusted, until the model begins to fit even the natural noise coming from systematic errors of the experimental measurements. At this stage, the so-called overfitting occurs and the model will not be able to predict or classify external samples accurately, even though the training and validation errors are small. Therefore, the aim of internal validation is to aid in choosing the number of neurons and hidden layers so as to balance underfitting and overfitting.

The last step is the test (or external validation), in which samples that were not used during training or internal validation are estimated or classified by the optimized model, simulating a real application. The neural networks learn from the past (training samples) to estimate future cases (test samples).

There are some methods for selection of samples for training, validation and test. In this case study, the SPXY method (Sample set Partitioning based on joint x-y distances) was used, which selects samples based on the variability of both the input (NIR spectra) and output (oxidative stability) variables [32]. The test set consisted of 30% of the database (21 samples), while the 49 remaining samples were split into training (39 samples) and validation (10 samples) sets.
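A simplified version of the SPXY selection can be sketched as below. It is a NumPy implementation of the core idea (Kennard-Stone-style selection on max-normalized joint X and y distances), not the reference implementation from [32]; the demonstration data are random placeholders.

```python
import numpy as np

def spxy_select(X, y, n_select):
    """Simplified SPXY: select samples by Kennard-Stone on the joint,
    max-normalised Euclidean X-distance plus absolute y-distance."""
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dy = np.abs(y[:, None] - y[None, :])
    d = dx / dx.max() + dy / dy.max()

    # Start from the most distant pair, then repeatedly add the sample
    # whose minimum distance to the already-selected set is largest.
    selected = list(np.unravel_index(np.argmax(d), d.shape))
    remaining = [i for i in range(len(y)) if i not in selected]
    while len(selected) < n_select:
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(min_d))))
    return np.array(selected)

# Example: pick 49 of 70 samples for training + validation; the remaining
# 21 form the external test set, mirroring the split in this case study.
rng = np.random.default_rng(1)
X_demo, y_demo = rng.normal(size=(70, 10)), rng.normal(size=70)
train_val_idx = spxy_select(X_demo, y_demo, 49)
test_idx = np.setdiff1d(np.arange(70), train_val_idx)
```

Because the combined distance weighs X and y variability equally, both the spectral diversity and the oxidative-stability range end up represented in the training set.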

4.3. Dimensionality reduction and ANN configuration

As the X-matrix is composed of 1051 variables, it is necessary to apply a method for dimensionality reduction before the training of the neural networks. Otherwise, the modeling would consider too much noise and, because of the large number of input neurons, the ANNs would take too long to converge.

Partial least squares (PLS) regression was used for dimensionality reduction. The number of latent variables (LVs) was optimized by full cross-validation. Four LVs explained 99.15% of the X-variance and 82.85% of the y-variance.

The feedforward MLP ANNs were trained with the backpropagation algorithm using a fixed learning rate (0.125) to minimize the RMSEC. The input layer is formed by four neurons receiving the four LVs, and the output layer consists of one neuron (oxidative stability). The number of neurons in the first hidden layer ranged from 1 to 20 and, in the second hidden layer, from 1 to 10. A topology with only one hidden layer was also tested. The hyperbolic tangent (tanh, Eq. (10)) and purelin (Eq. (11)) functions were used as the activation (or transfer) functions of the hidden and output layers, respectively.

$$f(\mathbf{x}) = \tanh(\mathbf{x}) = \frac{e^{\mathbf{x}} - e^{-\mathbf{x}}}{e^{\mathbf{x}} + e^{-\mathbf{x}}} \tag{10}$$

$$f(\mathbf{x}) = \mathbf{x} \tag{11}$$
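Equations (10) and (11), together with a forward pass through the MLP 4-3-1-1 topology discussed in the results, can be written out as below. The weights are random placeholders, not trained values.

```python
import numpy as np

def tanh_act(x):
    """Hyperbolic tangent activation, Eq. (10): (e^x - e^-x)/(e^x + e^-x)."""
    return np.tanh(x)

def purelin(x):
    """Linear (identity) activation, Eq. (11)."""
    return x

# Forward pass of an MLP 4-3-1-1: tanh in both hidden layers,
# purelin in the output layer.
rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # four LV inputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
W3, b3 = rng.normal(size=(1, 1)), rng.normal(size=1)
h1 = tanh_act(W1 @ x + b1)                   # first hidden layer (3 neurons)
h2 = tanh_act(W2 @ h1 + b2)                  # second hidden layer (1 neuron)
y_hat = purelin(W3 @ h2 + b3)                # output: oxidative stability
```

The linear output layer leaves the predicted response unbounded, which suits a regression target such as oxidative stability in hours.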

4.4. Results and discussion

As validation is the step that helps assess and choose the best fit under the varying conditions of the optimization, the RMSEV was the criterion used to choose the best number of hidden layers and neurons. The RMSEVs of the biodiesel oxidative stabilities (h) predicted by ANNs with different numbers of neurons in the hidden layers are shown in Figure 2, illustrating the dependence of the RMSEV on the ANN topology.

Figure 2. Dependence of the RMSEV of the biodiesel oxidative stability predicted by ANNs with different numbers of neurons in first and second hidden layers.

The RMSEVs when the number of neurons in the second hidden layer is zero correspond to the topologies having only one hidden layer (the first one, varying from 1 to 20 neurons). These results presented high RMSEVs that did not vary much with the number of hidden neurons, evidencing the convergence difficulty of the ANNs with only one hidden layer. Hence, a second layer was added to the optimization process.
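The topology scan can be sketched with scikit-learn's `MLPRegressor` (an assumption; the chapter does not state its software), whose identity output activation matches purelin and whose `sgd` solver with a constant rate approximates the fixed-learning-rate backpropagation used here. Data are placeholders, and the loop ranges are reduced to keep the sketch quick; the chapter scanned 1–20 and 0–10 neurons.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Placeholder PLS scores (four LVs) and responses for the training
# and validation splits.
X_train, y_train = rng.normal(size=(39, 4)), rng.normal(size=39)
X_val, y_val = rng.normal(size=(10, 4)), rng.normal(size=10)

results = {}
for n1 in range(1, 7):             # neurons in the first hidden layer
    for n2 in range(0, 4):         # 0 = single-hidden-layer topology
        layers = (n1,) if n2 == 0 else (n1, n2)
        net = MLPRegressor(hidden_layer_sizes=layers, activation='tanh',
                           solver='sgd', learning_rate='constant',
                           learning_rate_init=0.125, max_iter=300,
                           random_state=0)
        net.fit(X_train, y_train)
        rmsev = np.sqrt(np.mean((y_val - net.predict(X_val)) ** 2))
        results[(n1, n2)] = rmsev

# Candidate topology: smallest validation error (RMSEV).
best = min(results, key=results.get)
```

In the chapter, ties among low-RMSEV topologies are broken by parsimony, preferring the smallest number of hidden neurons.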


Few neural network topologies presented RMSEV lower than 0.5 h, but the best ones are those represented by the black squares in Figure 2: MLP 4-2-5-1, MLP 4-2-9-1, MLP 4-3-1-1, MLP 4-3-3-1, MLP 4-4-1-1, MLP 4-5-4-1, MLP 4-6-2-1, MLP 4-8-2-1, MLP 4-8-3-1, MLP 4-12-2-1, MLP 4-12-4-1, MLP 4-14-3-1, MLP 4-16-9-1 and MLP 4-17-6-1. In the notation MLP A-B-C-D, A is the number of input neurons (four LVs), B and C are the number of neurons in the first and second hidden layers, respectively, and D is the number of output neurons (one, oxidative stability).

The 14 best topologies mentioned above had RMSEV less than or equal to 0.43 h. To choose among them, the smaller number of neurons is preferred (principle of parsimony: the simpler, the better). Therefore, the topology MLP 4-3-1-1 was selected to present further results and predict the oxidative stability of the test samples, although the topologies MLP 4-3-3-1 and MLP 4-4-1-1 should provide similar results.

Figure 3. Scatter plot of the biodiesel oxidative stability values predicted by the ANN MLP 4-3-1-1 against the actual values measured by the method EN 14112.

The evaluation parameters calculated for the ANN MLP 4-3-1-1 can be verified in Table 4. These parameters can be interpreted as in Section 3.2.

The most important parameters for evaluating the optimized model are those obtained for the test dataset, since these samples simulate a real application with data used neither to build nor to optimize the model. The RMSEP was 0.67 h and the MAPE for the test samples was 6.89%, which means that the predicted oxidative stabilities of real samples differed on average by 0.67 h from the actual values, a deviation of about 6.89% relative to the actual values.

Still for the test samples, the correlation coefficient was 0.9769, indicating a high correlation between the actual and predicted values of oxidative stability. The determination coefficient was also high, meaning that the ANN MLP 4-3-1-1 explained 95.44% of the total data variance, while the prediction errors represent 4.56% of the total variance.
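These evaluation parameters follow directly from the measured and predicted values; a minimal NumPy sketch of their definitions (the arrays in any real use would be the actual and ANN-predicted oxidative stabilities):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return RMSE, MAPE (%), correlation coefficient r and the
    determination coefficient R^2 for a set of predictions."""
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))                      # e.g. RMSEP on test data
    mape = 100.0 * np.mean(np.abs(resid / y_true))           # mean absolute % error
    r = np.corrcoef(y_true, y_pred)[0, 1]                    # correlation coefficient
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mape, r, r2

# Tiny worked example: predictions offset by a constant 1 h
# give an RMSE of exactly 1 h.
y_meas = np.array([5.0, 10.0, 20.0])
rmse, mape, r, r2 = evaluate(y_meas, y_meas + 1.0)
```

Note that R^2 computed this way is the fraction of total variance explained, so the prediction errors account for the remaining 1 − R^2.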

The correlation plot for samples of all the three steps can be seen in Figure 3, in which the samples are well distributed along the line, especially the validation and test samples, leading to correlation coefficients higher than 0.96 for the three steps.

In the residual plot (Figure 4), it is important to have approximately the same number of samples with positive and negative residuals, and the closer the points are to the central line (y = 0), the smaller the RMSEs. In this case study, most samples had residuals lower than 1.5 h, well divided between positive and negative values. The largest residuals belong to the training samples, which indeed had the highest RMSE (1.31 h).

Figure 4. Residual plot of the biodiesel oxidative stability values predicted by the ANN MLP 4-3-1-1.
