**3. Methodology**

#### **3.1 Random forest**

Random Forest (RF) is an ensemble machine learning technique, introduced by Leo Breiman [14], that is driven by the development of a large number of decision trees. Unlike a DT, which uses all the features to construct a tree-like classification graph, RF uses an efficient bagging learning algorithm that integrates random feature selection with bagging. If one or a few features are very strong predictors of the target, random feature selection prevents this small subset of features from dominating every tree. Each tree is fitted to a sample drawn with replacement from the training data, known as a bootstrap sample, and the fitted trees are then combined by voting. RF improves reliability and precision, reduces uncertainty, and helps avoid overfitting.

An appropriate number of trees is determined by bootstrap aggregation, or bagging, according to the size and nature of the training set. By averaging the predictions of the individual regression trees, the RF prediction can be expressed as:

$$\hat{\mathbf{g}}(\mathbf{x}) = \frac{1}{N} \sum\_{n=1}^{N} \mathbf{g}\_n(\mathbf{x}) \tag{1}$$


where $\hat{g}(\mathbf{x})$ represents the RF prediction from the total of $N$ trees, and $g_n(\mathbf{x})$ denotes the prediction of each individual tree for the input $\mathbf{x}$. In addition, the uncertainty of the prediction can be approximated by the standard deviation of the predictions from all the trees, which can be expressed as:

$$\sigma = \sqrt{\frac{\sum\_{n=1}^{N} \left(\mathbf{g}\_n(\mathbf{x}) - \hat{\mathbf{g}}(\mathbf{x})\right)^2}{N - 1}} \tag{2}$$
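As a concrete illustration of Eqs. (1) and (2), the following sketch averages the per-tree predictions and takes their sample standard deviation. It uses scikit-learn's `RandomForestRegressor`; the synthetic data and variable names are placeholders, not the chapter's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the liquefaction dataset (placeholder values)
rng = np.random.default_rng(42)
X = rng.random((100, 5))   # 100 cases, 5 input parameters
y = rng.random(100)        # settlement values

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Per-tree predictions g_n(x) for a single query point x
x_query = X[:1]
tree_preds = np.array([tree.predict(x_query)[0] for tree in rf.estimators_])

g_hat = tree_preds.mean()        # Eq. (1): average over the N trees
sigma = tree_preds.std(ddof=1)   # Eq. (2): sample standard deviation
print(f"prediction = {g_hat:.3f}, uncertainty = {sigma:.3f}")
```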

**Figure 1.** *Schematic representation of a RF classifier with* N *trees.*

**Figure 1** demonstrates the method of classification with a RF of *N* trees. Starting from the root node (*νn*), samples are moved to the right node (*νR*) or the left node (*νL*) after comparison with certain parameters or threshold values. This partitioning is repeated until a terminal node is reached and a classification tag is assigned (in this case, classes A or B). For a classification task, the ensemble prediction is obtained by a majority voting rule as a combination of the results of the individual trees [15].
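The majority vote described above can be sketched as follows, again with scikit-learn and purely illustrative data: each tree casts one vote, and the most frequent class label becomes the ensemble prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy two-class data (class A = 0, class B = 1); purely illustrative
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each tree casts one vote; the majority label is the ensemble prediction
x_query = X[:1]
votes = np.array([int(tree.predict(x_query)[0]) for tree in clf.estimators_])
majority_class = np.bincount(votes).argmax()
print(f"votes for class B: {votes.sum()}/{len(votes)}, predicted: {majority_class}")
```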

#### **3.2 REP tree**

The reduced error pruning tree (REP Tree) is an ensemble model of the decision tree (DT) and reduced error pruning (REP) algorithms, equally suited to classification and regression problems [16]. The REP Tree algorithm generates a decision/regression tree by splitting and pruning the tree based on the highest information gain ratio (*IGR*) [17]. The *IGR* values are determined via Eq. (3) based on the entropy (*E*) function.

$$IGR(\mathbf{x}, \mathbf{S}) = \frac{E(\mathbf{S}) - \sum\_{i=1}^{n} \frac{E(S\_i)|S\_i|}{|\mathbf{S}|}}{-\sum\_{i=1}^{n} \frac{|S\_i|}{|\mathbf{S}|} \log\_2 \frac{|S\_i|}{|\mathbf{S}|}} \tag{3}$$

The *IGR* considers all the predictors of liquefaction-induced settlement, with subsets *Si* (*i* = 1, 2, …, *n*) taken from the training dataset (*S*) over successive pruning steps. Since complex decision trees can result in a model that is overfitted and less interpretable, REP reduces complexity by removing leaves and branches from the DT structure [16, 18–20].
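For clarity, here is a minimal sketch of Eq. (3) for a single candidate split. The function and variable names are hypothetical, and the labels are toy values.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy E(S) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_ratio(labels, groups):
    """Eq. (3): information gain of a split divided by its split information."""
    n = len(labels)
    weights = np.array([len(g) / n for g in groups])
    gain = entropy(labels) - sum(w * entropy(g) for w, g in zip(weights, groups))
    split_info = -np.sum(weights * np.log2(weights))
    return gain / split_info

# Example: a split of 10 samples into two subsets S_1 and S_2
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])
groups = [labels[:4], labels[4:]]
print(f"IGR = {info_gain_ratio(labels, groups):.3f}")
```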

**4. Liquefaction-induced settlement model development**

**4.1 Preparing training and testing datasets**

The manner in which data are divided into training and test datasets in data mining procedures has a substantial effect on the results [21–23]. The dataset was split to assess the generalization efficiency and predictive ability of the developed models. The statistical parameters of the input variables, including the minimum, maximum, mean, and standard deviation of the training and test datasets, are shown in **Table 2**. As **Table 2** shows, the ranges of the input and output parameters in the testing dataset generally fall within those of the training dataset. This statistical consistency between the training and testing datasets enhances the performance of the developed models and thus helps to properly assess them. The comparable performance on the training and testing datasets suggests that the developed models can be applied within the trained ranges.
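This preparation step can be sketched with pandas and scikit-learn as below; the column names, values, and the 80/20 split ratio are assumptions for illustration, not values taken from the chapter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: five input parameters and observed settlement
# (column names, values, and the split ratio are illustrative)
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((100, 6)),
                  columns=["x1", "x2", "x3", "x4", "x5", "settlement"])
X, y = df.drop(columns=["settlement"]), df["settlement"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Minimum, maximum, mean and standard deviation of each variable (cf. Table 2)
print(X_train.describe().loc[["min", "max", "mean", "std"]])
```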

To ensure comparability, the RF and REP Tree models are developed using the same training and test datasets. Liquefaction-induced settlements are predicted with both models, and a detailed analysis of their performance is then used to identify the optimum model. If the performance of a model on both the training and test datasets is adequate, it can be adopted.
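A comparison along these lines might look like the sketch below, reusing the hypothetical `X_train`/`X_test` split from the previous sketch. Since scikit-learn has no REP Tree (a Weka algorithm), a cost-complexity-pruned `DecisionTreeRegressor` stands in for it here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Continues the X_train/X_test, y_train/y_test split from the previous sketch
models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=1),
    # Stand-in for REP Tree: a cost-complexity-pruned CART regression tree
    "Pruned DT": DecisionTreeRegressor(ccp_alpha=0.01, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r = np.corrcoef(y_test, pred)[0, 1]
    print(f"{name}: MAE={mae:.3f}, RMSE={rmse:.3f}, r={r:.3f}")
```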

**4.2 Evaluation measures**

In this study, three evaluation measures, the mean absolute error (MAE), the root mean square error (RMSE), and the correlation coefficient (*r*), are used to evaluate and compare the performance of the models. These three statistical measures provide useful insights into a prediction model: the MAE is the average of the absolute differences between the values predicted by a model and the actual values, the RMSE is the standard deviation of the prediction errors, and *r* quantifies the strength of the linear relationship between the predicted and actual values.
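The three measures can be written directly in NumPy, as in this short sketch; the settlement arrays are placeholders.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE, RMSE and correlation coefficient r for a set of predictions."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return mae, rmse, r

# Placeholder measured vs. predicted settlements
y_true = np.array([12.0, 30.5, 22.1, 8.4, 15.9])
y_pred = np.array([10.8, 28.9, 24.0, 9.1, 14.7])
print("MAE = %.2f, RMSE = %.2f, r = %.3f" % evaluate(y_true, y_pred))
```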
