**5. Experimental study**

In this study, different ensemble learning strategies were compared in terms of classification accuracy, precision (PRE), recall (REC), and f-measure (F-MEA). Four ensemble learning strategies were tested on six different real-world environmental datasets. The application was developed using the Weka open source data mining library.

### **5.1. Dataset description**

In this experimental study, six different datasets that are available for public use were selected to determine the best ensemble strategy. Basic characteristics of the investigated environmental datasets are given in **Table 3**.


| **ID** | **Dataset name** | **Year** | **Attributes** | **Instances** | **Type** | **Link** |
|---|---|---|---|---|---|---|
| 1 | Ozone (1 h) | 2008 | 74 | 2536 | Air | http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection |
| 2 | Ozone (8 h) | 2008 | 74 | 2534 | Air | http://archive.ics.uci.edu/ml/datasets/Ozone+Level+Detection |
| 3 | Leaf | 2014 | 17 | 340 | Ecology | http://archive.ics.uci.edu/ml/datasets/Leaf |
| 4 | Eucalyptus | 1991 | 20 | 736 | Soil | https://weka.wikispaces.com/Datasets |
| 5 | Forest type | 2015 | 28 | 523 | Ecology | https://archive.ics.uci.edu/ml/datasets/Forest+type+mapping |
| 6 | Cloud | 1971 | 8 | 108 | Rainfall | https://github.com/renatopp/arff-datasets/blob/master/statlib/nominal/cloud.arff |

**Table 3.** Environmental datasets and their characteristics.

### **5.2. Comparison of ensemble strategies**

Several practical issues should be considered when applying ensemble strategies to environmental data:

• *Computational cost*: increasing the number of classifiers usually increases the computational cost. To overcome this problem, users may predefine a suitable ensemble size limit, or classifiers can be trained in parallel.

• *Complex nature of environmental data*: it is necessary to deal with the high dimensionality and complexity of environmental data. To reduce the dimensionality of the feature vector, feature selection techniques such as principal component analysis, information gain, and ReliefF can be used. Another problem is dealing with heterogeneous data, which can be addressed by adding problem-specific science algorithms to the solution.

• *Post-processing*: another critical issue is determining the best voting mechanism (majority, weighted, average, etc.) for combining the outputs of base classifiers. Furthermore, the final results should be presented in an appropriate form to help users understand and interpret them easily.
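The voting mechanisms mentioned above (majority, weighted, average of probabilities) can be sketched as simple combiners. This is a minimal, library-free illustration with hypothetical helper names; the experiments themselves used Weka's `Vote` meta-classifier:

```python
from collections import Counter

def majority_vote(labels):
    """Plain majority voting: the most frequent predicted label wins."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Weighted voting: each classifier's vote counts according to its weight."""
    scores = Counter()
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

def average_of_probabilities(distributions):
    """Averaging: mean the per-class probability estimates, then take the argmax."""
    avg = Counter()
    for dist in distributions:
        for label, prob in dist.items():
            avg[label] += prob / len(distributions)
    return max(avg, key=avg.get)
```

Note that the schemes can disagree: three classifiers predicting `["a", "b", "b"]` give `"b"` under majority voting, but `"a"` under weighted voting if the first classifier carries most of the weight.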

Classification accuracies, precision, recall, and f-measure values for the applied algorithms were obtained using tenfold cross validation. A comparison of the classification accuracies of the applied algorithms for each dataset is displayed in **Figure 3**. Four weak learners (support vector machine (SVM), naive Bayes (NB), decision tree (DT) applied with the C4.5 algorithm, and K-nearest neighbor (KNN)) and four ensemble learners (bagging, random forest (RF), AdaBoost, and voting) were used to construct classification models from the environmental data. For each dataset, the base classifier for the ensemble learners was selected as the weak learner that gave the best classification accuracy on that dataset.
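Tenfold cross validation partitions each dataset into ten folds, training on nine folds and testing on the held-out fold in turn. A minimal index-level sketch (not the Weka implementation actually used in the study, which also stratifies by class):

```python
def kfold_indices(n_instances, k=10):
    """Yield (train, test) index lists; each fold serves exactly once as the test set."""
    folds = [list(range(i, n_instances, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]
```

Averaging the accuracy over the ten test folds gives the cross-validated estimate reported in the tables.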

The experimental results were obtained with optimum parameters (given in **Table 4**) found using grid search. For SVM, the best values of the complexity parameter, *C*, and the polynomial kernel exponent, *E*, were searched in the intervals [10<sup>k</sup> for *k* ∈ {−3, …, 3}] and [1–10], respectively. For DT, the confidence factor, *C*, for pruning and the minimum number of objects, *M*, per leaf were obtained in the intervals [0.05–0.95] and [1–10]. The number of neighbors, *N*, for the KNN classifier was selected in the range [1–25]. For the RF classifier, the number of randomly chosen attributes, *K*, and the number of iterations to be performed, *I*, were found in the intervals [0–15] and [10–100], respectively. The number of ensemble classifiers for bagging was 10 for each dataset. For the AdaBoost classifier, the weight threshold for weight pruning, *P*, and the number of iterations to be performed, *I*, were selected in the interval [10–100]. Voting was performed using the optimum parameters of the SVM, NB, DT, KNN, and RF classifiers.
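The tuning procedure is an exhaustive sweep over every parameter combination. The experiments used Weka; the following library-free Python sketch only illustrates the idea, with the `evaluate` callback (e.g., a cross-validated accuracy) and grid values as assumptions rather than the chapter's actual code:

```python
from itertools import product

def grid_search(evaluate, param_grid):
    """Try every combination in the grid; return the best-scoring parameter set."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. tenfold cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# The SVM search space from the text: C in {10^k : k = -3..3}, exponent E in 1..10
svm_grid = {"C": [10.0 ** k for k in range(-3, 4)], "E": list(range(1, 11))}
```

With 7 values of *C* and 10 of *E*, this evaluates 70 candidate SVM configurations per dataset.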

The objective of this experiment is to highlight the success of ensemble strategies on environmental data in terms of classification accuracy. According to the experimental results, it is apparent that the number of correctly classified instances increases when ensemble strategies are applied. In particular, the AdaBoost classifier provides a significant performance gain compared to the other models. SVM is superior to the other single learners; hence, most of the ensemble models selected it as the base learner.

**Figure 3.** Comparison of single and ensemble classifiers in terms of classification accuracies.


**Table 4.** Optimum classifier parameters corresponding to each dataset.

There are a number of cases that can result in poor classification performance, such as an insufficient number of instances in the dataset.


For example, because the "cloud" dataset contains very few instances, inferior results are obtained for most of the applied algorithms, as expected. However, even in such cases, while some algorithms fail, others manage to perform well (e.g., C4.5 DT with 82% accuracy). Classifier performance can also be enhanced by applying ensemble learning methods, as in the case of AdaBoost, which reaches 84% classification accuracy on the same dataset. AdaBoost is a powerful ensemble learning algorithm because its distribution update step ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier, giving them a chance of being classified correctly.
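The distribution update just described can be sketched in a few lines. This shows the standard AdaBoost reweighting rule with vote weight α = ½·ln((1 − ε)/ε), as a conceptual sketch rather than the exact code of Weka's AdaBoostM1:

```python
import math

def adaboost_update(weights, misclassified):
    """One AdaBoost distribution update: raise the weight of misclassified instances."""
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = 0.5 * math.log((1.0 - error) / error)  # vote weight of this classifier
    updated = [w * math.exp(alpha if miss else -alpha)
               for w, miss in zip(weights, misclassified)]
    total = sum(updated)  # renormalize to a probability distribution
    return [w / total for w in updated], alpha
```

Starting from uniform weights over four instances, one misclassification (ε = 0.25) doubles that instance's weight from 0.25 to 0.5, so the next base classifier focuses on it.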

Since classification accuracy alone is not enough to decide whether a learner is considerably good, the precision, recall, and f-measure values were also calculated for each model (**Table 5**). The values in the table also make it clear that applying ensemble strategies results in better classifier performance than relying on single learners.
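For reference, the three metrics follow directly from confusion-matrix counts; a minimal sketch for the binary case:

```python
def precision_recall_f(tp, fp, fn):
    """PRE, REC, and F-MEA from true-positive, false-positive, and false-negative counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f_mea = 2 * pre * rec / (pre + rec)  # harmonic mean of precision and recall
    return pre, rec, f_mea
```

For multi-class datasets such as "leaf", the per-class values are typically averaged (weighted by class frequency) to produce single PRE/REC/F-MEA figures like those in Table 5.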

**6. Conclusion and future work**


This study aims to provide helpful guidelines for future applications by presenting the advantages and challenges of ensemble-based environmental data mining and by comparing alternative ensemble strategies through experimental studies. It compares four different ensemble strategies (bagging, random forest, AdaBoost, and voting) on six real-world environmental datasets.


| **Dataset** | **Algorithm** | **PRE** | **REC** | **F-MEA** | **Dataset** | **Algorithm** | **PRE** | **REC** | **F-MEA** |
|---|---|---|---|---|---|---|---|---|---|
| Ozone (1-h) | SVM | 0.97 | 0.97 | 0.95 | Eucalyptus | SVM | 0.65 | 0.65 | 0.65 |
| | NB | 0.96 | 0.79 | 0.86 | | NB | 0.62 | 0.55 | 0.55 |
| | C4.5 DT | 0.94 | 0.97 | 0.95 | | C4.5 DT | 0.66 | 0.65 | 0.64 |
| | RF | 0.94 | 0.97 | 0.95 | | RF | 0.61 | 0.61 | 0.61 |
| | K-NN | 0.97 | 0.97 | 0.95 | | K-NN | 0.57 | 0.57 | 0.56 |
| | K-NN Bagged | 0.97 | 0.97 | 0.95 | | SVM Bagged | 0.66 | 0.66 | 0.66 |
| | K-NN AdaBoost | 0.97 | 0.97 | 0.95 | | SVM AdaBoost | 0.67 | 0.67 | 0.67 |
| | Vote | 0.94 | 0.97 | 0.95 | | Vote | 0.67 | 0.65 | 0.65 |
| Ozone (8-h) | SVM | 0.93 | 0.94 | 0.93 | Cloud | SVM | 0.37 | 0.40 | 0.37 |
| | NB | 0.92 | 0.73 | 0.80 | | NB | 0.49 | 0.36 | 0.32 |
| | C4.5 DT | 0.87 | 0.93 | 0.90 | | C4.5 DT | 0.82 | 0.82 | 0.82 |
| | RF | 0.91 | 0.93 | 0.91 | | RF | 0.51 | 0.51 | 0.51 |
| | K-NN | 0.87 | 0.93 | 0.90 | | K-NN | 0.33 | 0.35 | 0.32 |
| | SVM Bagged | 0.92 | 0.94 | 0.93 | | C4.5 DT Bagged | 0.55 | 0.54 | 0.54 |
| | SVM AdaBoost | 0.93 | 0.94 | 0.93 | | C4.5 DT AdaBoost | 0.84 | 0.84 | 0.84 |
| | Vote | 0.93 | 0.93 | 0.91 | | Vote | 0.47 | 0.49 | 0.46 |
| Forest types | SVM | 0.91 | 0.91 | 0.91 | Leaf | SVM | 0.78 | 0.76 | 0.76 |
| | NB | 0.86 | 0.86 | 0.86 | | NB | 0.75 | 0.74 | 0.74 |
| | C4.5 DT | 0.88 | 0.88 | 0.87 | | C4.5 DT | 0.66 | 0.65 | 0.64 |
| | RF | 0.90 | 0.90 | 0.90 | | RF | 0.77 | 0.76 | 0.76 |
| | K-NN | 0.89 | 0.89 | 0.89 | | K-NN | 0.69 | 0.67 | 0.67 |
| | SVM Bagged | 0.90 | 0.90 | 0.90 | | SVM Bagged | 0.72 | 0.72 | 0.71 |
| | SVM AdaBoost | 0.91 | 0.91 | 0.91 | | SVM AdaBoost | 0.79 | 0.78 | 0.78 |
| | Vote | 0.90 | 0.90 | 0.90 | | Vote | 0.77 | 0.77 | 0.76 |

Ensemble Methods in Environmental Data Mining. http://dx.doi.org/10.5772/intechopen.74393


**Table 5.** Precision (PRE), recall (REC), and f-measure (F-MEA) results using tenfold cross validation for respective algorithms in each dataset.
