environmental processes and systems. However, it is not yet well known how ensemble methodology can be used to improve the performance of a single method. To this end, this chapter presents the findings of a systematic survey of current work in the area and investigates the ability of different ensemble strategies for environmental data mining.

Ensemble learning in environmental data mining (ELEDM) can be viewed as a combination of three main areas: data mining (DM), machine learning (ML), and environmental science (**Figure 1**). ML in environmental science is learning-driven, meaning that machines teach themselves to recognize patterns by analyzing environmental data, whereas DM is discovery-driven, meaning that patterns are automatically discovered from environmental data. DM uses many ML methods, including ensemble learning methods.

**Figure 1.** Interdisciplinary structure of ensemble learning in environmental data mining (ELEDM).

The novelty and main contributions of this chapter are as follows. First, it provides a brief survey of ensemble learning used in environmental data mining. Second, it presents how an ensemble of classifiers can be applied to environmental data in order to improve the performance of a single classifier. Third, it is the first study that compares different ensemble strategies on different environmental datasets in terms of classification accuracy.

**2. Related work**

Data mining techniques have recently been used in environmental studies to process environmental data and convert it into useful patterns, yielding knowledge that supports the right decisions when dealing with environmental problems. Many of the techniques developed in data mining can be tailored to fit environmental data.

Recently, ensemble learning has been one of the active research fields in machine learning. It has therefore been used in a very broad range of areas such as marketing, banking, insurance, health, telecommunication, and manufacturing. In contrast to these studies, our work proposes an ensemble learning approach that combines several models to solve environmental problems.

### **2.1. Ensemble-based environmental data mining**

Ensemble classifiers have been applied to different environmental subjects, such as air [1–6], water [7–9], soil [10–12], plant [13], forests [14, 15], climate [16–18], noise [19], rainfall [20], energy [21–23], as well as living organisms [18, 24, 25]. Some of the ensemble-based environmental data mining studies are compared in **Table 1**. The table lists the scope of each study, the year it was performed, the algorithms used, the type of data mining task, the success rate with the validation method, and the ensemble strategy. In addition, if more than one algorithm is presented and compared, the proposed (most successful) one is also indicated. As the table shows, ensembles of models for classification or prediction have attracted more interest than ensemble clustering and anomaly detection [2, 22] in environmental science. Although ensemble clustering has been used in many areas, especially in bioinformatics, only a few studies [4, 25] have been conducted so far in environmental science.




Ensemble Methods in Environmental Data Mining http://dx.doi.org/10.5772/intechopen.74393 5

| Ref. | Year | Type | Description | Data mining task | Ensemble strategy | Algorithms | Validation |
|---|---|---|---|---|---|---|---|
| [9] | 2014 | Water | Predictive modeling of groundwater nitrate pollution | Prediction | 2 | RF regression, LR | ROC = 0.923 (for model RF-A), AUC = 0.911 (for model RF-B) |
| [25] | 2013 | Living organisms | Construction of habitat models for living species in the Lake Prespa, Macedonia; in the soils of Denmark; and in the Slovenian rivers | Clustering | 1, 2 | RF and bagged multitarget predictive clustering tree (PCT) and single-target DT | Tenfold cross validation; RRMSE |
| [5] | 2012 | Air | Prediction of the Macau's air pollution index | Prediction | 1 | Bootstrap sampling with replacement and random sampling without replacement using ANFIS method as base learner | RMSE = 12.21 (ANFIS with random sampling) |
| [2] | 2011 | Air energy | Detect overconsumption of fuel in aircrafts | Anomaly detection | 1 | Bootstrap sampling on each of the regression tree (tree), elastic network, GP, and stable GP regression methods | ROC = 0.90; NRMSE varied consistently between 85 and 90% |

ANN, artificial neural network; SVR, support vector regression; PCA, principal component analysis; MLP, multilayer perceptron; SOM, self-organizing maps; EAD, ensemble anomaly detection; FIS, fuzzy inference system; GP, Gaussian process; MSE, mean squared error; RMSE, root-mean-square error; TPR, true positive rate; ROC, receiver operating characteristic.

**Table 1.** Comparison of ensemble-based environmental data mining studies.

The idea of using an ensemble of classifiers rather than the single best classifier has been proposed in several environmental data mining studies [5, 11, 26]. It is apparent that ensemble learners boost the performance of single classifiers. Different models pick up different patterns in data. By pooling all these predictions together, as long as they are reasonably independent, informed, and diverse, the outcomes tend to be better.

One of the most popular ensemble learning strategies, *bagging*, is also well adapted to develop models for solving environmental problems. For example, it has been used to forecast the air pollution level of a region [5] and to establish habitat models for living species [25].

The second type of ensemble learning strategy, the *random forest* (RF) algorithm, has also been applied for classifying environmental data. It has been applied to predict pollutant occurrences in groundwater [9], to determine the impact of climate change on the habitat suitability for a fish species [18], and to predict dust storms accurately [26].

Another ensemble learning strategy (*boosting*), the AdaBoost algorithm, has been used in various types of environmental applications, such as classifying complex land use/land cover categories of desert landscapes using remotely sensed data [11], solving the problem of rare-class classification in dust storm forecasting [26], and discovering plant species for automatic weed control [27].

Training with different algorithms in each ensemble (*voting*) is another commonly used ensemble strategy in environmental science. Examples include identifying anomalous consumption patterns in building energy consumption [22] and forecasting air pollutant values of a region [4].

Unlike existing studies, the study presented in this chapter focuses on applying four distinct ensemble strategies to environmental datasets using (i) different training sets formed by random sampling with replacement (bagging), (ii) different training sets obtained by random instance and feature subset selection (random forest), (iii) different training sets using random sampling with replacement over weighted data (AdaBoost), and (iv) different algorithms (voting).

### **2.2. Advantages of ensemble-based environmental data mining**

Some of the advantages of environmental data mining are given below:

• Prediction of parameters expected based on other parameters or under different cases in environmental studies, for example, prediction of rainfall [20], climate change [16–18], species richness/diversity [24, 25], and atmospheric parameters [28].

• Construction of models to reduce the consumption of energy [21–23] and raw materials [2] such as wood, grass, metal, steel, plastics, glass, paper, fuel, and natural gas.

• Clustering the items in environmental data to describe the current situation more clearly and to plan different activities for different clusters [4, 25].

• Analyzing environmental data toward a better quality control such as air quality [1, 5, 6] and water quality [7–9].

• Identifying unexpected patterns in environmental data using a data mining algorithm and detecting anomalies in environmental data [2, 22], such as bad values, changes, errors, noise, fraud, and abnormal activities, in order to raise an alarm.

• Determination of the most important factor that affects the environment using a data mining technique such as decision tree and random forest [29].

• Development of a model to manage resources effectively [2, 21, 23], including environmental resources such as air, water, and soil; flow resources such as solar power [30] and wind energy; and natural resources such as coal, gas, and forests.

• Analyzing the records of financial transactions related to environmental economics for better decision-making, i.e., investigating the financial impacts of environmental policies.

• Discovering patterns that can be used for better waste management and recycling.

• Classification of environmental audio and environmental noise [19].

• Processing ecological data for better modeling ecological systems [24, 25].

• Using ensemble methods as a preprocessing step before performing the essential environmental study.
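Section 2.1 notes that pooled predictions tend to be better when the base models are reasonably independent and informed. As a rough numerical illustration of that remark, the sketch below computes the accuracy of a majority vote under two simplifying assumptions that are not taken from the chapter: the task is binary, and every base classifier is correct with the same probability `p`, independently of the others. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent binary
    classifiers, each correct with probability p, votes for the right class."""
    if n_models < 1 or n_models % 2 == 0:
        raise ValueError("n_models must be a positive odd number")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    smallest_majority = n_models // 2 + 1
    # Sum the binomial probabilities of every outcome in which at least
    # smallest_majority of the classifiers are correct.
    return sum(
        comb(n_models, k) * p**k * (1.0 - p) ** (n_models - k)
        for k in range(smallest_majority, n_models + 1)
    )


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under these assumptions the vote improves on a single model only when `p` exceeds 0.5; with weaker base models the pooled result is worse. Real base classifiers are rarely fully independent, so the gain in practice is typically smaller than this idealized calculation suggests.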

and grading. While weighting methods are useful when combining classifiers built from a single learning algorithm and they have comparable success, meta-learning is a good choice for cases in which base classifiers consistently classify correctly or consistently misclassify.


**4. Ensemble strategies**

In order to construct an ensemble model, any of the following strategies can be performed:

**4.1. Strategy 1: different training sets using random sampling with replacement**

One ensemble strategy is to train different base learners on different subsets of the training set. This can be done by random resampling of a dataset (i.e., *bagging*; **Figure 2a**). When we train multiple base learners with different training sets, it is possible to reduce variance and therefore error.

**4.2. Strategy 2: different training sets obtained by random instance and feature subset selection**

The combination of bagged decision trees is constructed similarly to Strategy 1, with one significant adjustment: random feature subsets are used (i.e., *random forest*; **Figure 2b**). When there are enough trees in the forest, a random forest classifier is less likely to overfit. It is also useful for reducing the variance of low-bias models, besides handling missing values easily.

**4.3. Strategy 3: different training sets using random sampling with replacement over weighted data**

This ensemble strategy can be implemented by weighted resampling of the dataset serially, focusing on difficult examples that were not correctly classified in the previous steps (i.e., *boosting*; **Figure 2c**). Boosting helps to decrease the bias of otherwise stable learners such as linear classifiers or univariate decision trees, also known as decision stumps.

**4.4. Strategy 4: different algorithms**

The other ensemble strategy (i.e., *voting*; **Figure 2d**) is to use different learning algorithms to train different base learners on the same dataset. The ensemble therefore includes diverse algorithms that each take a completely different approach. The main idea behind this kind of ensemble learning is to take advantage of the diversity of classification algorithms when facing complex data.

**4.5. Characteristic of different ensemble classifiers**

Although ensemble classifiers share the common goal of constructing multiple, diverse, and predictive models and finally combining their outputs, each strategy is carried out in a different way, using different training sets, combiners, or inducers. **Table 2** summarizes the properties of the different ensemble strategies, the popular algorithms under each category, and the pros and cons of each ensemble classifier.
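The sampling and combination steps that distinguish the four strategies in this section can be sketched in plain Python. This is a minimal illustration, not the chapter's implementation: base learners are omitted, all function names are hypothetical, and the reweighting step uses the common discrete AdaBoost update, which is an assumption because the chapter does not spell out the formula.

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Strategy 1 (bagging): draw len(rows) rows uniformly with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Strategy 2 (random forest): in addition to a bootstrap sample, each
    tree (or split) only sees k randomly chosen feature indices."""
    return sorted(rng.sample(range(n_features), k))


def adaboost_reweight(weights, misclassified):
    """Strategy 3 (boosting): raise the weight of misclassified rows.

    Assumes weights sum to 1 and uses the discrete AdaBoost update
    (an assumption; the weighted error must lie strictly in (0, 1)).
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - error) / error)
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_sample(rows, weights, rng):
    """Strategy 3, continued: resample with replacement over weighted data."""
    return rng.choices(rows, weights=weights, k=len(rows))


def majority_vote(predictions):
    """Strategy 4 (voting): combine labels predicted by different algorithms."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(6))  # stand-ins for six training instances
    print(bootstrap_sample(rows, rng))
    print(random_feature_subset(n_features=8, k=3, rng=rng))
    weights = adaboost_reweight([1 / 6] * 6,
                                [True, False, False, False, False, True])
    print([round(w, 3) for w in weights])
    print(weighted_sample(rows, weights, rng))
    print(majority_vote(["polluted", "clean", "polluted"]))
```

In Strategies 1 to 3 the diversity among base learners comes from the data each learner sees, whereas in Strategy 4 the data stay fixed and the diversity comes from the learning algorithms themselves, so only the combination step is needed.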

