**2. Materials and methods**

### **2.1. Description of case study and preparation of data**

Ensemble modeling by data-driven methods was applied for the 2-day-ahead prediction of flows on the Hron River in Slovakia. The watershed of this river is a sub-basin of the Danube River. This task was accomplished using data observed in the period from 01-01-1984 to 31-12-2000. Specifically, the average daily flows [m<sup>3</sup> s<sup>−1</sup>], the average daily temperatures [°C] and the daily rainfall depths [mm] were used.

The prediction of flows at the Banska Bystrica gauging station (**Figure 1**) serves as the case study in this paper. Each row in the input file for this task includes the date of the predicted flow, the predicted flow itself (two days ahead; these are the modeled data, but their values are necessary for the training and testing mechanism), the input flow data from three measuring stations, the temperatures from five meteorological stations, and the precipitation from 51 stations. All the input data were included in the input dataset from 1, 2, 3 and 4 days before the date of the predicted flow. This means that each data row contains a total of 238 variables (the date, the predicted flow and the 59 station series at four lags). Because daily data were used from 01-01-1984 to 31-12-2000, the dataset contains 6209 rows.
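The layout of such a row can be sketched in R as follows; the data frames `flows` (3 columns), `temps` (5 columns) and `precip` (51 columns), the date vector `dates` and the function itself are hypothetical illustrations, not the authors' code.

```r
# A sketch (not the authors' code) of assembling one input row per predicted
# date: the date and the predicted flow plus the 59 station series at lags of
# 1-4 days, i.e. 2 + 59 * 4 = 238 variables per row.
# 'dates', 'flows' (3 columns), 'temps' (5 columns) and 'precip' (51 columns)
# are hypothetical day-aligned inputs; 'target_col' selects the Banska Bystrica flow.
build_rows <- function(dates, flows, temps, precip, target_col = 1, lags = 1:4) {
  preds <- as.matrix(cbind(flows, temps, precip))        # 3 + 5 + 51 = 59 series
  rows <- lapply(seq(max(lags) + 1, nrow(preds)), function(t) {
    lagged <- as.vector(preds[t - lags, , drop = FALSE]) # 59 * 4 = 236 lagged values
    c(date = as.numeric(dates[t]), target_flow = flows[t, target_col], lagged)
  })
  do.call(rbind, rows)
}
```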

Some data preprocessing procedures had to be accomplished: cleansing the data, formatting it, imputing the missing values and normalizing it. These operations are not described here, because they are common procedures in data mining. A few words follow about the division of the data and the sampling, which are important from the point of view of this paper.

The correct prediction of high flows is the most important task in flood prediction. The period from 1996 to 2000 includes many situations with high flows and floods, which was the reason for its selection as the testing period. The remaining years (1984–1996) were used for training (**Table 1**).

**Figure 1.** Map of the area studied within the Hron River watershed.

**Table 1.** Statistics of the data.

A sampling of the data was also carried out to obtain a balanced training dataset and, at the same time, a dataset that places fewer demands on the hardware and
CPU time. Because of these computational demands, sampling as a form of data reduction is a particularly important procedure in ensemble modeling: such modeling requires many runs of many algorithms, and the computational load rises with the amount of data used for training. A proper sampling methodology should be chosen with respect to the properties of the data and of the problem studied. In streamflow prediction, a large number of relatively low flows is usually available (as in the case studied here), which led to the authors' decision to filter some of them out. High flows, by contrast, are relatively rare. Because high flows are the most important data for flood prediction, the data were filtered nonuniformly, so that all the input rows with this rare, large-flow data remained in the final training dataset. Exactly the same sampling of the same data was described in the authors' previous work [20], where more details can be found.
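The nonuniform filtering can be sketched as follows; the threshold and retention fraction are purely illustrative assumptions, not the values used by the authors, and `train` is assumed to be a data frame with a `target_flow` column.

```r
# Sketch of the nonuniform sampling: keep every row whose target flow exceeds
# a threshold and only a random fraction of the remaining low-flow rows.
# 'threshold' and 'keep_frac' are hypothetical values for illustration.
sample_training_rows <- function(train, threshold = 50, keep_frac = 0.3, seed = 1) {
  set.seed(seed)
  high <- train[train$target_flow > threshold, ]          # keep all high-flow rows
  low  <- train[train$target_flow <= threshold, ]
  low  <- low[sample(nrow(low), round(keep_frac * nrow(low))), ]
  rbind(high, low)
}
```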

### **2.2. Methodology**

The goal of the proposed ensemble methodology is to combine the predictions of several models in order to improve on the robustness/generalizability that could be obtained from any of the constituent models. The proposed ensemble methodology for predicting the river flows is divided into four equally important steps (**Figure 2**). The preparation of the data was described in the previous part of this chapter. This section has two subsections: in the first, the members of the ensemble are described, whereas the second contains a description of the optimization of each model's weight by the harmony search methodology. The final prediction is computed as the weighted average of the base learners using these weights.
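Written out (this expression is implied by the description above rather than stated explicitly in the chapter), the ensemble forecast for a given day is the weighted combination

$$\hat{y}\_{ens} = \sum\_{k=1}^{K} w\_k \, \hat{y}\_k$$

where *ŷk* is the prediction of the *k*-th ensemble member and *wk* is its weight, found by the harmony search described in Section 2.2.2.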

### *2.2.1. Selection and training of ensemble members*

In contrast to the usual approach, in which an ensemble consists of less powerful algorithms, the authors' intention was to evaluate the use of strong algorithms as members of the ensemble. The choice of "strong" algorithms is based on papers that evaluate existing data mining algorithms [21, 22].

**Figure 2.** Proposed steps for the development of ensemble predictions of river flows.

A grid search combined with a repeated cross-validation methodology was used for finding the parameters of all the models included in the ensemble [6, 7]. In this approach, a set of each model's parameters from a predetermined grid is sent to the parameter-evaluating algorithm. A 5-times repeated 10-fold cross-validation was used to find the best parameters for the final models. Sampling of the data (as mentioned in the data preparation part of this chapter) was used in this process, because each basic algorithm runs many times in such a strategy.
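This tuning strategy can be sketched with the caret package (an assumption; the chapter does not name the tuning software), shown here for the SVM member with an illustrative grid; `train_x` and `train_y` denote the predictors and target flows of the sampled training data.

```r
library(caret)

# 5-times repeated 10-fold cross-validation over a predetermined grid;
# shown here for the SVM member, the other members are tuned analogously.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid <- expand.grid(sigma = c(5e-4, 1e-3, 5e-3),   # illustrative grid values
                    C     = c(1, 10.5, 100))

svm_fit <- train(x = train_x, y = train_y,         # sampled training data
                 method = "svmRadial",
                 tuneGrid = grid,
                 trControl = ctrl)
svm_fit$bestTune                                   # parameters with the best CV error
```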

Only a sketch of each algorithm is provided hereinafter, because a number of algorithms are used in this work, and it is neither possible nor useful to explain them more thoroughly in this paper. For the interested reader, references to the relevant literature provide detailed information.

### *2.2.1.1. Support vector machines (SVM)*

A support vector machine (SVM) [23] is a very effective supervised machine learning method for various machine learning tasks. Its specific feature is the kernel trick: a nonlinear mapping is used to transform the original training data of a nonlinear problem (which is also our case) into a higher-dimensional space. There, the SVM learns a nonlinear function indirectly and more easily: it learns a linear function in the space induced by the particular kernel, which corresponds to a nonlinear function in the original space.

The next important concept in the SVM methodology is to fully ignore small errors. In SVM, bounds for the regression are set by defining a loss function that ignores errors situated within the distance ε of the true value. This type of function is called the epsilon-insensitive loss function. As a consequence, good generalization of the SVM is obtained, because not all the input vectors are used, but only the so-called support vectors, which are the training samples that lie outside the boundary of the ε-tube.

In this chapter, the ε-SVM model was created by: (1) choosing a radial basis kernel with parameter sigma = 0.0005; (2) specifying the *ε* parameter to be equal to 0.1 and (3) specifying the capacity *C* = 10.5. All parameters were found by a grid search.
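One possible way to build this model in R is with the kernlab package (an assumption, since the chapter does not name the SVM implementation used); `train_x`, `train_y` and `test_x` are the hypothetical training predictors, training target flows and testing predictors.

```r
library(kernlab)

# epsilon-SVR with an RBF kernel and the tuned parameters reported above.
svm_model <- ksvm(x = as.matrix(train_x), y = train_y,
                  type    = "eps-svr",
                  kernel  = "rbfdot",
                  kpar    = list(sigma = 0.0005),
                  C       = 10.5,
                  epsilon = 0.1)

svm_pred <- predict(svm_model, as.matrix(test_x))   # 2-day-ahead flow predictions
```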

### *Multilayer perceptron (MLP)*

Artificial neural networks (ANNs) are the most popular and well-known data-driven methodology; they have been described in various literature sources, e.g. [24]. Briefly summarized, a multilayer perceptron, the most commonly used type of neural network and the one used in this work, consists of input, hidden and output layers, all of which contain processing elements, or neurons. The input and output layers contain as many neurons as the model has input and output variables, respectively. The so-called learning involves determining the number, types and particular properties of the neurons in the hidden layer. This layer is used for the transformation of the inputs to the outputs. An MLP trained with the back-propagation algorithm was used for generating the flow predictions in this study. The number of neurons in the hidden layer was found by a grid search and is equal to 6. Neurons with a logistic activation function were used in the hidden layer, and a linear activation function was used in the output layer.
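As an illustration only, such a network can be fitted in R with the nnet package; this is an assumption (the chapter does not name the implementation, and nnet optimizes the weights by BFGS rather than classical backpropagation), but it reproduces the reported architecture: six logistic hidden neurons and a linear output.

```r
library(nnet)

# Single-hidden-layer perceptron: 6 logistic hidden neurons (nnet default
# activation), linear output unit (linout = TRUE) for the regression task.
set.seed(1)
mlp_model <- nnet(x = train_x, y = train_y,
                  size = 6, linout = TRUE,
                  maxit = 500, MaxNWts = 5000)

mlp_pred <- predict(mlp_model, test_x)
```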

### *2.2.1.2. Random forest (RF)*

Random forests (RF) [25] are formed by a set of trees, which can be either classification or regression trees, depending on the problem being addressed. An RF prediction is an average of many trees (weak learners) grown on bootstrap samples of the training data. The user chooses the number of trees in the forest (ensemble). Each tree is trained using a different bootstrap sample, which ensures that different trees are obtained. For the regression task, the values predicted by the individual trees are averaged to obtain the final random forest prediction. In this work, the number of variables randomly sampled as candidates at each tree split was optimized with the help of a grid search, with the final value equal to 123. The minimum size of the terminal nodes was set at 5 and the number of trees at 500.
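A sketch using the randomForest package (an assumption, since the implementation is not named in the chapter) with the parameters reported above:

```r
library(randomForest)

# 500 regression trees; 123 candidate variables tried at each split (mtry),
# terminal nodes are not split below 5 observations (nodesize).
set.seed(1)
rf_model <- randomForest(x = train_x, y = train_y,
                         ntree = 500, mtry = 123, nodesize = 5)

rf_pred <- predict(rf_model, test_x)
```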

### *2.2.1.3. Multiple linear regression (MLR)*

Multiple linear regression (MLR) analysis is generally used to find the relevant coefficients (*a*, *b*, *c*,…, *intercept*) in the following model:

$$Y = aX\_1 + bX\_2 + cX\_3 + \dots + intercept\tag{1}$$

This is a simple, well-known methodology, which the authors included in this paper mainly for the purposes of comparison with other, more powerful, methods.
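In R this baseline reduces to an ordinary least-squares fit; a minimal sketch, assuming a hypothetical data frame `train_df` whose target column is `target_flow`:

```r
# Ordinary least squares: Y = a*X1 + b*X2 + ... + intercept, as in Eq. (1).
mlr_model <- lm(target_flow ~ ., data = train_df)
mlr_pred  <- predict(mlr_model, newdata = test_df)
coef(mlr_model)[1:5]   # the intercept and the first few fitted coefficients
```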

### *2.2.1.4. Generalized linear model with an elastic-net (GLMNET)*

In this method, as in the previous case, a linear model is applied for the flow prediction. The improvement in comparison with the basic multiple linear model is the use of a regularization technique while searching for the parameters *a*, *b*, *c*,… of Eq. (1).

Regularization introduces an additional criterion (or penalty) into the objective function of the optimization problem in order to prevent overfitting and to obtain a more general model. In this case, the least-squares method for linear regression is the optimization problem in question. Various types of regularization exist. Ridge regression uses a penalty that limits the size of the coefficients in Eq. (1). Lasso uses a type of penalty that tries to set some coefficients to exactly zero. The elastic-net is a compromise between these two techniques and is the one used in this work. The software provided by the authors of this regularization method was used in the work presented in this paper [26].
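With the glmnet software [26], the elastic-net fit can be sketched as follows; the mixing parameter `alpha = 0.5` is an illustrative value between the ridge (alpha = 0) and lasso (alpha = 1) extremes, not one reported by the authors.

```r
library(glmnet)

# Elastic-net regularized linear regression; the penalty strength lambda is
# chosen by internal cross-validation, alpha mixes the ridge and lasso penalties.
cv_fit <- cv.glmnet(as.matrix(train_x), train_y, alpha = 0.5)

glmnet_pred <- predict(cv_fit, newx = as.matrix(test_x), s = "lambda.min")
```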

### *2.2.1.5. Multivariate adaptive regression splines (MARS)*

MARS [27] constructs regression relations from a set of coefficients *β* and linear basis functions *h* that are determined from the training data. The general MARS model equation is given as:

$$\mathbf{y} = f(\mathbf{X}) = \boldsymbol{\beta}\_o + \sum\_{m=1}^{M} \boldsymbol{\beta}\_m \boldsymbol{h}\_m(\mathbf{X}) \tag{2}$$

The basis function *h*(*x*) takes one of the following three forms:

**1.** A constant (the intercept).

**2.** A function of the form *max*(0, *x* − *const*) or *max*(0, *const* − *x*). MARS selects the values of *const* for the knots of this function. These breakpoints define the region of application for a particular linear equation.

**3.** A product of two or more of the above-mentioned functions. Interactions between two or more variables are modeled in this case.

The best parameters of the multivariate adaptive regression splines were found by a grid search procedure; the maximum degree of interaction is equal to 1, and the maximum number of terms (including the intercept) in the pruned model was found to be 31.
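A sketch with these settings, assuming the earth package (the chapter does not name the MARS implementation used):

```r
library(earth)

# MARS with additive terms only (degree = 1) and at most 31 terms
# (including the intercept) retained after pruning (nprune = 31).
mars_model <- earth(x = train_x, y = train_y, degree = 1, nprune = 31)

mars_pred <- predict(mars_model, test_x)
```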

In recent years, boosting has developed into one of the most important techniques for fitting regression models in high-dimensional data settings. The authors therefore decided to include in the proposed ensemble the three boosting models described below. Boosting, or additive modeling [28], expresses the function sought as a weighted sum of basis functions as follows:

$$f(\mathbf{x}) = \sum\_{m} \beta\_{m} f\_{m}(\mathbf{x}) = \sum\_{m} \beta\_{m} b(\mathbf{x}; \gamma\_{m}) \tag{3}$$

The basis functions *b* are dependent on the type of boosting method, and the parameters (*βm* and *γm*) are assessed by minimizing a loss function (e.g. a mean square error) over the training data. Forward stagewise fitting is used for estimating *βm* and *γm* sequentially from *m* = 1 to *n*. For example, for boosted trees with a squared error loss, we fit a least-squares regression tree to the residuals of the previous iteration.

### *2.2.1.6. Boosted linear models (B\_GLM)*

In this case, a linear model is fitted using gradient boosting, where component-wise linear models are utilized as base learners. The methodology is described in Ref. [29]. In this work, the R package mboost and the glmboost function with a default setting were used for this methodology [30, 31]. The number of initial boosting iterations was found by a grid search and is equal to 150; the shrinkage parameter was set to 0.1.
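The corresponding mboost call can be sketched as follows, assuming a hypothetical data frame `train_df` with the target column `target_flow`:

```r
library(mboost)

# Componentwise gradient boosting with linear base learners:
# 150 boosting iterations (mstop), shrinkage / step length nu = 0.1.
bglm_model <- glmboost(target_flow ~ ., data = train_df,
                       control = boost_control(mstop = 150, nu = 0.1))

bglm_pred <- predict(bglm_model, newdata = test_df)
```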

### *2.2.1.7. Gradient boosting with smooth components (B\_GAM)*

A (generalized) additive model is fitted in this case using a boosting algorithm based on component-wise univariate base learners (where only one variable is updated in each iteration of the algorithm) in combination with the *L*<sup>2</sup> loss function. A spline, which is a sufficiently smooth, piecewise-defined polynomial function, is suitable for this task. It possesses a high degree of smoothness at the places where the polynomial pieces connect (which are known as "knots"). In this study, P-splines with a B-spline basis [32] were used as the base learners. In each iteration of the gradient-boosting algorithm, a base learner is fitted to the negative gradient of the *L*<sup>2</sup> loss function. The current estimate of the predictor function is then updated with this estimate of the negative gradient, which automatically results in an additive model fit. In this work, the gamboost function of the R mboost package [33] was used to fit the flow prediction model. The number of initial boosting iterations was found by a grid search and is equal to 100; the shrinkage parameter was set to 0.1.
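A sketch of the corresponding gamboost call (same hypothetical data-frame layout as above); `baselearner = "bbs"` selects the P-spline base learners with a B-spline basis:

```r
library(mboost)

# Componentwise boosting with P-spline (bbs) base learners and the L2 loss
# (the Gaussian family is the mboost default): mstop = 100, nu = 0.1.
bgam_model <- gamboost(target_flow ~ ., data = train_df,
                       baselearner = "bbs",
                       control = boost_control(mstop = 100, nu = 0.1))

bgam_pred <- predict(bgam_model, newdata = test_df)
```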

### *2.2.1.8. Gradient boosting machines (GBM)*

Gradient boosting machines (GBM) are one of the most powerful boosting methods. Similarly to the other boosting methods, gradient boosting combines weak learners into a single strong learner. In GBM, decision trees (regression trees in our case) are usually employed as the weak learners. The weak learners are fitted sequentially on a continually modified selection of the data: in this stepwise procedure, the training set is reweighted for the current iteration according to the accuracy of the previously fitted model. The final prediction is obtained as a weighted average. The gradient boosting used in this work is implemented in the R package gbm [34] and is freely available. The total number of trees to fit is equal to 700 in this work, and this parameter was found by a grid search. The shrinkage in GBM is controlled by the parameter *υ*, which was set in this work to 0.01 (the default value). The maximum depth of the variable interactions was also found by a grid search that considered up to 10-way interactions.
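A sketch of the corresponding gbm call [34] is shown below; `interaction.depth = 10` reproduces the upper bound explored by the grid search (the finally selected depth is not reported in the chapter), and `train_df`/`test_df` with a `target_flow` column are hypothetical data frames.

```r
library(gbm)

# Gradient boosting with regression trees: 700 trees, shrinkage 0.01,
# squared-error loss (distribution = "gaussian").
set.seed(1)
gbm_model <- gbm(target_flow ~ ., data = train_df,
                 distribution = "gaussian",
                 n.trees = 700, shrinkage = 0.01,
                 interaction.depth = 10)

gbm_pred <- predict(gbm_model, newdata = test_df, n.trees = 700)
```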

### *2.2.2. Harmony search (HS)*

The harmony search (HS) algorithm [17] was adopted from the musical process of finding "pleasant harmonies" through improvisation. The five fundamental steps of HS can be summarized as follows:

Step 1. Definition of the decision variables and initialization of the algorithm parameters. The HS search parameters are initialized: the harmony memory size (HMS), the harmony memory consideration rate (HMCR), the pitch adjustment rate (PAR) and the maximum number of improvisations (NI). The objective function *f*(*x*), which has to be minimized (or maximized), is also defined in this step.

Step 2. Initialization of the harmony memory. The harmony memory is a memory location (a matrix) where the solution vectors (sets of weights) and the corresponding objective function values are stored. The initial harmony memory consists of different, randomly generated solution vectors.

Step 3. The generation of a new harmony, inspired by the improvisation process in music, is performed in this step. The new harmony represents a new solution of the given optimization problem. Its generation consists of three basic procedures: (1) selection of a harmony from the memory, controlled by the parameter HMCR; (2) pitch adjustment, controlled by the parameter PAR; and (3) picking a random value with probability 1 − HMCR. A more detailed description of these HS operators can be found in the existing HS literature, e.g. [18].

Step 4. Computation of the new solution's objective function. If the new harmony has a better objective function value than the worst harmony in the harmony memory, the worst harmony vector in the memory is replaced by this new harmony vector.

Step 5. Steps 3 and 4 are repeated until the termination criterion is satisfied. In this work, the harmony search stops if there is no improvement in the objective function during the last 500 iterations or if the total (predefined) number of iterations is reached.
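A compact sketch of how HS can be used to optimize the ensemble weights is given below. It follows the five steps above under simplifying assumptions that are not taken from the chapter: the objective function is the RMSE of the weighted-average prediction, `preds` is a matrix with one column of predictions per ensemble member, `obs` holds the corresponding observed flows, and the parameter values (HMS, HMCR, PAR, bandwidth, stopping limits) are illustrative only.

```r
# Harmony search for ensemble weights (illustrative implementation).
# preds: n x K matrix of base-learner predictions; obs: observed flows.
hs_weights <- function(preds, obs, HMS = 20, HMCR = 0.9, PAR = 0.3,
                       bw = 0.05, NI = 5000, patience = 500) {
  K <- ncol(preds)
  rmse <- function(w) sqrt(mean((preds %*% w - obs)^2))

  # Steps 1-2: random initial harmony memory (rows = candidate weight vectors)
  HM  <- matrix(runif(HMS * K), nrow = HMS)
  fit <- apply(HM, 1, rmse)

  best <- min(fit); stall <- 0
  for (it in seq_len(NI)) {
    # Step 3: improvise a new harmony component by component
    new <- numeric(K)
    for (k in seq_len(K)) {
      if (runif(1) < HMCR) {
        new[k] <- HM[sample(HMS, 1), k]           # memory consideration
        if (runif(1) < PAR)
          new[k] <- new[k] + runif(1, -bw, bw)    # pitch adjustment
      } else {
        new[k] <- runif(1)                        # random selection
      }
    }
    new <- pmin(pmax(new, 0), 1)                  # keep weights in [0, 1]

    # Step 4: replace the worst harmony if the new one is better
    f_new <- rmse(new)
    worst <- which.max(fit)
    if (f_new < fit[worst]) { HM[worst, ] <- new; fit[worst] <- f_new }

    # Step 5: stop after 'patience' iterations without improvement
    if (min(fit) < best) { best <- min(fit); stall <- 0 } else stall <- stall + 1
    if (stall >= patience) break
  }
  HM[which.min(fit), ]                            # best weight vector found
}
```

The weight vector returned in this way is then applied to the members' predictions for the testing period to obtain the final ensemble forecast.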
