**1. Introduction**

Effective water resources management is one of the most crucial environmental challenges of our time. The inundation and flooding of landscapes and urban areas are serious problems that cause immense damage to infrastructure and loss of human life in various parts of the world (e.g., recently in Australia, South America, Pakistan, West Africa and China, to mention just a few). Flood prevention requires various management tools, among which flow prediction models occupy an important place. Flood warnings issued several days in advance could give civil protection authorities and the public the necessary preparation time and could reduce the socio-economic impacts of flooding [1].

This work presents the application of a data-driven model for streamflow prediction, which can be one of the tools for the preventive protection of a population and its property. There are various types of models for flow prediction; physically based, conceptual and data-driven models are among the best known. Physically based models depend mainly on our knowledge of the physical laws governing a watershed and on the corresponding geographical database, which serves as the information background for applying those laws. Data-driven models, by contrast, extract knowledge only from the monitored data describing the inputs and outputs of the watershed, e.g., time series of precipitation, temperatures, river flows, etc. Applying physically based models operationally would require updating all the detailed information about a watershed and its state variables on a day-to-day or even hour-to-hour basis, which is not feasible. For this reason, data-driven models are much more suitable for this task.

This paper focuses on the application of a supervised learning methodology for flow prediction, namely a proposed ensemble approach, with the aim of improving the precision of the results of such modeling. In a typical supervised learning scheme, a set of input data instances whose output values are known, referred to as a training set, is given. The goal is to construct a model that computes the outputs for new instances, for which the outputs are unknown.

Different models frequently show different capacities to reproduce certain aspects of hydrological processes [2], so the application of a single model often leads to predictions that are more precise in some parts of the problem domain but less suitable in others [3].

The recognition of this fact has led to the simultaneous use of an ensemble, or committee, of models. Many researchers have shown that combining the outputs of many predictors can produce more accurate predictions than any of the individual predictors [4–6]. The individual predictors should be accurate enough and also different from each other [7–9]. Sampling different training datasets, using different learning architectures and using different subsets of variables are the most popular approaches used to achieve such diversity [5, 10] in data-driven modeling. For example, in bagging [4], each predictor is trained on a different training set sampled from all the available training data. Boosting algorithms are a different, powerful class of ensemble learners that implement forward stagewise additive modeling. In each stage the data are reweighted: the examples that produced the worst predictions gain weight, and the examples that produced precise results lose weight. Thus, the next base learner focuses more on the examples that were previously predicted incorrectly. Stacking, another ensemble learning concept, tries to learn which base models are more reliable than others by using a data-driven meta-algorithm, whose task is to discover how best to combine the outputs of the base models to achieve the final result.
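The bagging idea described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the base learner (a one-variable least-squares line), the function names `fit_line`, `bagged_fit` and `bagged_predict`, and the default of 25 members are all assumptions chosen only to keep the example short and dependency-free.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (illustrative base learner); returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def bagged_fit(xs, ys, n_models=25, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n indices with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Ensemble prediction: unweighted average of the members' predictions."""
    return mean(a + b * x for a, b in models)
```

Each member sees a slightly different training set, which is the source of diversity in bagging; boosting would instead reweight the examples between stages, and stacking would replace the plain average with a learned combiner.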

In the field of streamflow forecasting, various papers have been published [3] in which the data-driven ensemble modeling approach has been studied, but they usually focus on climate inputs obtained by ensemble weather modeling, which is not the subject of this paper. A selection of existing works relevant to the focus of this article follows.

154 Time Series Analysis and Applications

A modular approach that uses different neural network rainfall-runoff models according to the hydrologic situation in a catchment was presented in Ref. [11]. That work proposes applying a specific model from a set of trained models to particular input data: the model is chosen according to which one was trained on the hydrological and meteorological conditions most similar to those inputs. A clustering technique based on self-organizing maps was applied to manage the model selection. A boosting application is presented in Ref. [12], where the authors demonstrated the advantages of an improved version of boosting, namely AdaBoost.RT, which is compared to other learning methods on several benchmark problems and on two problems involving river flow forecasting. In a recent study [13], the authors investigated the potential of bagging and boosting for building classification and regression tree ensembles to refine the accuracy of streamflow predictions. They report that the bagged model performs slightly better than the boosted model in the testing phase. An ensemble neural network (ENN) designed for monthly inflow forecasting was applied in Ref. [14] to predict inflows into the Daecheong Dam in Korea. The ENN combined the outputs of its member neural networks using the bagging method. The overall results showed that the ENN outperformed a simple artificial neural network (ANN) among the three rainfall-runoff models. Cannon and Whitfield [15] studied the use of ensemble neural network modeling in streamflow forecasting. Boucher et al. [16] used bagged multi-layer perceptrons for 1-day-ahead streamflow forecasting on three watersheds.

In general, the ensemble methods described in the published theoretical and applied papers are usually composed of weak predictors; e.g., decision trees or neural networks are commonly used as base predictors when building ensemble machine learning models. On the other hand, there are only a few works in which the ensemble is formed by a fusion of strong learners. The authors of the present paper assume that it is also important to examine ensembles based on strong learners, such as support vector machines, random forests or various other types of strong models, which in some cases may be ensembles themselves (composed of weak learners, e.g., various types of boosting methods).

A major goal of the analysis in this study is to precisely evaluate ensembles composed of various strong machine learning algorithms in comparison with the results achieved by the individual learners. The final prediction of the proposed ensemble is obtained by a weighted summation of the results of the individual learners. The specification of these weights is a particularly important step in building the ensemble model, and it is proposed to solve it with the help of the harmony search optimization methodology [17]. The harmony search methodology has been successfully applied to various optimization tasks, including in the area of hydrology and water resources management, e.g., [18, 19].
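To make the weighted summation concrete, the following is a minimal sketch of how ensemble weights might be tuned with a basic harmony search. It is an illustration under several assumptions rather than the procedure used in this study: the objective (RMSE against observed flows), the constraint that weights are nonnegative and sum to one, the parameter values (`hms`, `hmcr`, `par`, `bandwidth`, `iterations`) and all function names are hypothetical choices.

```python
import random

def ensemble_predict(weights, member_preds):
    """Weighted sum of member predictions; member_preds[k][t] is learner k at step t."""
    n_steps = len(member_preds[0])
    return [sum(w * p[t] for w, p in zip(weights, member_preds))
            for t in range(n_steps)]

def rmse(pred, obs):
    """Root mean squared error between two equally long sequences."""
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5

def normalize(w):
    """Rescale weights to sum to 1 (assumed constraint); uniform if all are zero."""
    s = sum(w)
    return [x / s for x in w] if s > 0 else [1.0 / len(w)] * len(w)

def harmony_search_weights(member_preds, obs, hms=10, hmcr=0.9, par=0.3,
                           bandwidth=0.05, iterations=2000, seed=0):
    """Basic harmony search over ensemble weights, minimizing RMSE (assumed objective)."""
    rng = random.Random(seed)
    k = len(member_preds)

    def cost(w):
        return rmse(ensemble_predict(w, member_preds), obs)

    # Harmony memory: hms random candidate weight vectors and their costs.
    memory = [normalize([rng.random() for _ in range(k)]) for _ in range(hms)]
    costs = [cost(w) for w in memory]
    for _ in range(iterations):
        new = []
        for j in range(k):
            if rng.random() < hmcr:
                value = rng.choice(memory)[j]        # memory consideration
                if rng.random() < par:               # pitch adjustment
                    value += rng.uniform(-bandwidth, bandwidth)
            else:
                value = rng.random()                 # random selection
            new.append(min(1.0, max(0.0, value)))    # keep each weight in [0, 1]
        new = normalize(new)
        new_cost = cost(new)
        worst = max(range(hms), key=costs.__getitem__)
        if new_cost < costs[worst]:                  # replace the worst harmony
            memory[worst], costs[worst] = new, new_cost
    best = min(range(hms), key=costs.__getitem__)
    return memory[best], costs[best]
```

A call such as `harmony_search_weights(member_preds, observed)` returns a weight vector and its RMSE on the data supplied. In practice the weights would be fitted on a calibration period and then evaluated on separate test data.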

In Section 2, the particular machine learning algorithms involved in this study are briefly explained, together with the ensemble methodologies used. Then, the data acquisition and preparation are presented. In Section 3, the settings of the experimental computations are described and the results are evaluated. Finally, Section 4 summarizes the main achievements and conclusions of the work and proposes ideas for future work in this area.
