Satellite Data and Supervised Learning to Prevent Impact of Drought on Crop…
Drought - Detection and Solutions
DOI: http://dx.doi.org/10.5772/intechopen.85471

### 1. Introduction

Climate change is shifting rainfall patterns and increasing the severity of droughts and floods around the Earth. Australia [1], Europe, and other continents have been affected by a number of major drought events [2]. In 2018, drought and heat waves reduced harvests by up to 40–50% in some countries of northern and central Europe [3].

Drought is by far the Earth's most costly natural disaster and can have widespread impacts [4]. Globally, it is responsible for 22% of the economic damage caused by natural disasters and 33% of the damage in terms of the number of people affected [5]. Though average yields rose steadily between 1947 and 2008, there is no evidence that relative stress tolerance has improved [6, 7]. Therefore, until breeding programs develop adapted germplasm, drought forecasting will be important to determine when to take contingency actions to prevent drought and mitigate its risks and impacts.


The practice of drought forecasting remains challenging and is subject to great uncertainty, partly due to the instability of the components of the hydrologic cycle (e.g., rainfall, soil moisture, and groundwater level). Temporal variability, involving trends, oscillating behavior, and sudden shifts in hydroclimatic records, poses further challenges to drought prediction [8, 9].

Although agricultural indices (AIs) better reflect the soil water conditions that influence crops, monitoring soil moisture is costly in terms of time and resources [10]. On the other hand, some meteorological indices (e.g., the Standardized Precipitation Index) can be calculated from precipitation (pp) data alone, from which an expert can closely infer the condition of the vegetation [11].

A variety of methods have been developed to predict drought occurrence: statistical run theory [12], Markov chains [13], loglinear models [14], renewal processes [15], and Poisson processes [16], among others.

A valuable alternative to the aforementioned methods is machine learning (ML), a branch of artificial intelligence that studies how to extract information from big data sets with minimal human intervention. ML has been successfully tested in very different areas such as bioinformatics [17], crop protection [18], and economics [19], among others. Therefore, its potential for predicting the climate seems far from being fully exploited.

The remainder of this chapter is organized as follows: in Section 2, we introduce some representative ML methods that have been proposed for drought forecasting; in Section 3, we present the concept of meteorological drought, in particular the standardized precipitation index that is considered a primary drought indicator. Section 4 describes some forecast examples using the abovementioned methods. In Sections 5 and 6, we review satellite precipitation products and how to access and process them; and finally, in Section 7 we present the conclusions of this work.

### 2. Machine learning

ML is the science of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. ML methods can be broadly categorized into supervised learning (SL) and unsupervised learning. In SL (classification or regression), the algorithm builds a function from a set of data relating the inputs to the outputs. In regression, the outputs are continuous, meaning they may take any value within a range (e.g., temperature and moisture), while in classification, the outputs are restricted to a limited set of values.

In unsupervised learning, the algorithm builds a mathematical model of a data set that contains inputs and no outputs. These unsupervised learning procedures are used to find structure in the data (e.g., cluster data) or reduce its dimensionality.

Examples of ML are land classification using remote sensing [20–22], amending satellite data assimilation [23], or decomposing the causes of climate change [24].

#### 2.1 Support vector regression (SVR) and least squares support vector regression (LS-SVR)

SVR is based on the Vapnik-Chervonenkis (VC) theory [25], which characterizes the properties of learning machines that enable them to generalize the unobserved data well.


Starting with the simplest example, that is, linear regression, the objective of both SVR [26] and LS-SVR [27] is to fit a linear relation $y = \mathbf{w}^T \mathbf{x} + b$ between the regressors x and the dependent variable y in the so-called feature space. In SVR, the problem is solved by minimizing

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \left(\xi_i + \xi_i^*\right) \tag{1}$$

under the constraints


$$\begin{aligned} y_i - \mathbf{w}^T \mathbf{x}_i - b &\le \varepsilon + \xi_i \\ \mathbf{w}^T \mathbf{x}_i + b - y_i &\le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* &\ge 0 \end{aligned} \tag{2}$$

while for LS-SVR, the objective is to minimize

$$\frac{1}{2}\|\mathbf{w}\|^2 + \gamma \sum_{i=1}^{n} e_i^2 \tag{3}$$

under the constraints

$$e_i = y_i - \mathbf{w}^T \mathbf{x}_i - b \tag{4}$$

Both methods are very similar, but in LS-SVR the objective is the more usual sum of squared errors, which replaces the ε-insensitive (or "ε-tube") loss of SVR, a loss that ignores all regression errors smaller than ε (Figure 1). Solving a nonlinear regression demands the "kernel trick" [26]: kernel functions transform the data from the input space into a higher-dimensional feature space in which a linear regression can be performed. Common kernels are

$$\text{polynomial} \, k(\mathbf{x}\_i, \mathbf{x}\_j) = \left(\mathbf{x}\_i \cdot \mathbf{x}\_j\right)^d \tag{5}$$

$$\text{Gaussian radial basis function } k(\mathbf{x}\_i, \mathbf{x}\_j) = \exp\left(-\frac{||\mathbf{x}\_i - \mathbf{x}\_j||^2}{2\sigma^2}\right) \tag{6}$$
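For illustration only (this snippet is not from the chapter), the two kernels above can be evaluated directly in NumPy; the sample vectors and the parameter values d = 2 and σ = 1 are arbitrary:

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    """Polynomial kernel, Eq. (5): k(xi, xj) = (xi . xj)^d."""
    return np.dot(xi, xj) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel, Eq. (6): exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.0])
print(polynomial_kernel(xi, xj))  # (1*3 + 2*0)^2 = 9.0
print(rbf_kernel(xi, xj))         # exp(-8 / 2) = exp(-4)
```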

LS-SVR is an economical alternative to the original SVR model: it relies only on a sum-of-squared-errors (SSE) cost function and equality constraints, which reduce training to solving a linear system instead of the computationally complex and time-consuming quadratic programming problem of SVR [28].

For optimal performance, parameter tuning is necessary [29]: for SVR, C, ε, and the kernel-related parameters (e.g., σ² for the RBF kernel); for LS-SVR, γ (the regularization parameter determining the trade-off between the fitting error and the smoothness of the estimated function) and the kernel-related parameters. For further information about SVR in general, the reader should refer to [30].
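To make the contrast with SVR's quadratic program concrete, here is a minimal pure-NumPy sketch (not the chapter's code) of linear LS-SVR: setting the gradient of Eq. (3) to zero under the equality constraints of Eq. (4) reduces training to a single (p+1)×(p+1) linear system, with 1/(2γ) acting as a ridge-like penalty on w. The function name and the toy data are assumptions for the example.

```python
import numpy as np

def lssvr_linear(X, y, gamma=10.0):
    """Fit linear LS-SVR: minimize 0.5*||w||^2 + gamma * sum(e_i^2)
    with e_i = y_i - w^T x_i - b. Setting the gradient to zero yields
    one (p+1)x(p+1) linear system instead of a quadratic program."""
    n, p = X.shape
    lam = 1.0 / (2.0 * gamma)              # effective ridge penalty on w
    A = np.zeros((p + 1, p + 1))
    A[:p, :p] = X.T @ X + lam * np.eye(p)
    A[:p, p] = X.sum(axis=0)
    A[p, :p] = X.sum(axis=0)
    A[p, p] = n
    rhs = np.concatenate([X.T @ y, [y.sum()]])
    sol = np.linalg.solve(A, rhs)
    return sol[:p], sol[p]                  # weight vector w, bias b

# Noisy linear toy data: y ≈ 2*x + 1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.05 * rng.normal(size=200)
w, b = lssvr_linear(X, y, gamma=100.0)
print(w, b)  # close to [2.0] and 1.0
```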

#### 2.2 Artificial neural network (ANN)

An ANN is a supervised learning model based on the operation of biological neurons. There are many architectures and training algorithms for ANNs. The multilayer perceptron network (MLPN), the most common ANN architecture used for forecasting, is a feedforward neural network with at least three layers of neurons arranged as a directed acyclic graph: an input layer, one or more hidden layers, and an output layer (Figure 2). The input layer receives the data vector x, while the output layer gives the output vector y. An activation function is applied to activate the neurons in the hidden layer. For a three-layer network, the nonlinear mapping between input x and output y is given by the equation:

$$y = f_2\left(\sum_{j=1}^{h} w_j \, f_1\left(\sum_{i=0}^{n} w_{ji} x_i\right)\right) \tag{7}$$

Of the activation functions, we should mention the hyperbolic tangent $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ and the sigmoid $f(x) = \frac{1}{1 + e^{-x}}$.

An ANN is usually learned by adjusting the weights and biases in order to minimize a cost function, usually MSE using the error back-propagation algorithm.
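As a self-contained sketch of this weight-and-bias adjustment (an assumed toy example, not the chapter's code), the loop below trains a tiny 1–8–1 network by gradient descent with hand-derived back-propagation, minimizing the MSE on the target y = x²; the network size, learning rate, and data are arbitrary choices:

```python
import numpy as np

# Toy demonstration: a 1-8-1 MLP trained by gradient descent
# (error back-propagation) to minimize the MSE on y = x^2.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 64)           # scalar inputs
y = x ** 2                               # targets

W1 = 0.5 * rng.normal(size=8)            # input -> hidden weights
b1 = np.zeros(8)
w2 = 0.5 * rng.normal(size=8)            # hidden -> output weights
b2 = 0.0
lr = 0.2

def forward(x):
    h = np.tanh(np.outer(x, W1) + b1)    # hidden activations, (64, 8)
    return h, h @ w2 + b2                # predictions, (64,)

_, pred0 = forward(x)
mse_initial = np.mean((pred0 - y) ** 2)

for _ in range(3000):
    h, pred = forward(x)
    d = (pred - y) / len(x)              # gradient of MSE/2 w.r.t. pred
    gw2 = h.T @ d
    gb2 = d.sum()
    dz = np.outer(d, w2) * (1 - h ** 2)  # back-propagate through tanh
    gW1 = x @ dz
    gb1 = dz.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    w2 -= lr * gw2; b2 -= lr * gb2

_, pred = forward(x)
mse_final = np.mean((pred - y) ** 2)     # far below mse_initial
```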

The number of hidden neurons is no less important, since a wrong number may cause either overfitting or underfitting. Normally it is selected via trial and error, but this is computationally costly. Several heuristics and formulas have been proposed to avoid this cumbersome work, and their success depends on the type of data, the complexity of the network architecture, etc. [31].

Last but not least, ANN forecasting models can be separated into two broad groups, namely, the recursive multistep neural network (RMSNN) and the direct multistep neural network (DMSNN) (Figure 2). In RMSNN, the model forecasts one time-step ahead, and the network is applied recursively, using previous predictions as inputs for subsequent forecasts; that is, a forecast horizon of 3 months will have, as inputs, the outputs of the forecasts with lead times of 1 and 2 months. Similar to the RMSNN model, the DMSNN approach has single or multiple neurons in both the input and hidden layers; however, it can have several neurons in the output layer, representing multiple-month lead-time forecasts. Like RMSNN, the DMSNN model forecasts drought conditions using the present index value and several months of past index values as inputs.
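The recursive scheme can be sketched as follows; the one-step model here is a stand-in fixed AR(2) rule (a hypothetical example, since any trained network returning a one-step-ahead forecast would plug in the same way):

```python
import numpy as np

def recursive_forecast(one_step_model, history, horizon):
    """RMSNN-style recursive forecasting: predict one step ahead,
    append the prediction to the inputs, and repeat. A 3-month
    horizon thus reuses the 1- and 2-month forecasts as inputs."""
    window = list(history)
    out = []
    for _ in range(horizon):
        yhat = one_step_model(window[-2:])  # uses the last 2 values
        out.append(yhat)
        window.append(yhat)                 # feed the prediction back in
    return out

# Stand-in one-step model: a fixed AR(2) rule (illustrative only).
ar2 = lambda lags: 0.6 * lags[-1] + 0.3 * lags[-2]
print(recursive_forecast(ar2, [1.0, 1.0], horizon=3))  # ≈ [0.9, 0.84, 0.774]
```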


Figure 1. (A) Kernel trick: mapping the data from the input space into a feature space. (B) Loss function used in support vector regression (ε-insensitive loss) and least squares support vector regression (quadratic).

Figure 2. Architectures of forecasting artificial neural networks: recursive multistep neural network versus direct multistep neural network.


#### 2.3 Deep belief networks (DBN)


ANNs are suitable for complex time series forecasting but have several weaknesses: (1) the selection of the initial values of the weights (normally at random) can affect the learning process, leading to slower convergence or to different forecast results for each training run; and (2) the training process may get stuck at local optima, especially in networks with several hidden layers. Hinton et al. [32] proposed a probabilistic generative model with multiple hidden layers that uses layer-wise unsupervised learning to pre-train the initial weights of the network and then fine-tunes the whole network using standard supervised methods such as the back-propagation algorithm.

Classically, a DBN is constructed by stacking multiple restricted Boltzmann machines (RBMs) on top of each other (Figure 3). The layers are trained by using the feature activations of one layer as the training data for the next layer. Better initial values of weights in all layers are obtained by greedy layer-wise unsupervised training, and the entire network is fine-tuned using an SL algorithm. Pre-training can be done with principal component analysis or nonlinear generalization [33].

An RBM [34] is a neural network model used for unsupervised learning. Typically, it consists of a single layer of hidden units (the outputs) with undirected and symmetrical connections to a layer of visible units (the data) (Figure 3). The configuration (bipartite graph) defines the state of each unit. Only connections between a hidden unit and a visible unit are permitted—that is, no connections between two visible units or between two hidden units are allowed. An RBM is a special type of generative energy-based model that is defined in terms of the energies of the configurations between visible and hidden units.

The standard type of RBM has binary-valued (Boolean/Bernoulli) hidden and visible units.
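A NumPy sketch (not from the chapter) of how such an RBM is typically trained: one step of contrastive divergence (CD-1) nudges the weights so that data configurations get lower energy than their reconstructions. The function name, network sizes, toy patterns, and learning rate are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM:
    v -> h -> v' -> h', then move the weights toward the data
    statistics and away from the reconstruction statistics."""
    p_h0 = sigmoid(v0 @ W + b_h)                 # hidden given data
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # sample hidden units
    p_v1 = sigmoid(h0 @ W.T + b_v)               # reconstruct visibles
    p_h1 = sigmoid(p_v1 @ W + b_h)
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)

n_visible, n_hidden = 6, 3
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

# Toy data: two binary patterns the RBM should learn to reconstruct.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
for _ in range(2000):
    for v in data:
        cd1_update(v, W, b_v, b_h)

recon = sigmoid(sigmoid(data @ W + b_h) @ W.T + b_v)  # mean-field reconstruction
```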

Figure 3. Basic deep belief network (DBN) structure with three hidden layers.

#### 2.4 Bagging

Bootstrap aggregating, or bagging, is an ML ensemble meta-algorithm designed to increase the stability and accuracy of unstable procedures, for example, artificial neural networks or decision trees [35]. Given a standard training set T of size n, the algorithm samples from T, uniformly and with replacement, m new training sets T′, each of size n′ (some observations may be repeated in each T′). This process is known as bootstrap sampling [36]. The basic idea is that the samples are de-correlated, which reduces the expected error as m increases.

Figure 4. Structure of bootstrap aggregating, or bagging.

The m models are fitted using the above m bootstrap samples, and results of an unknown instance are obtained by averaging the output (for regression) or by voting (for classification) (Figure 4).

This method may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because smaller training sets are used to train each algorithm.

Bagging does not necessarily improve forecast accuracy in all cases. Nevertheless, this method and its derivatives tend to outperform traditional forecasting procedures [37].
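A minimal sketch of the procedure (illustrative assumptions: the base learner is a one-dimensional least-squares line via `np.polyfit`, with m = 25 bootstrap replicates):

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_fit_predict(X, y, X_new, m=25):
    """Bagging for regression: draw m bootstrap samples (with
    replacement), fit one base learner per sample, and average
    their predictions. Base learner here: a simple 1-D linear fit."""
    preds = []
    for _ in range(m):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample T'
        slope, intercept = np.polyfit(X[idx], y[idx], deg=1)
        preds.append(slope * X_new + intercept)
    return np.mean(preds, axis=0)                     # average the outputs

X = np.linspace(0, 1, 50)
y = 3.0 * X + 1.0 + 0.1 * rng.normal(size=50)         # noisy line y = 3x + 1
yhat = bagging_fit_predict(X, y, np.array([0.0, 1.0]))
print(yhat)  # close to [1.0, 4.0]
```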

#### 2.5 Random forest regression (RFR)

A random forest (RF) [38] is a collection of K binary recursive partitioning trees, where each tree is grown on a subset of n instances extracted with replacement from the original training data. It is an instance of bagging where the individual learners are de-correlated trees. Each tree is grown in a top-down recursive manner, from the root node to the terminal nodes or leaves (Figure 5). In each node, a random sample of m (m ≈ p/3) predictors is chosen as candidates from the full set of p predictors. The data are partitioned into the two descendant branches by choosing the variable that minimizes:

$$\text{RSS} = \sum_{\text{left}} \left(y_i - \bar{y}_L\right)^2 + \sum_{\text{right}} \left(y_i - \bar{y}_R\right)^2 \tag{8}$$

where $\bar{y}_L$ and $\bar{y}_R$ are the mean responses of the left and right branches.
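The split criterion can be made concrete with a short exhaustive search over thresholds on a single predictor (an illustrative sketch under assumed toy data, not the chapter's implementation):

```python
import numpy as np

def best_split(x, y):
    """Exhaustive search for the split threshold on one predictor that
    minimizes RSS = sum_left (y_i - mean_L)^2 + sum_right (y_i - mean_R)^2."""
    best = (np.inf, None)
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = y[x <= t], y[x > t]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, t)
    return best                                  # (minimal RSS, threshold)

# Step function: the obvious split is at x = 4 (values jump after it).
x = np.arange(10.0)
y = np.where(x <= 4, 1.0, 5.0)
rss, threshold = best_split(x, y)
print(rss, threshold)  # → 0.0 4.0
```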

The advantage of selecting a random subset of predictors is that two trees generated on the same training data will be de-correlated (independent of each other), because randomly different variables were selected at each split. Each internal (non-leaf) node is assigned the predictor determined by the RSS test, and each one of the two possible subsets of this variable labels the arcs connecting to the subordinate decision nodes. Each tree extends as much as possible, until all the terminal nodes are maximally homogeneous (a minimum of five examples in each leaf is recommended).

Once the random forest is generated, the output of new data is obtained by averaging the predictions of the K trees.

The number of trees influences the prediction error: it decreases as the number of trees (ntree) grows, but there is a threshold beyond which there is no significant gain [38, 39]. In general, ntree ≈ 500 gives good results [40]. RF can successfully handle high dimensionality and multicollinearity, because it is both fast and insensitive to overfitting. It is, however, sensitive to the sampling design.

Figure 5. Architecture of the random forest model.

function in layer 2. As the number of parameters increases with the fuzzy rule increment, the model structure becomes more complicated. A very good description of ANFIS is presented in [43, 44].

#### 2.7 Boosting

Boosting attempts to increase the performance of a given learning algorithm by iteratively adjusting the weight of an observation based on the last training/testing process. In other words, the meta-algorithm produces a sequence of models by adaptive reweighting of the training set [45].

AdaBoost, the first boosting algorithm, is easily defeated by noisy data; its performance is highly affected by outliers, because the algorithm tries to fit every point perfectly. Friedman [46] extended the concept to gradient boosting, which constructs additive regression models by sequentially fitting a simple parameterized function (the base learner) to the current "pseudo"-residuals by least squares at each iteration. The pseudo-residuals are the gradient of the loss function being minimized, taken with respect to the model values at each training data point and evaluated at the current step. Each new model is added iteratively and the loss, which represents the difference between the actual and predicted values (the error residual), is recomputed; using this loss value, the predictions are updated to minimize the residuals.

A regularization method that penalizes various parts of the boosting algorithm is necessary to avoid overfitting; this generally improves the performance of the algorithm.

#### 2.8 Hybrid models

The time series that characterize the evolution of meteorological events (drought, precipitation) in the temporal domain have localized high- and low-frequency components with dynamic nonlinearity and non-stationary features. Single ML models have not always proven to be good at capturing the behavior of such time series, whereas hybrid models can perform superbly when forecasting hydrological and climatological time series. Different combination techniques have been proposed in order to overcome the deficiencies of single models and improve forecasting performance [47]. Many combined models have been introduced in the literature, for example, ANN-ARIMA [48], SVR-ARIMA [49], etc.

Here we will only focus on WT-ML hybrids, where ML is a machine learning method (e.g., ANN or SVR) and WT is a discrete wavelet transform [50].

#### 2.8.1 Wavelet transform (WT)

WT is a time-dependent spectral analysis that decomposes time series in the time-frequency space and provides a timescale illustration of processes and their relationships. In this method, the data series are broken down by transforming them into "wavelets," which are scaled and shifted versions of a mother wavelet [50]. This allows the use of long time intervals for low-frequency information and shorter intervals for high-frequency information, and it can reveal aspects of the data, such as tendencies, breakdown points, and discontinuities, that other signal analysis techniques (e.g., the Fourier transform) might miss.

There are two main alternatives for WT: the discrete wavelet transform (DWT) and the continuous wavelet transform (CWT). For DWT, the WT is applied using a discrete set of wavelet scales and shifts, whereas in CWT this scaling and shifting is continuous; as a consequence, CWT is computationally expensive.
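As a minimal, self-contained illustration of the DWT idea (an assumed example, not the chapter's code), one level of the Haar transform splits a series into a low-frequency approximation (scaled pairwise sums) and high-frequency detail (scaled pairwise differences), and the split is perfectly invertible:

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the orthonormal Haar DWT: scaled pairwise sums give
    the low-frequency approximation, scaled pairwise differences give
    the high-frequency detail. Input length must be even."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_idwt_level(approx, detail):
    """Inverse of one Haar level: perfectly reconstructs the signal."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt_level(x)
x_rec = haar_idwt_level(a, d)   # equals x up to floating-point error
```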
