#### 4.11 Kohonen self-organizing map (KSOM)

KSOM maps the input data onto a two-dimensional discrete output map by clustering similar patterns. It consists of two interconnected layers, namely a multidimensional input layer and a competitive output layer with 'w' neurons (Figure 5).

Figure 4. A three layer feed-forward ANN model [7].

Figure 5. Kohonen self organizing map [40].

Each node or neuron 'i' (i = 1, 2, …, w) is represented by an n-dimensional weight or reference vector wi = [wi1, …, win]. The 'w' nodes can be ordered so that similar neurons are located together and dissimilar neurons are located far apart on the map. The topology of the network is defined by the number of output neurons and their interconnections. The general network topology of the KSOM is either a rectangular or a hexagonal grid. The number of neurons (map size), w, may vary from a few dozen up to several thousand, and it affects the accuracy and generalization capability of the KSOM. The optimum number of neurons (w) can be determined by the following equation [41]:

$$w = 5\sqrt{N} \tag{11}$$

where N = total number of data samples or records. Once 'w' is known, the number of rows and columns in the KSOM can be determined as:

$$\frac{l_{1}}{l_{2}} = \sqrt{\frac{e_{1}}{e_{2}}} \tag{12}$$

where l1 and l2 = number of rows and columns, respectively; e1 = largest eigenvalue of the training data set; e2 = second largest eigenvalue.
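
To make Eqs. (11) and (12) concrete, the following minimal Python sketch derives the map size and grid shape from a training set. The helper name `ksom_grid_shape` and the use of the data covariance matrix for the eigenvalues are illustrative assumptions, not part of the cited procedure.

```python
import numpy as np

def ksom_grid_shape(data):
    """Heuristic KSOM sizing following Eqs. (11) and (12) (illustrative sketch)."""
    N = data.shape[0]
    w = int(round(5 * np.sqrt(N)))                 # Eq. (11): total number of neurons

    # Ratio of the two largest eigenvalues fixes the row/column ratio, Eq. (12).
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
    ratio = np.sqrt(eigvals[0] / eigvals[1])       # l1 / l2

    l2 = max(int(round(np.sqrt(w / ratio))), 1)    # number of columns
    l1 = int(round(w / l2))                        # number of rows, so that l1 * l2 is close to w
    return l1, l2
```

For instance, a 5-year daily record (N = 1825) would give w ≈ 5√1825 ≈ 214 neurons, whose arrangement into rows and columns then follows the eigenvalue ratio.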

### 4.12 Training the KSOM

The KSOM is trained iteratively: initially, the weights are randomly assigned. When an n-dimensional input vector x is presented to the network, the Euclidean distance between the input and each of the 'w' weight (prototype) vectors of the SOM is computed by:

$$\left\| \mathbf{x} - \mathbf{w} \right\| = \sqrt{\sum_{i=1}^{n} \left( x_{i} - w_{i} \right)^{2}} \tag{13}$$

where xi = ith element of the data sample or input vector x; wi = ith element of the prototype (weight) vector w; ‖ ‖ denotes the Euclidean distance.

The best matching unit (BMU), also called the 'winning neuron', is the neuron whose weight vector most closely matches the input. The learning process takes place between the BMU and its neighboring neurons at each training iteration 't', with the aim of reducing the distance between the weights and the input:

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \alpha(t)\, h_{lm} \left( \mathbf{x} - \mathbf{w}(t) \right) \tag{14}$$

where α = learning rate; l and m = positions of the winning neuron and its neighboring output nodes; hlm = neighborhood function of the BMU l at iteration t.

The most commonly used neighborhood function is the Gaussian which is expressed as:

$$h_{lm} = \exp\left( -\frac{\left\| l - m \right\|^{2}}{2\sigma(t)^{2}} \right) \tag{15}$$

where ‖l − m‖ = distance between neurons l and m on the map grid; σ(t) = width of the topological neighborhood at iteration t.
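
The update in Eqs. (13)–(15) can be sketched in a few lines of Python. The function below is an assumed illustration rather than the chapter's own code, and a full implementation would also decay α(t) and σ(t) over the iterations.

```python
import numpy as np

def ksom_step(weights, x, alpha, sigma, grid):
    """One KSOM training step, Eqs. (13)-(15) (illustrative sketch).

    weights: (w, n) prototype vectors; x: (n,) input sample;
    grid: (w, 2) row/column coordinates of each neuron on the map.
    """
    # Eq. (13): Euclidean distance from x to every prototype; the BMU l is the closest.
    d = np.linalg.norm(x - weights, axis=1)
    l = np.argmin(d)

    # Eq. (15): Gaussian neighbourhood of the BMU on the map grid.
    h = np.exp(-np.sum((grid - grid[l]) ** 2, axis=1) / (2.0 * sigma ** 2))

    # Eq. (14): move every prototype towards x, weighted by h and the learning rate.
    return weights + alpha * h[:, None] * (x - weights)
```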

The training steps are repeated until convergence. After the KSOM network is constructed, homogeneous regions, that is, clusters, are defined on the map. The performance of the trained KSOM network is evaluated using two errors, namely the total topographic error (te) and the quantization error (qe).

The topographic error, te, is an indication of the degree of preservation of the topology of the data when fitting the map to the original data set.

$$t_{e} = \frac{1}{N} \sum_{i=1}^{N} u(x_{i}) \tag{16}$$

where u(xi) = binary integer such that it is equal to 1 if the first and second best matching units of the map are not adjacent units; otherwise it is zero.

The quantization error, qe, is an indication of the average distance between each data vector and its BMU at convergence, that is, the quality of the map fitting to the data.

$$q_{e} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_{i} - w_{li} \right\| \tag{17}$$

where wli = prototype vector of the best matching unit for xi.
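
A minimal sketch of Eqs. (16) and (17) is given below. The helper `som_quality` is hypothetical, and 'adjacent' is taken here to mean neighbouring cells on a rectangular grid (within one row and one column), which should be adapted to the actual map topology used.

```python
import numpy as np

def som_quality(data, weights, grid):
    """Topographic error, Eq. (16), and quantization error, Eq. (17) (sketch)."""
    te_hits, qe_sum = 0, 0.0
    for x in data:
        d = np.linalg.norm(x - weights, axis=1)
        first, second = np.argsort(d)[:2]            # best and second-best matching units
        if np.abs(grid[first] - grid[second]).max() > 1:
            te_hits += 1                             # u(x) = 1: the two BMUs are not adjacent
        qe_sum += d[first]                           # distance of x to its BMU
    N = len(data)
    return te_hits / N, qe_sum / N                   # (te, qe)
```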

#### 4.13 Type of ANN training algorithms

Training basically involves feeding training samples as input vectors through a neural network, calculating the error at the output layer, and then adjusting the weights of the network to minimize the error. There are different methods for adjusting the weights, and these methods are called "training algorithms". The objective of a training algorithm is to minimize the difference between the predicted output values and the measured output values [6]. Common training algorithms are: (i) the gradient descent with momentum backpropagation (GDM) algorithm, (ii) the Levenberg-Marquardt (LM) algorithm, (iii) the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm, (iv) the resilient backpropagation (RBP) algorithm, (v) the conjugate gradient algorithm, (vi) the one-step secant (OSS) algorithm, (vii) the cascade correlation (CC) algorithm, and (viii) the Bayesian regularization (BR) algorithm. The training algorithms used in this study are briefly described below.

#### 4.14 Gradient descent with momentum back propagation (GDM) algorithm

This method uses back-propagation to calculate the derivatives of the performance (cost) function with respect to the weight and bias variables of the network. Each variable is adjusted according to gradient descent with momentum. The equation used to update the weights and biases is given by:


$$\Delta w_{ji}(n) = \alpha \, \Delta w_{ji}(n-1) + \eta \, \frac{\partial E}{\partial w_{ji}} \tag{18}$$

where Δwji(n) = correction applied to the synaptic weight connecting neuron i to neuron j; α = momentum; η = learning-rate parameter; E = error function. The equation is known as the generalized delta rule and this is probably the simplest and most common way to train a network [37].
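
As an illustration of Eq. (18), the sketch below applies one momentum update. The function and its arguments are assumptions made for this example, and the correction is applied along the negative gradient so that the error E actually decreases.

```python
def gdm_update(w, delta_w_prev, grad_E, alpha=0.9, eta=0.01):
    """One gradient-descent-with-momentum update in the spirit of Eq. (18) (sketch).

    w: current weights; delta_w_prev: previous correction Δw(n-1);
    grad_E: ∂E/∂w at the current weights; alpha: momentum; eta: learning rate.
    """
    delta_w = alpha * delta_w_prev - eta * grad_E    # momentum term plus (negative) gradient term
    return w + delta_w, delta_w                      # updated weights and the new correction
```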

#### 4.15 Levenberg-Marquardt (LM) algorithm

This method is a modification of the classic Newton algorithm for finding an optimum solution to a minimization problem. In particular, the LM algorithm utilizes the so-called Gauss-Newton approximation, which keeps the Jacobian matrix and discards second-order derivatives of the error. The LM algorithm interpolates between the Gauss-Newton algorithm and the method of gradient descent. To update the weights, the LM algorithm uses an approximation of the Hessian matrix:

$$W_{k+1} = W_{k} - \left[ J^{T} J + \lambda I \right]^{-1} J^{T} e \tag{19}$$

where W = weights; e = errors; I = identity matrix; λ = learning parameter; J = Jacobian matrix (first derivatives of the errors with respect to the weights and biases); J<sup>T</sup> = transpose of J; J<sup>T</sup>J = approximation of the Hessian matrix. For λ = 0 the algorithm becomes the Gauss-Newton method, while for very large λ the LM algorithm becomes the steepest descent algorithm. The 'λ' parameter governs the step size and is automatically adjusted (based on the direction of the error) at each iteration in order to secure convergence. If the error decreases between weight updates, then 'λ' is decreased by a factor of λ<sup>−</sup>; conversely, if the error increases, then 'λ' is increased by a factor of λ<sup>+</sup>. The factors λ<sup>−</sup> and λ<sup>+</sup> are defined by the user. The LM training process converges quickly as the solution is approached, because the Hessian does not vanish at the solution. The LM algorithm has large computational and memory requirements and hence can only be used in small networks. It is often characterized as more stable and efficient, and it is faster and less easily trapped in local minima than other optimization algorithms [37].
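
A single LM update, Eq. (19), can be sketched as follows. The adaptive adjustment of λ by the factors λ<sup>−</sup> and λ<sup>+</sup> described above is left to the surrounding training loop, and the function name and array shapes are assumptions for this illustration.

```python
import numpy as np

def lm_step(w, J, e, lam):
    """One Levenberg-Marquardt weight update, Eq. (19) (illustrative sketch).

    w: (p,) flattened weights and biases; J: (m, p) Jacobian of the m errors;
    e: (m,) current errors; lam: damping/learning parameter λ.
    """
    JTJ = J.T @ J                                    # Gauss-Newton approximation of the Hessian
    step = np.linalg.solve(JTJ + lam * np.eye(JTJ.shape[0]), J.T @ e)
    return w - step                                  # W(k+1) = W(k) - (JᵀJ + λI)⁻¹ Jᵀ e
```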

#### 4.16 Online and batch modes of training

On-line learning updates the weights after the presentation of each exemplar. In contrast, batch learning updates the weights after the presentation of the entire training set. When the training data are highly redundant, the online mode is able to take advantage of this redundancy and provides effective solutions to large and difficult problems. On the other hand, the batch mode of training provides an accurate estimate of the gradient vector; convergence to a local minimum is thereby guaranteed under simple conditions [23].
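
The difference between the two modes amounts to where the weight update sits relative to the loop over exemplars, as the sketch below illustrates; `grad_fn` is an assumed user-supplied function that returns ∂E/∂w for whatever sample(s) it is given.

```python
def train_online(w, samples, targets, grad_fn, eta=0.01):
    """Online mode: update the weights after every exemplar (sketch)."""
    for x, t in zip(samples, targets):
        w = w - eta * grad_fn(w, x, t)              # gradient estimated from a single sample
    return w

def train_batch(w, samples, targets, grad_fn, eta=0.01):
    """Batch mode: update the weights once per presentation of the whole training set (sketch)."""
    return w - eta * grad_fn(w, samples, targets)   # gradient accumulated over all samples
```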

#### 4.17 Multiple linear regression (MLR)

The MLR technique attempts to model the relationship between two or more explanatory (independent) variables and a response (dependent) variable by fitting a linear equation to the observed data. The general form of an MLR model is given as [42]:

$$Y_{i} = \beta_{0} + \beta_{1} X_{1,i} + \beta_{2} X_{2,i} + \dots + \beta_{k} X_{k,i} + \varepsilon_{i} \tag{20}$$

where Yi = ith observation of the dependent variable Y; X1,i, X2,i, ⋯, Xk,i = ith observations of the independent variables X1, X2, ⋯, Xk, respectively; β0, β1, β2, ⋯, βk = fixed (but unknown) parameters; εi = random error that is normally distributed.

The task of regression modeling is to estimate the unknown parameters (β0, β1, β2, ⋯, βk) of the MLR model [Eq. (20)]. The pragmatic form of the statistical regression model, obtained after applying the least-squares method, is as follows [42]:

$$Y_{i} = b_{0} + b_{1} X_{1,i} + b_{2} X_{2,i} + \dots + b_{k} X_{k,i} + e_{i} \tag{21}$$

where i = 1, 2, …, n; b0, b1, b2, ⋯, bk = estimates (unstandardized regression coefficients) of β0, β1, β2, ⋯, βk, respectively; ei = estimated error (or residual) for the ith observation.

Therefore, the estimate of Y is:

$$\hat{Y}_{i} = b_{0} + b_{1} X_{1,i} + b_{2} X_{2,i} + \dots + b_{k} X_{k,i} \tag{22}$$

The difference between the observed Y and the estimated Ŷ is called the residual (or residual error).
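
The least-squares estimation of Eqs. (20)–(22) and the residuals can be sketched as follows; `fit_mlr` is a hypothetical helper built on NumPy's least-squares solver rather than any specific package used in the chapter.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares fit of the MLR model, Eqs. (20)-(22) (sketch).

    X: (n, k) matrix of independent variables; y: (n,) dependent variable.
    Returns the coefficients b0..bk, the fitted values, and the residuals.
    """
    A = np.column_stack([np.ones(len(y)), X])       # prepend a column of ones for b0
    b, *_ = np.linalg.lstsq(A, y, rcond=None)       # unstandardized regression coefficients
    y_hat = A @ b                                   # Eq. (22): estimated Y
    return b, y_hat, y - y_hat                      # residuals e_i = Y_i - Y_hat_i
```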

The purpose of developing MLR models is to establish a simple equation that is easy to use and interpret. MLR modeling is very useful, especially in the case of limited field data. Moreover, it is versatile, as it can accommodate any number of independent variables [43].

#### 4.18 The FAO-56 Penman-Monteith method

The FAO-56 PM method is recommended as the standard method for estimating ETo for locations where measured lysimeter data are not available. The equation for the estimation of daily ETo can be written as [3]:

$$ET_{o} = \frac{0.408\,\Delta\,(R_{n} - G) + \gamma\,\frac{900}{T + 273}\,W_{s}\,(e_{s} - e_{a})}{\Delta + \gamma\,(1 + 0.34\,W_{s})} \tag{23}$$

where ETo = reference evapotranspiration calculated by the FAO-56 PM method (mm day<sup>−1</sup>); Rn = daily net solar radiation (MJ m<sup>−2</sup> day<sup>−1</sup>); Δ = slope of the saturation vapor pressure versus air temperature curve (kPa °C<sup>−1</sup>); es and ea = saturation and actual vapor pressure (kPa), respectively; T = average daily air temperature (°C); G = soil heat flux (MJ m<sup>−2</sup> day<sup>−1</sup>); Ws = daily mean wind speed (m s<sup>−1</sup>); γ = psychrometric constant (kPa °C<sup>−1</sup>).

The ETo values obtained from the above equation are used as target data for the ANN due to the unavailability of lysimeter-measured values.

## 5. Methodology

For the purpose of this study, 15 different climatic locations distributed over four agro-ecological regions (AERs) are selected. The selected locations are Parbhani, Kovilpatti, Bangalore, Solapur, and Udaipur (semi-arid); Anantapur and Hissar (arid); Raipur, Faizabad, Ludhiana, and Ranichauri (sub-humid); and Palampur, Jorhat, Mohanpur, and Dapoli (humid). Daily climate data of Tmin, Tmax, RHmin, RHmax, Ws, and Sra for a period of 5 years beginning January 1, 2001 are used.

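The target ETo values for such daily records are obtained by evaluating Eq. (23) in Section 4.18. As a minimal sketch, the function below assumes that Δ, γ, Rn, G, es and ea have already been derived from the raw climate variables using the usual FAO-56 procedures, which are not reproduced here.

```python
def eto_fao56_pm(delta, gamma, Rn, G, T, Ws, es, ea):
    """Daily reference evapotranspiration ETo (mm/day) from Eq. (23) (illustrative sketch).

    delta: slope of the saturation vapour pressure curve (kPa/degC); gamma: psychrometric
    constant (kPa/degC); Rn, G: net radiation and soil heat flux (MJ/m2/day);
    T: mean daily air temperature (degC); Ws: wind speed (m/s); es, ea: vapour pressures (kPa).
    """
    numerator = 0.408 * delta * (Rn - G) + gamma * (900.0 / (T + 273.0)) * Ws * (es - ea)
    denominator = delta + gamma * (1.0 + 0.34 * Ws)
    return numerator / denominator
```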