**3. Neural network optimization methods**

#### **3.1 Gradient back-propagation algorithm**

An Artificial Neural Network (ANN) with randomly initialized weights usually shows poor results. That is why the most interesting characteristic of a neural network is its ability to learn, in other words, to adjust the weights of its connections according to the training data so that, after the training phase, the ability to generalize is obtained. This formulation turns the problem of learning into a problem of optimization.

In general, optimizing a system's parameters for a given task requires defining a metric that captures the inadequacy of the system for that task. This measure is called the cost function. The goal is then to use an optimization algorithm to find the parameters of the neural model that minimize this cost.

For this kind of problem, there are two important classes of cost-minimization algorithms. Classical or gradient-based algorithms, whose central concept is the direction of descent, rely on the derivatives of the cost function and of any constraints, and take advantage of the specific information provided by the derivatives of different orders of these functions.

The alternative to these approaches is the use of heuristics or meta-heuristics such as genetic, stochastic, or evolutionary algorithms. Despite the notable advantage of not assuming any regularity and their ability to locate the global minimum, these algorithms are strongly penalized by their relatively low convergence speed and long computation times.

In the case of algorithms using the information provided by the derivatives of the functions defining the problem, each iteration comprises two main phases: the search for a direction of descent *dk* and the determination of a descent step *η*, the iterate update being given by the following formula:

$$\mathbf{x}\_{k+1} = \mathbf{x}\_k + \eta \, d\_k \tag{2}$$

The difference between these algorithms lies in the way these two steps are performed.

Such algorithms also fall into three classes according to the strategy used to compute the direction of descent (a short numerical sketch of these three directions follows the list):

1. gradient algorithms such as:

$$d\_k = -\nabla f(\mathbf{x}\_k) \tag{3}$$


2. algorithms based on Newton's method, in which the direction of descent is the solution of the system

$$
\nabla^2 f(\mathbf{x}\_k) d\_k = -\nabla f(\mathbf{x}\_k) \tag{4}
$$

3. quasi-Newton algorithms, in which an approximation *Hk* of the Hessian matrix evaluated at the iterates is built, the direction then being, as for Newton's method, the solution of the linear system

$$H\_k d\_k = -\nabla f(\mathbf{x}\_k) \tag{5}$$
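To make these three strategies concrete, here is a minimal sketch in Python with NumPy (an illustrative choice, not a tool used in the chapter) that computes one update of Eq. (2) with each type of direction on a simple quadratic cost; the cost function, starting point and step size are hypothetical.

```python
import numpy as np

# Hypothetical quadratic cost f(x) = 0.5 * x^T A x - b^T x, chosen only for illustration
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):            # gradient of f
    return A @ x - b

def hess(x):            # Hessian of f (constant for a quadratic)
    return A

x_k = np.array([5.0, -3.0])
eta = 0.1               # descent step of Eq. (2)

# 1. gradient direction (Eq. 3)
d_grad = -grad(x_k)

# 2. Newton direction: solve Hessian * d = -gradient (Eq. 4)
d_newton = np.linalg.solve(hess(x_k), -grad(x_k))

# 3. quasi-Newton direction: H_k is only an approximation of the Hessian (Eq. 5);
#    here it is crudely taken as the identity, which reduces to the gradient direction
H_k = np.eye(2)
d_qn = np.linalg.solve(H_k, -grad(x_k))

x_next = x_k + eta * d_grad   # iterate update of Eq. (2)
```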

The gradient back-propagation algorithm is the most widely used for weight adaptation; its goal is to find the combination of connection weights that minimizes the error function *E* defined by:

$$E = \frac{1}{2} \sum\_{k} \left( y\_k - yr\_k \right)^2 \tag{6}$$

*yk* and *yrk* being, respectively, the desired output and the actual output of neuron *k* for a given input vector.

This procedure is based on an extension of the Delta rule, which performs a gradient descent and consists in propagating an observation presented at the input of the neural network through the layers to obtain the output values.

Compared to the desired outputs, the resulting errors allow the weights of the output neurons to be adjusted. Without a hidden layer, knowing these errors allows a direct calculation of the gradient and makes the adjustment of the weights of these single neurons easy, as shown by the Delta rule. For a network with hidden layers, however, the desired outputs of the hidden neurons are unknown, so the errors of these neurons cannot be determined directly. As it stands, this process therefore cannot be used to adjust the weights of hidden neurons. The intuition that solves this difficulty and gave rise to back-propagation is the following: the activity of a neuron is linked to the neurons of the preceding layer. Thus, the error of an output neuron is due to the hidden neurons of the previous layer in proportion to their influence, that is, according to their activations and the weights that connect the hidden neurons to the output neuron. We therefore seek to obtain the contributions of the *L* hidden neurons that produced the error of the output neuron *k*.

The back-propagation procedure consists in propagating the error gradient (the error produced during the propagation of an input vector) through the network. In this phase, the error of an output neuron is propagated backwards, from the output layer towards the hidden neurons.

It is therefore sufficient to retrace the original activation path backwards, starting from the errors of the output neurons, to obtain the error of all the neurons in the network. Once the corresponding error for each neuron is known, the weight adaptation relationships can be obtained.
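As an illustration, here is a minimal sketch of this procedure for a network with one hidden layer, written in Python/NumPy; the layer sizes, sigmoid activation and learning rate are hypothetical choices made for the example, not parameters taken from the chapter.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)

# Hypothetical dimensions: 3 inputs, 4 hidden neurons, 2 outputs
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
W2 = rng.normal(size=(2, 4))     # hidden -> output weights
eta = 0.5                        # learning rate

x = np.array([0.2, -0.7, 1.0])   # one input vector
y = np.array([1.0, 0.0])         # desired outputs y_k

# Forward propagation through the layers
h  = sigmoid(W1 @ x)             # hidden activations
yr = sigmoid(W2 @ h)             # actual outputs yr_k

# Error function E = 1/2 * sum_k (y_k - yr_k)^2   (Eq. 6)
E = 0.5 * np.sum((y - yr) ** 2)

# Back-propagation: errors of the output neurons ...
delta_out = (yr - y) * yr * (1.0 - yr)            # dE/d(net) at the output layer
# ... are sent back to the hidden neurons in proportion to the connecting weights
delta_hid = (W2.T @ delta_out) * h * (1.0 - h)

# Gradient descent on the weights (Delta rule extended to both layers)
W2 -= eta * np.outer(delta_out, h)
W1 -= eta * np.outer(delta_hid, x)
```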

#### **3.2 Second order optimization method**

Another class of methods, more sophisticated than the previous one, relies on second-order algorithms derived from Newton's method, which adapts the weights according to the following relation:

$$
\Delta w = -H^{-1} \nabla E \tag{7}
$$

where the element *Hij* of the Hessian matrix *H* corresponds to the second partial derivatives of the cost function with respect to the weights. The elements of this matrix are defined by

$$H\_{ij} = \frac{\partial^2 E}{\partial w\_i \partial w\_j} \tag{8}$$
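As a purely illustrative sketch (Python/NumPy, with a hypothetical scalar cost over a small weight vector), the Newton update of Eq. (7) can be written as follows; the Hessian elements of Eq. (8) are approximated here by finite differences of the gradient, which is a didactic stand-in rather than the way the chapter computes them.

```python
import numpy as np

# Hypothetical cost E(w) over a small weight vector, used only for illustration
def cost(w):
    return np.sum((w - np.array([1.0, -2.0, 0.5])) ** 4) + 0.5 * np.sum(w ** 2)

def grad(w, eps=1e-5):
    # numerical gradient of E
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (cost(w + e) - cost(w - e)) / (2 * eps)
    return g

def hessian(w, eps=1e-5):
    # H_ij = d^2 E / (dw_i dw_j), approximated by finite differences of the gradient (Eq. 8)
    n = w.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(w); e[j] = eps
        H[:, j] = (grad(w + e) - grad(w - e)) / (2 * eps)
    return H

w = np.array([3.0, 0.0, -1.0])
# Newton weight update: delta_w = -H^{-1} * grad(E)   (Eq. 7)
delta_w = -np.linalg.solve(hessian(w), grad(w))
w = w + delta_w
```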

Like gradient-only methods, second-order methods determine the gradient by the back-propagation algorithm and generally approximate the Hessian matrix or its inverse, since the cost of its computation may quickly become prohibitive.

This type of method locates, in a single iteration, the minimum of a quadratic empirical error criterion, and requires several iterations when this criterion is not ideally quadratic.

In practice, the convergence of the corresponding algorithm towards an optimal solution is rapid, since a good number of error hyper-surfaces present a quadratic curvature in the immediate vicinity of their minima. Nevertheless, when the error hyper-surfaces are complex, this method remains prone to convergence towards non-minimal solution points where the gradient of the empirical error criterion vanishes, namely inflection points or saddle points. In addition, the algorithm may diverge when the Hessian matrix is not positive definite.

The evaluation and memorization of the inverse Hessian matrix, on which the second-order methods are based, is, however, a major handicap in the context of learning large networks.

The main drawback, however, lies in the calculation of the second derivatives of *E*, which is most often expensive and very difficult to carry out. A number of algorithms propose to get around this difficulty by using approximations of the Hessian matrix.

This approximation, which is at the basis of the Gauss-Newton and Levenberg–Marquardt algorithms, is widely used in the identification of rheological parameters. The method is especially suited to problems of small dimension, for which the computation of the Hessian matrix is easy. If the problem presents a large number of variables, it is generally advised to couple it with the conjugate gradient method or a quasi-Newton method, or to switch automatically to the conjugate gradient method when the relative improvement in the objective function becomes too low.

The Levenberg–Marquardt method, another second-order method very close to the Newton method described previously, offers an interesting alternative by adjusting the weights as follows:

$$
\Delta w = -\left[ H + \mu I \right]^{-1} \nabla E \tag{9}
$$

*μ* is the Levenberg–Marquardt parameter and *I* is the identity matrix.

This method, making a compromise between the gradient direction and Newton's method, has the particularity of adapting to the shape of the error surface. Indeed, for low values of *μ*, the Levenberg–Marquardt method approaches Newton's method, while for large values of *μ*, the algorithm simply follows the gradient direction; the parameter *μ* is automatically updated according to the convergence of each iteration.
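The following minimal sketch (Python/NumPy) illustrates the update of Eq. (9) together with this adaptation of *μ*; the cost function, its analytic derivatives and the halving/doubling factors for *μ* are hypothetical choices made only for illustration, not values from the chapter.

```python
import numpy as np

# Hypothetical cost E(w) and its derivatives, chosen only to illustrate the update rule
def cost(w):
    return (w[0] - 1.0) ** 4 + 10.0 * (w[1] + 2.0) ** 2

def grad(w):
    return np.array([4.0 * (w[0] - 1.0) ** 3, 20.0 * (w[1] + 2.0)])

def hess(w):
    return np.array([[12.0 * (w[0] - 1.0) ** 2, 0.0],
                     [0.0, 20.0]])

w = np.array([4.0, 3.0])
mu = 1e-2                      # Levenberg-Marquardt parameter
I = np.eye(2)

for _ in range(50):
    # Eq. (9): delta_w = -[H + mu*I]^{-1} * grad(E)
    delta_w = -np.linalg.solve(hess(w) + mu * I, grad(w))
    if cost(w + delta_w) < cost(w):
        w = w + delta_w        # convergent iteration: accept the step, trust the Newton-like direction more
        mu *= 0.5
    else:
        mu *= 2.0              # divergent iteration: reject the step, move towards the gradient direction
```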

Stabilization is possible thanks to a reiterative process, i.e. if an iteration diverges, it can be started again by increasing the parameter *μ* until a convergent iteration is obtained. However, the phenomenon of strong divergence when approaching the optimum, inherent in Newton's method, is in no way suppressed here. At most, the divergence can be reduced.

Despite the interesting properties of this method, calculating the inverse of (*H* + *μI*) makes its use tricky for heavy neural networks. As a result, as with Newton's method, it is advisable to switch automatically to the conjugate gradient method when this divergence phenomenon appears. Second-order methods greatly reduce the number of iterations, but increase the computation time.

#### **3.3 Heuristic optimization methods**


The advantage of heuristic optimization methods is the ability to minimize non-differentiable cost functions, even for a large number of parameters (1 < *n* < 10<sup>5</sup>).

Among the effective methods, we distinguish Particle Swarm Optimization, introduced by Kennedy and Eberhart and improved by Clerc, an agent-based optimization technique essentially inspired by the social behavior of flocks of birds and schools of fish.
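As a purely illustrative sketch (Python/NumPy), a basic particle swarm minimizing a hypothetical cost function can be written as follows; the swarm size, inertia weight and acceleration coefficients are common textbook values, not parameters taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(w):
    # Hypothetical cost to minimize (e.g. the error E of a neural controller)
    return np.sum((w - 1.5) ** 2) + np.sum(np.sin(3.0 * w) ** 2)

dim, n_particles, n_iter = 5, 20, 100
w_inertia, c1, c2 = 0.7, 1.5, 1.5            # inertia weight and acceleration coefficients

pos = rng.uniform(-5.0, 5.0, size=(n_particles, dim))   # particle positions
vel = np.zeros_like(pos)                                  # particle velocities
pbest = pos.copy()                                        # best position found by each particle
pbest_cost = np.array([cost(p) for p in pos])
gbest = pbest[np.argmin(pbest_cost)]                      # best position found by the swarm

for _ in range(n_iter):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    # velocity update: inertia + attraction to personal best + attraction to global best
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    costs = np.array([cost(p) for p in pos])
    improved = costs < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
    gbest = pbest[np.argmin(pbest_cost)]
```

Note that the loop uses only evaluations of the cost function, never its derivatives, which is precisely what makes the approach applicable to non-differentiable criteria.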

In addition, genetic and evolutionary algorithms constitute stochastic optimization techniques inspired by Darwin's theory of evolution; they are now widely used in numerical and combinatorial optimization, when the functions to be optimized are complex, irregular or poorly known.

These heuristic methods differ from the methods presented previously (Levenberg–Marquardt, Newton, Conjugate Gradient, ...) by three main aspects:

1. they do not require the gradient calculation,

2. they study a population as a whole, while deterministic methods treat an individual who will evolve towards the optimum,

3. they involve random operations.

Experience has also shown that if the components as well as the evolution parameters are carefully tuned, it is possible to obtain extremely efficient and fast algorithms. However, this adjustment step can be very delicate and constitutes a drawback of the implementation of these methods.
