248 Bio-Inspired Computational Algorithms and Their Applications

energetic properties in the time domain on-line, in real time. The *link* between both domains is the energetic content of the signal.

**5. Algorithmic foundations**

This section is devoted to describing two important figures in pattern recognition: *backpropagation neural networks* (BPNN) and *genetic algorithms* (GA). The BPNN is used as a reference classifier against which the performance of the approach presented here is compared on the word recognition problem. The GA is an integral part of the generation of the features in the proposed technique.

**5.1 Learning paradigms**

There are several major paradigms, or approaches, to machine learning. These include supervised, unsupervised, and reinforcement learning. In addition, many researchers and application developers combine two or more of these learning approaches into one system [23].

*Supervised learning* is the most common form of learning and is sometimes called programming by example. The learning system is trained by showing it examples of the problem state or attributes along with the desired output or action. The learning system makes a prediction based on the inputs and, if the output differs from the desired output, the system is adjusted or adapted to produce the correct output. This process is repeated over and over until the system learns to make accurate classifications or predictions. Historical data from databases, sensor logs, or trace logs is often used as the training or example data.

*Unsupervised learning* is used when the learning system needs to recognize similarities between inputs or to identify features in the input data. The data is presented to the system, and it adapts so that it partitions the data into groups. The clustering or segmenting process continues until the system places the same data into the same group on successive passes over the data. An unsupervised learning algorithm performs a type of feature detection, where important common attributes in the data are extracted.

*Reinforcement learning* is a type of supervised learning used when explicit input/output pairs of training data are not available. It can be used in cases where there is a sequence of inputs and the desired output is only known after the specific sequence occurs. This process of identifying the relationship between a series of input values and a later output value is called temporal credit assignment. Because we provide less specific error information, reinforcement learning usually takes longer than supervised learning and is less efficient. However, in many situations, having exact prior information about the desired outcome is not possible. In many ways, reinforcement learning is the most realistic form of learning.

Another important distinction in learning systems is whether the learning is done on-line or off-line. On-line learning means that the system is sent out to perform its tasks and that it can learn or adapt after each transaction is processed. On-line learning is like on-the-job training and places severe requirements on the learning algorithm: it must be very fast and very stable. Off-line learning, on the other hand, is more like a business seminar: the system is trained on a collected body of data before being put to work.

**5.2 Backpropagation Neural Networks**

Backpropagation is the most popular neural network architecture for *supervised learning*. It features a *feed-forward* connection topology, meaning that data flows through the network in a single direction, and uses a technique called the *backward propagation* of errors to adjust the connection weights (Rumelhart, Hinton, and Williams, 1986, in [23]). In addition to a layer of input units and a layer of output units, a backpropagation network can have one or more layers of hidden units, which receive inputs only from other units, and not from the external environment. A backpropagation network with a single hidden layer of processing units can learn to model any continuous function when given enough units in the hidden layer. The primary applications of backpropagation networks are prediction and classification.
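As a concrete illustration of this feed-forward topology, the following minimal Python sketch propagates an input vector through one hidden layer and one output layer. The 2-2-1 layout and all weight and threshold values are arbitrary choices for illustration, not values from the chapter:

```python
import math

def sigmoid(s):
    # Logistic activation: f(s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, weights, thresholds):
    """One feed-forward pass through a fully connected layer:
    each unit j computes f(sum_i x[i] * weights[i][j] + thresholds[j])."""
    n_out = len(thresholds)
    return [sigmoid(sum(x[i] * weights[i][j] for i in range(len(x)))
                    + thresholds[j])
            for j in range(n_out)]

# Hypothetical 2-2-1 network: two inputs, two hidden units, one output unit.
w_hidden = [[0.5, -0.4],   # weights[i][j]: input i -> hidden j
            [0.3,  0.8]]
theta_hidden = [0.1, -0.2]
w_out = [[0.7], [-0.6]]    # hidden -> single output unit
theta_out = [0.05]

h = forward([0.9, 0.1], w_hidden, theta_hidden)  # hidden activations
y = forward(h, w_out, theta_out)                 # network output
```

Each unit applies the logistic function of equation (35) to the weighted sum plus threshold of equation (34); data only flows left to right, which is what makes the topology feed-forward.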

Figure 7 shows the diagram of a backpropagation neural network and illustrates the three major steps in the training process.

Fig. 7. Topology of a backpropagation neural network.

**First**, input data is presented to the units of the input layer on the left, and it flows through the network until it reaches the network output units on the right. This is called the forward pass.

**Second**, the activations or values of the output units represent the actual or predicted output of the network. Because this is supervised learning, the desired output for each training example is known.

**Third**, the difference between the desired and the actual output is computed, producing the network error. This error term is then passed backwards through the network to adjust the connection weights.

Each network input unit takes a single numeric value, $x_i$, which is usually scaled or normalized to a value between 0.0 and 1.0. This value becomes the input unit activation. Next, we need to propagate the data forward through the neural network. For each unit in the hidden layer, we compute the sum of the products of the input unit activations and the weights connecting those input layer units to the hidden layer. This sum is the inner product (also called the dot or scalar product) of the input vector and the weights of the hidden unit. Once this sum is computed, we add a threshold value and then pass the result through a nonlinear activation function, $f$, producing the unit activation $y_j$. The formula for computing the activation of any unit $j$ in a hidden or output layer of the network is

$$y\_j = f\left(sum\_j\right) = f\left(\sum\_{i} x\_i w\_{ij} + \theta\_j\right) \tag{34}$$

where $i$ ranges over all the units leading into the $j$-th unit, and the activation function is

$$f\left(sum\_{j} \right) = \frac{1}{1 + e^{-sum\_{j}}} \tag{35}$$

As mentioned earlier, we use the S-shaped sigmoid or logistic function for $f$. The formula for calculating the changes of the weights is

$$\Delta w\_{ij} = \eta \delta\_j y\_i \tag{36}$$

where $w_{ij}$ is the weight connecting unit $i$ to unit $j$, $\eta$ is the learning-rate parameter, $\delta_j$ is the error signal for that unit, and $y_i$ is the output or activation value of unit $i$. For units in the output layer, the error signal is the difference between the target output $t_j$ and the actual output $y_j$, multiplied by the derivative of the logistic activation function:

$$\delta\_{j} = \left(t\_{j} - y\_{j}\right) f'\_{j}\left(sum\_{j}\right) = \left(t\_{j} - y\_{j}\right) y\_{j} \left(1 - y\_{j}\right) \tag{37}$$

For each unit in the hidden layer, the error signal is the derivative of the activation function multiplied by the sum of the products of the outgoing connection weights and their corresponding error signals. So for hidden unit $j$,

$$\delta\_{j} = f'\_{j}\left(sum\_{j}\right) \sum\_{k} \delta\_{k} w\_{jk} \tag{38}$$

where $k$ ranges over the indices of the units receiving the $j$-th unit's output signal.

A common modification of the weight update rule is the use of a momentum term $\alpha$ to cut down on oscillation: the weight change becomes a combination of the current weight change, computed as before, plus some fraction ($\alpha$ ranges from 0 to 1) of the previous weight change. This complicates the implementation because we now have to store the weight changes from the prior step.

$$\Delta w\_{ij}\left(n + 1\right) = \eta \delta\_{j} y\_i + \alpha \Delta w\_{ij}\left(n\right) \tag{39}$$

The mathematical basis for backward propagation is described in detail in [23]. When the weight changes are summed up (or batched) over an entire presentation of the training set, the error minimization function performed is called gradient descent.

**5.3 Genetic Algorithms**

In this section a brief description of a simple genetic algorithm is given. Genetic algorithms are based on concepts and methods observed in nature for the evolution of the species. Genetic algorithms were brought to the artificial intelligence arena by Goldberg [6], [24]. They apply certain operators to a population of solutions of the problem to be solved, in such a way that the new population is improved compared to the previous one according to a certain criterion function *J* [5], [1], [6], [24]. Repetition of this procedure for a preselected number of iterations will produce a last generation whose best solution is the optimal solution to the problem.

The solutions of the problem to be solved are coded in the *chromosome*, and the following operations are applied to the coded versions of the solutions, in this order:

*Reproduction.* Ensures that, in probability, the better a solution in the current population is, the more replicates it has in the next population.

*Crossover.* Selects pairs of solutions randomly, splits them at a random position, and exchanges their second parts.

*Mutation.* Randomly selects an element of a solution and alters it with some probability. It helps to move away from local minima.

Besides the coding of the solutions, some parameters must be set up:

*p*, probability with which two solutions are selected for crossover.

*m*, probability with which an element of a solution is mutated.

*N*, number of solutions in a population. Fixed or varied.

The performance of the GA depends greatly on these parameters, as well as on the coding of the solutions in the chromosome. The solutions can be coded in some of the following formats:

*Binary.* Bit strings represent the solution(s) of the problem. For instance, a chromosome could represent a series of integer indexes to address a database, or the value of a variable (or variables) that must be integer, or each bit could represent the state (present or absent) of a part of an architecture that is being optimized, and so on.

*Real valued.* The bit strings represent the value of a real-valued variable, in fixed or floating point.

The aspect of one chromosome could be like this: C = {100101010101010101}; the interpretation will vary in accordance with the coding scheme selected to represent the knowledge domain of the problem. For instance, it might represent a set of six indices of three bits each; or it could have a meaning with all the bits together, representing an 18-bit code.
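The three operators described above, applied in order over a fixed number of iterations, can be sketched as a simple genetic algorithm. The following Python sketch is a toy instance: the criterion function *J* (counting 1-bits, the classic "OneMax" toy problem), the parameter values for *N*, *p*, and *m*, and the 18-bit chromosome length are illustrative assumptions, not the configuration used in the chapter:

```python
import random

random.seed(7)  # fixed seed so the run is reproducible

# Illustrative criterion J: number of 1-bits in the chromosome (OneMax).
def J(chrom):
    return sum(chrom)

def reproduce(pop):
    # Fitness-proportional (roulette-wheel) selection: better solutions
    # get, in probability, more replicates in the next population.
    fits = [J(c) for c in pop]
    if sum(fits) == 0:
        return [list(random.choice(pop)) for _ in pop]
    return [list(random.choices(pop, weights=fits, k=1)[0]) for _ in pop]

def crossover(pop, p):
    # Pair solutions, split each pair at a random position with
    # probability p, and exchange their second parts.
    for a, b in zip(pop[::2], pop[1::2]):
        if random.random() < p:
            cut = random.randrange(1, len(a))
            a[cut:], b[cut:] = b[cut:], a[cut:]
    return pop

def mutate(pop, m):
    # Flip each bit with probability m; helps escape local minima.
    for c in pop:
        for i in range(len(c)):
            if random.random() < m:
                c[i] = 1 - c[i]
    return pop

N, length, p, m = 20, 18, 0.8, 0.01
pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(N)]
for _ in range(50):  # preselected number of iterations
    pop = mutate(crossover(reproduce(pop), p), m)
best = max(pop, key=J)
```

Note that `reproduce` copies each selected chromosome (`list(...)`) so that replicates of the same parent can later be crossed over and mutated independently; without the copy, the replicates would alias one list.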
