250 Bio-Inspired Computational Algorithms and Their Applications

**Third**, the difference between the desired and the actual output is computed, producing the network error. This error term is then passed backwards through the network to adjust the connection weights.

Each network input unit takes a single numeric value, *xi*, which is usually scaled or normalized to a value between 0.0 and 1.0. This value becomes the input unit activation. Next, we need to propagate the data forward through the neural network. For each unit in the hidden layer, we compute the sum of the products of the input unit activations and the weights connecting those input layer units to the hidden layer. This sum is the inner product (also called the dot or scalar product) of the input vector and the weights in the hidden unit. Once this sum is computed, we add a threshold value and then pass this sum through a nonlinear activation function, *f*, producing the unit activation *yj*. The formula for computing the activation of any unit in a hidden or output layer in the network is

$$y\_j = f\left(sum\_j\right) = f\left(\sum\_i x\_i w\_{ij} + \theta\_j \right) \tag{34}$$

where *i* ranges over all the units leading into the *j*-th unit, and the activation function is

$$f\left(sum\_j\right) = \frac{1}{1 + e^{-sum\_j}} \tag{35}$$

As mentioned earlier, we use the *S*-shaped sigmoid or logistic function for *f*. The formula for calculating the changes of the weights is

$$\Delta w\_{ij} = \eta \delta\_j y\_i \tag{36}$$

where *wij* is the weight connecting unit *i* to unit *j*, η is the learning rate parameter, δ*j* is the error signal for that unit, and *yi* is the output or activation value of unit *i*. For units in the output layer, the error signal is the difference between the target output *tj* and the actual output *yj*, multiplied by the derivative of the logistic activation function:

$$\delta\_j = \left(t\_j - y\_j\right) f'\left(sum\_j\right) = \left(t\_j - y\_j\right) y\_j \left(1 - y\_j\right) \tag{37}$$

For each unit in the hidden layer, the error signal is the derivative of the activation function multiplied by the sum of the products of the outgoing connection weights and their corresponding error signals. So, for the hidden unit *j*,

$$\delta\_j = f'\left(sum\_j\right) \sum\_k \delta\_k w\_{jk} \tag{38}$$

where *k* ranges over the indices of the units receiving the *j*-th unit's output signal.

A common modification of the weight update rule is the use of a momentum term α, to cut down on oscillation of the weights. The weight change becomes a combination of the current weight change, computed as before, plus some fraction (α, ranging from 0 to 1) of the previous weight change. This complicates the implementation because we now have to store the weight changes from the prior step.

**5.3 Genetic Algorithms**
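Equations (34)-(38), together with the momentum term, can be collected into one training step. The following is a minimal numpy sketch for a single-hidden-layer network; all names and parameter values are illustrative, not the chapter's implementation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))                  # eq. (35)

def train_step(x, t, W1, b1, W2, b2, V1, V2, eta=0.5, alpha=0.9):
    """One backpropagation step with momentum (eqs. 34-38).
    V1, V2 hold the weight changes from the prior step."""
    # forward pass: eq. (34) for the hidden and output layers
    y_h = sigmoid(x @ W1 + b1)
    y_o = sigmoid(y_h @ W2 + b2)
    # output-layer error signal: eq. (37)
    d_o = (t - y_o) * y_o * (1.0 - y_o)
    # hidden-layer error signal: eq. (38)
    d_h = y_h * (1.0 - y_h) * (W2 @ d_o)
    # weight changes: eq. (36) plus the momentum fraction of the old change
    V2 = eta * np.outer(y_h, d_o) + alpha * V2
    V1 = eta * np.outer(x, d_h) + alpha * V1
    W2 += V2
    W1 += V1
    b2 += eta * d_o
    b1 += eta * d_h
    return W1, b1, W2, b2, V1, V2
```

Repeated calls with the same pattern drive the output toward the target; storing `V1` and `V2` between calls is exactly the extra bookkeeping the momentum term requires.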

In this section a brief description of a simple genetic algorithm is given. Genetic algorithms are based on concepts and methods observed in nature for the evolution of the species. Genetic algorithms were brought to the artificial intelligence arena by Goldberg [6], [24]. They apply certain operators to a population of solutions of the problem to be solved, in such a way that the new population is improved compared to the previous one according to a certain criterion function *J* [5], [1], [6], [24]. Repeating this procedure for a preselected number of iterations produces a final generation whose best solution is the optimal solution to the problem.

The solutions of the problem to be solved are coded in the *chromosome* and the following operations are applied to the coded versions of the solutions, in this order:

*Reproduction.* Ensures that, in probability, the better a solution in the current population is, the more replicates it has in the next population.

*Crossover.* Selects pairs of solutions randomly, splits them at a random position, and exchanges their second parts.

*Mutation.* Selects randomly an element of a solution and alters it with some probability. It helps to move away from local minima.

Besides the coding of the solutions, some parameters must be set up:


The performance of the GA depends greatly on these parameters, as well as on the coding of the solutions in the chromosome. The solutions can be coded in some of the following formats:

*Binary.* Bit strings represent the solution(s) of the problem. For instance, a chromosome could represent a series of integer indexes to address a database, or the value of a variable(s) that must be integer, or each bit could represent the state (present-absent) of a part of an architecture that is being optimized, and so on.

*Real valued.* The bit strings represent the value of a real-valued variable, in fixed or floating point.

One chromosome could look like this: C = {100101010101010101}; the interpretation will vary in accordance with the coding scheme selected to represent the knowledge domain of the problem. For instance, it might represent a set of six indices of three bits each; or all the bits together could have a single meaning, representing an 18-bit code.
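Either interpretation of such a bit string takes only a few lines of code. A small sketch (function names are illustrative):

```python
def as_indices(chrom, width=3):
    """Split a bit string into fixed-width integer indices."""
    return [int(chrom[i:i + width], 2) for i in range(0, len(chrom), width)]

def as_integer(chrom):
    """Interpret all the bits together as a single code."""
    return int(chrom, 2)

C = "100101010101010101"
print(as_indices(C))   # [4, 5, 2, 5, 2, 5] -- six 3-bit indices
print(as_integer(C))   # 152917 -- one 18-bit code
```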

Optimal Feature Generation with Genetic Algorithms and FLDR in a Restricted-Vocabulary Speech Recognition System 253

The primary reasons for the success of genetic algorithms are their wide applicability, ease of use, and global perspective [6], [24], [25]. A listing of a simple genetic algorithm follows.
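As a sketch of such a simple genetic algorithm, consider the following, which applies the reproduction, crossover, and mutation operators described in section 5.3 to bit-string chromosomes (all parameter values are illustrative, and the fitness function is assumed non-negative):

```python
import random

def genetic_algorithm(J, n_bits, pop_size=20, p_cross=0.8, p_mut=0.01,
                      generations=100, seed=1):
    """Maximize a non-negative criterion J over bit-string chromosomes."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Reproduction: fitness-proportional selection
        fit = [J(c) for c in pop]
        total = sum(fit)
        weights = [f / total for f in fit]
        pop = [list(rng.choices(pop, weights)[0]) for _ in range(pop_size)]
        # Crossover: split random pairs at a random position, swap tails
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                pop[i][cut:], pop[i + 1][cut:] = pop[i + 1][cut:], pop[i][cut:]
        # Mutation: flip each bit with a small probability
        for c in pop:
            for j in range(n_bits):
                if rng.random() < p_mut:
                    c[j] ^= 1
    return max(pop, key=J)
```

For example, with `J = sum` (the "onemax" criterion) the loop quickly drives the population toward the all-ones chromosome.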


Genetic algorithms find application in the field of speech processing through the solution of the variable and feature selection problem [11], [14], [26], [27], [28], [29], [30].

## **6. Restricted-vocabulary speech recognition system**

The expression restricted-vocabulary speech recognition refers to the recognition of repetitions of spoken words that belong to a limited set of words within a semantic field. This means that the words have connected meanings, for instance, the digits = {0,1,2,…, 9} or the days of the week = {Saturday, Monday,…, Friday}. The applications of the recognition of limited-size word sets include voice-commanded systems, spoken entry and search for computer databases in warehouse systems, voice-assisted telephone dialing, man-machine interfaces, and others. The advantage of a system developed for a specific semantic field is that it can be built to be much more accurate than those constructed for general speech recognition, while also requiring less extensive training sets.

Restricted-vocabulary speech recognition is also an important research topic because of the intricacies involved in the underlying pattern recognition problem: variable selection, feature generation/selection, and classifier selection. Variable selection is mostly restricted to selecting the raw digitized voice signal as the variable; alternatively, the surrounding environmental noise could be used as another variable for noise cancelation purposes. Feature generation has been carried out by obtaining linear predictive coefficients (LPC), AR/ARMA coefficients, Fourier coefficients, Cepstral coefficients, Mel Spectral Coefficients, and others [31]. In many cases, the coefficients are computed over a short-time window (typically 10 ms) and over the voiced segments of the speech signal. As for the classifier, several options are available: Hidden Markov Models (HMM, perhaps the most popular), neural networks (backpropagation, self-organizing maps, radial basis, etc.), support vector and other kernel machines, Gaussian mixtures, Bayesian-type classifiers, LVQ, and others. It should be noted that this list of choices for each element in the pattern recognition chain is by no means exhaustive. By simply considering all the combinations of variables, features, and classifiers mentioned here, it is easily seen that there exist many ways to implement a restricted-vocabulary speech recognition system.

Fig. 8. Block diagram showing the optimal feature generation with the GA and FLDR.

### **6.1 Methodology**


Figure 8 shows a block diagram of the whole speech recognition system. The system works in two phases: training and operation. The blocks of the training phase are described below.

*Lexicon Acquisition and Normalization.* In this stage, the set **L** = {**w0**,**w1**,**w2**,…, **wM-1**} of **M** words that will comprise the lexicon to be recognized is acquired from the speaker(s) that will use the system; vectors containing the digitized versions of the voice signals will be at the disposal of the next stage.

*Power Spectrum.* The power spectrum has traditionally been the source of features for speech recognition; here, the power spectrum of the voice signals will be used by the genetic algorithm to determine discriminant frequency bands.

*Optimal Feature Generation.* The features selected here are a) the energy **E** of eight to twelve frequency regions (sub-bands) of the spectrum, b) the bandwidth (BW) of each sub-band, and c) the central frequency (FC) of the sub-bands. See Figure 9. Sub-band processing in speech has been used previously, but in different manners [32], [33], [34]. A genetic algorithm with elitism is used here to select each bandwidth (*BW*), its central frequency (*Fc*), and the number of sub-bands. The main parameters of the genetic algorithm are listed in Table 1. The cost function of the GA is an expression aimed at maximizing Fisher's linear discriminant ratio (FLDR). The use of the FLDR in the cost function ensures


increasing both class separability between classes ω0 and ω1 and cluster compactness, where ω0 is the class of the word to be recognized and ω1 is the class of the remaining *M*-1 words. The FLDR was described in section 4.1, and the expression adopted here is equation (31):

$$J\_2 = \frac{\left| \mathbf{S}\_b \right|}{\left| \mathbf{S}\_w \right|} = \left| \mathbf{S}\_w^{-1} \mathbf{S}\_b \right| \tag{31}$$

where |*Sb*| is the determinant of the inter-class covariance matrix and |*Sw*| is the determinant of the intra-class covariance matrix. At the end of the evolutionary process the genetic algorithm produced a set of vectors with the parameters [*E, BW, Fc*] and the number of sub-bands, between 8 and 12.
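The criterion *J2* of equation (31) can be evaluated directly from training samples. A minimal numpy sketch for two classes follows (the function name is illustrative; note that for two classes the between-class scatter has rank one, so the determinant form is most informative for scalar features such as a single sub-band energy):

```python
import numpy as np

def fldr_j2(X0, X1):
    """Fisher's linear discriminant ratio J2 = |Sw^{-1} Sb| (eq. 31)
    for two classes; X0, X1 have shape (n_samples, n_features)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter: sum of the per-class covariance matrices
    Sw = np.atleast_2d(np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
    # between-class scatter: outer product of the mean difference
    d = np.atleast_2d(m0 - m1)
    Sb = d.T @ d
    return float(np.linalg.det(np.linalg.solve(Sw, Sb)))
```

Well-separated classes yield a larger *J2* than overlapping ones, which is exactly the property the GA's cost function exploits.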

Fig. 9. Sub-band spectrum division to separate the power spectra in discriminant regions.

*Bank-filter Design.* During the operation phase, calculating the power spectrum of the incoming voice signals is not practical for real-time response because the discrete Fourier transform (DFT [2], [3]) takes too much time to compute, and the fast Fourier transform (FFT [2], [3]) requires the number of samples to be a power of 2, requiring zero padding most of the time. Instead of using the power spectrum, this application uses Parseval's theorem, described in section 4.2, which states that the signal energy computed in the time and frequency domains is equal. In discrete form, we recall equation (32):

$$\sum\_{i=0}^{N-1} \mathbf{x}[i]^2 = \frac{2}{N} \sum\_{k=0}^{N/2} \left\| \mathbf{X}[k] \right\|^2 \tag{32}$$

where *x[i]* is the time domain signal and *X[k]* is its modified frequency spectrum, which is found by taking the DFT of the signal and dividing the first and last frequencies by the square root of two. Therefore, all that has to be done is to filter the sub-bands out of the time signals and then calculate the energy in each sub-band. This can be accomplished by a bank of digital band-pass filters whose parameters match the parameters found by the genetic algorithm, and the advantage is that at the end of the last sample of voice, in real time, a word can be immediately recognized.
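Equation (32) can be checked numerically. A short numpy sketch of the modified-spectrum energy computation, assuming an even number of samples *N* (names illustrative):

```python
import numpy as np

def energy_from_spectrum(x):
    """Energy of x via eq. (32): keep bins 0..N/2 of the DFT and divide
    the first and last frequencies by sqrt(2)."""
    N = len(x)                              # assumed even
    X = np.fft.fft(x)[: N // 2 + 1]
    X[0] /= np.sqrt(2.0)
    X[-1] /= np.sqrt(2.0)
    return (2.0 / N) * np.sum(np.abs(X) ** 2)

# the time-domain and frequency-domain energies coincide
rng = np.random.default_rng(1)
x = rng.normal(size=512)
print(np.allclose(energy_from_spectrum(x), np.sum(x ** 2)))  # True
```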

In the Operation Phase, the first three stages operate simultaneously each time a sample of the voice is acquired, and they occur between two successive sampling intervals. In the first stage the voice signal is acquired and a normalization coefficient *Nc* is updated with the maximum value of the signal; the block labeled filtering represents the action of the filter bank, and the outgoing signal is squared and added sample by sample. At the end of a voiced segment of sound, the third block provides a classifier with the set of features for word classification.

#### **6.2 System design**


| Parameter | Value |
|-----------|-------|
| Population | 20-120 |
| Selection | Elitism |
| Crossover | Scattered, one-point and two-points |
| Mutation | Gaussian (1-2, 1-2) (scale, shrink) |

Table 1. Parameters of the Genetic Algorithm.

In this section the details of the system design will be presented.

#### **6.2.1 Characterization of the frequency spectrum using sub-bands**

Consider the Fourier Transform of a signal:

$$S(\omega) = F\left(s\left(t\right)\right) \tag{40}$$

Now consider the frequency spectrum split in sub-bands, as shown in Figure 9. Notice in Figure 9 that *S(*ω*)* has been normalized to unitary amplitude. For each sub-band, the energy can be calculated as:

$$E\_i = k\_0 \sum\_{\rm lb}^{\rm ub} \left( \mathbf{S} \left( \boldsymbol{\omega} \right)^2 \right) \tag{41}$$

where *Ei* is the energy of the *i*-th sub-band, *k0* is a constant proportional to the bandwidth (*BWi*) of the *i*-th sub-band, and *lb* and *ub* are the lower and upper bounds of the *i*-th sub-band. The feature vector of the *n*-th utterance of *s(t)* in the frequency domain becomes:

$$\mathbf{Sr}\_n(\omega) = \begin{bmatrix} \mathbf{E}\_1 \ \mathbf{E}\_2 \ \mathbf{E}\_3 \ \mathbf{E}\_4 \cdots \mathbf{E}\_M \end{bmatrix} \tag{42}$$

where *M* is the number of sub-bands and *Srn* is the reduced version of the *n*-th *S(*ω*)*. In order to characterize a word in the vocabulary, *N* samples of *s(t)* must be entered. From the example in Figure 9 and without loss of generality, the set of parameters of the respective filter bank that will operate in the time domain is:

$$BF = \begin{bmatrix} \mathbf{C}\_1 \ \mathbf{B} \mathbf{W}\_1 \ \mathbf{C}\_2 \ \mathbf{B} \mathbf{W}\_2 \ \mathbf{C}\_3 \ \mathbf{B} \mathbf{W}\_3 \cdots \mathbf{C}\_L \ \mathbf{B} \mathbf{W}\_L \end{bmatrix} \tag{43}$$

The number of filters to be applied to the time-domain signal in this case is *L*. For a vocabulary of *K* words (*K* small), the space of spectra in which the solution must be searched is given by *K* matrices of *N* rows and 2000 columns.
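The sub-band energies of equations (41) and (42) reduce to sums over slices of the spectrum. A minimal numpy sketch, with the band bounds given as DFT bin indices (names and the toy spectrum are illustrative):

```python
import numpy as np

def subband_features(S, bands, k0=1.0):
    """Feature vector Sr_n (eq. 42): energy E_i of each sub-band (eq. 41).
    S: normalized magnitude spectrum; bands: list of (lb, ub) bin bounds."""
    return np.array([k0 * np.sum(S[lb:ub + 1] ** 2) for lb, ub in bands])

# toy spectrum with two concentrated regions of energy
S = np.zeros(100)
S[10:15] = 1.0
S[60:62] = 0.5
bands = [(0, 29), (30, 59), (60, 99)]
print(subband_features(S, bands))   # energies [5.0, 0.0, 0.5]
```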

#### **6.2.2 Genetic algorithm set up**

#### **6.2.2.1 Coding the chromosome**

The chromosome is comprised of the parameters of the filter bank described in equation 43. The number of sub-bands is fixed, and each of the centers and bandwidths is subject to the genetic algorithm. Real numbers are used.

#### **6.2.2.2 The cost function**

The cost function is given by equation (31), criterion *J2*. The goal is to maximize *J2* as a function of the centers and bandwidths.

#### **6.2.2.3 Restrictions**

The following restrictions apply:

*R1*. Sub-band overlapping <= 50 Hz,

*R2*. Bandwidth is limited to the range from 40 to 400 Hz, varying according to the performance of the genetic algorithm.

#### **6.2.2.4 Operating parameters of the genetic algorithm**

The main parameters of the genetic algorithm are summarized in Table 1. The values of the parameters are given according to the best results obtained by experimentation. The genetic algorithm was run in Matlab®, using the genetic algorithms toolbox and the *gatool* guide. The nomenclature of the parameters in Table 1 is the one used by Matlab®.

#### **6.2.2.5 Application's algorithm**

To make the methodology described so far operational, the following steps apply:

	- a. Construct a matrix *C* of 15-20 rows of *Srn(*ω*)* and *M* columns (one row per sample of the command, one column per sub-band selected by the GA),
	- b. Compute the mean value over all the samples to find the average energy per sub-band; this is the feature vector of the command (μi),
	- c. Compute the covariance matrix, Σi = cov(*C*i),
	- d. At any time, the Mahalanobis distance between the model of the *i*-th command and the feature vector *x* of an incoming command is:

$$d\_M(\mathbf{x},\mu\_i) = \left(\mathbf{x}-\mu\_i\right)^T \Sigma\_i^{-1} \left(\mathbf{x}-\mu\_i\right) \tag{44}$$

To make the real-time implementation, a digital system must sample the input microphone continuously. Each sample of a command must be filtered by each filter (sub-band extraction). The integral of the energy of the signal leaving each filter over the period of the command is used to create the feature vector of that command, that is, *x* in equation 44. The feature vector is compared to each model in the vocabulary, and the command is recognized as the one with the minimum *dM* score. This is the so-called "minimum distance classifier" [5].

## **7. Results**

To test the system, the following lexicons were used: *L0* = {faster, slower, left, right, stop, forward, reverse, brake}, *L1* = {zero, one, two, three, four, five, six, seven, eight, nine}, *L2* = {rápido, lento, izquierda, derecha, alto, adelante, reversa, freno}, *L3* = {uno, dos, tres, cuatro, cinco}. In all the lexicons, 3 male and 3 female volunteers were enrolled. They donated 116 samples of each word, 16 for training and 100 for testing. To demonstrate the power of our approach we used the minimum distance classifier with the Mahalanobis distance. In all the cases, the genetic algorithm was run 30 times to find the best response in the training set. During the training phase, the leave-one-out method was used to exploit the limited size of the training set [1]. Table 2 summarizes the results. Columns 5 and 7 show the comparison against a backpropagation neural network using Cepstral coefficients as features. The experiments were done using Matlab® and its associated toolboxes for genetic algorithms, neural networks, and digital signal processing. The real-time implementation was done with a TMS320LF2407 Texas Instruments® Digital Signal Processor mounted on an experimentation card.

| G1 | L2 | Training Set | Testing Set (Simulations) MDC3 | Testing Set (Simulations) BPNN4 | Testing Set (Real-time on DSP) MDC3 | Testing Set (Real-time on DSP) BPNN4 |
|----|----|--------------|--------------------------------|---------------------------------|-------------------------------------|--------------------------------------|
| Female | 0 | 100 | 97 | 92 | 94 | 90 |
| Female | 1 | 100 | 98 | 90 | 95 | 90 |
| Female | 2 | 100 | 100 | 97 | 94 | 89 |
| Female | 3 | 100 | 100 | 91 | 95 | 88 |
| Male | 0 | 100 | 100 | 94 | 96 | 89 |
| Male | 1 | 100 | 100 | 90 | 95 | 90 |
| Male | 2 | 100 | 99 | 92 | 94 | 90 |
| Male | 3 | 100 | 98 | 92 | 93 | 88 |

1Gender, 2Lexicon, 3Minimum distance classifier, 4Backpropagation neural network [1], 8-32-K neurons per layer, K according to the experiment, one neuron for each word in **L**.

Table 2. Percentage (%) of correct classification with 4 lexicons, 2 languages, 6 persons, male and female voices, 2 classifiers. Simulations and real-time implementation.
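Steps a-d of section 6.2.2.5 and the minimum distance classifier can be sketched with numpy (names illustrative; the chapter's system uses 15-20 samples per command and 8-12 sub-bands):

```python
import numpy as np

def train_command_model(C):
    """Steps a-c: C has one row per training sample (Sr_n) and one
    column per sub-band; returns the mean vector and covariance."""
    mu = C.mean(axis=0)               # step b: average energy per sub-band
    Sigma = np.cov(C, rowvar=False)   # step c: covariance matrix
    return mu, Sigma

def mahalanobis(x, mu, Sigma):
    """Step d, eq. (44): d_M = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

def classify(x, models):
    """Minimum distance classifier: pick the command with smallest d_M."""
    return min(models, key=lambda i: mahalanobis(x, *models[i]))
```

For example, training one model per command from its feature matrix and calling `classify` on an incoming feature vector returns the label of the nearest command model in the Mahalanobis sense.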
