Interest in speech recognition is due, partly, to the fact that it is capable of enabling many practical applications in artificial intelligence, such as natural language understanding [9], man-machine interfaces, help for the impaired, and others; on the other hand, it is an intriguing intellectual challenge in which new mathematical methods for feature generation and new, more sophisticated classifiers appear nearly every year [10], [11]. Practical problems that arise in the implementation of speech recognition algorithms include real-time requirements, the need to lower the computational complexity of the algorithms, and noise cancelation in general or specific environments [12]. Speech recognition can be user dependent or user independent.

A specific case of speech recognition is word recognition, aimed at recognizing isolated words from a continuous speech signal; it finds applications in commanding systems such as wheelchairs, TV sets, industrial machinery, computers, cell phones, toys, and many others. A particularity of this speech processing niche is that the vocabulary is usually comprised of a relatively small number of words; for instance, see [13] and [14].

In this chapter we present an innovative method for the restricted-vocabulary speech recognition problem in which a genetic algorithm is used to optimally generate the design parameters of a set of filter banks, searching the frequency domain for a specific set of sub-bands and using Fisher's linear discriminant ratio as the class separability criterion in the feature space. In this way we use genetic algorithms to create optimum feature spaces in which the patterns from **N** classes will be distributed in distant and compact clusters. In our context, each class {ω0, ω1, ω2,…, ωN-1} represents one word of the lexicon. Another important aspect of this work is that the algorithm is required to run in real time on dedicated hardware, not necessarily a personal computer or similar platform, so the algorithm developed should have low computational requirements.

This chapter is organized as follows: section 2 presents the main ideas behind the concepts of variable and feature selection; section 3 presents an overview of the most representative speech recognition methods. Section 4 is devoted to explaining some of the mathematical foundations of our method, including the Fourier transform, Fisher's linear discriminant ratio and Parseval's theorem. Section 5 presents our algorithmic foundations, namely genetic algorithms and backpropagation neural networks, a powerful classifier used here for performance comparison purposes. The implementation of our speech recognition approach is depicted in section 6 and, finally, the conclusions and future work are drawn in section 7.

#### **2. Optimal variable and feature selection**

Feature selection refers to the problem of selecting the features that are most predictive of a given outcome. Optimal feature generation, in contrast, refers to the derivation of features from input variables that are optimal in terms of class separability in the feature space. Optimal feature generation is of particular relevance to pattern recognition problems because it is the basis for achieving high correct classification rates: the better the discriminant features are represented, the better the classifier will categorize new incoming patterns. Feature generation determines how the patterns lie in the feature space and therefore shapes the decision boundary of every pattern recognition problem; linear as well as non-linear classifiers benefit from well-shaped feature spaces.

#### **2.1 Methods for variable and feature selection and generation**

The methods for variable and feature selection are based on two approaches: the first considers the features as scalars (*scalar feature selection*), and the other considers the features as vectors (*feature vector selection*). In both approaches a class separability measurement criterion must be adopted; some criteria include the receiver operating characteristic (ROC) curve, the Fisher discriminant ratio (FDR) and the one-dimensional divergence [1]. The goal is to select a subset of *k* out of a total of *K* variables or features. In the sequel, the term *features* is used to refer to both variables and features.

#### **2.1.1 Scalar feature selection**

The first step is to choose a class separability measuring criterion, C(k). The value of the criterion C(k) is computed for each of the available features, and the features are then ranked in descending order of their C(k) values. The *k* features corresponding to the *k* best values are selected to form the feature vector. This approach is simple, but it does not take into consideration existing correlations between the features.
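As a small illustration of this procedure, the sketch below ranks the features of a two-class problem by the Fisher discriminant ratio; the function names and the two-class restriction are simplifying assumptions made only for this example.

```python
import numpy as np

def fdr_score(x_a, x_b):
    """Fisher discriminant ratio of one scalar feature for a two-class problem."""
    return (x_a.mean() - x_b.mean()) ** 2 / (x_a.var() + x_b.var())

def scalar_feature_selection(X_a, X_b, k):
    """Rank the K available features by the criterion and keep the k best (sketch).

    X_a, X_b: arrays of shape (n_samples, K) with the samples of each class.
    Returns the indices of the selected features, best first.
    """
    scores = np.array([fdr_score(X_a[:, j], X_b[:, j]) for j in range(X_a.shape[1])])
    return np.argsort(scores)[::-1][:k]
```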

#### **2.1.2 Vector feature selection**

Scalar feature selection may not be effective when the features exhibit high mutual correlation; another disadvantage is that if one wishes to examine all possible combinations of the features (in the spirit of optimality), the computational burden becomes a major limiting factor. In order to reduce the complexity, some suboptimal procedures have been suggested [1]:

*Sequential Backward Selection.* The following steps comprise this method:

a. Start with the full set of *K* features and compute the value of the criterion for it.

b. Eliminate one feature at a time, compute the criterion for each of the resulting (*K*−1)-dimensional feature vectors, and keep the combination with the best value.

c. From the combination retained in the previous step, again eliminate one feature at a time and keep the best combination; repeat until only *k* features remain.

The number of computations can be calculated from: 1 + 1/2((K+1)K − *k*(*k*+1)).
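For instance, reducing *K* = 10 available features to a subset of *k* = 4 requires 1 + 1/2(11·10 − 4·5) = 46 evaluations of the criterion, considerably fewer than the 210 evaluations needed to test all 4-feature subsets exhaustively.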


*Sequential Forward Selection.* The reverse of the previous method is as follows:

a. Compute the criterion value for each individual feature; select the feature with the "best" value.

b. From all possible two-dimensional vectors that contain the winner from the previous step, compute the criterion value for each of them and select the best one.
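A minimal sketch of this greedy procedure for a two-class problem is shown below; `criterion` stands for any class separability measure (for example the FDR above), and the names are illustrative assumptions.

```python
import numpy as np

def sequential_forward_selection(X_a, X_b, k, criterion):
    """Greedy forward selection of k feature indices for a two-class problem (sketch).

    X_a, X_b: sample matrices of shape (n_samples, K) for the two classes;
    criterion(A, B): class separability score of a candidate subset (larger is better).
    """
    selected = []
    remaining = list(range(X_a.shape[1]))
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in remaining:
            trial = selected + [j]
            score = criterion(X_a[:, trial], X_b[:, trial])
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```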


#### **2.1.3 Floating search methods**

The methods explained above suffer from the *nesting* effect, which means that once a feature (or variable) has been discarded it cannot be reconsidered, and, conversely, once a feature (or variable) has been chosen it cannot be discarded. To overcome these problems, a technique known as the *floating search method* was introduced by Pudil and others in 1994 [1], allowing the features to enter and leave the set of the *k* chosen features. There are two ways to implement this technique: one springs from the forward selection rationale and the other from the backward selection rationale. A three-step procedure is used, namely *inclusion*, *test*, and *exclusion.* Details of the implementation can be found in [1], [16].

#### **2.1.4 Some trends in feature selection**

Recent work in feature selection includes, for instance, that of Somol *et al.* [17], where, besides optimally selecting a subset of features, the size of the subset is also optimally selected. Sun and others [18] faced the problem of feature selection in the presence of a huge number of irrelevant features, using machine learning and numerical analysis methods without making any assumptions about the underlying data distributions. In other works, a feature selection technique is accompanied by instance selection; instance selection refers to the "orthogonal version of the problem of feature selection" [19], involving the discovery of a subset of instances that will provide the classifier with a better predictive accuracy than using the entire set of instances in each class.

#### **2.2 Optimal feature generation**

As can be seen from section 2.1, the class separability measuring criterion in feature selection is used just to measure the effectiveness of the *k* features chosen out of a total of *K* features, independently of how the features were generated. Optimal feature generation, in contrast, involves the class separability criterion as an integral part of the feature generation process itself. The task can be expressed as: if *x* is an *m*-dimensional vector of measurement samples, transform it into another, *l*-dimensional vector *y* so that some class separability criterion is optimized. Consider, to this end, the linear transformation $y = A^{T}x$.

For now, it will suffice to note the difference between *feature selection* and *feature generation*.

#### **3. Speech recognition**

For speech processing, the electrical signal obtained from an electro-mechano-acoustic transducer is digitized and quantized at a fixed rate (the sampling frequency, *Fs*), and subsequently segmented into small frames with a typical duration of 10 milliseconds. In the terms of section 2, the raw digitized values of the voice signal will be considered here as the input variable; the mathematical transformations applied to this variable will produce the *features*.
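As a small illustration, the sketch below segments a digitized signal into consecutive 10 ms frames; non-overlapping frames and the function name are simplifying assumptions made for this example only.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=10.0):
    """Split a digitized speech signal into consecutive ~10 ms frames (sketch).

    signal: 1-D array of samples; fs: sampling frequency in Hz.
    Returns an array of shape (n_frames, samples_per_frame); any trailing
    partial frame is dropped for simplicity.
    """
    samples_per_frame = int(round(fs * frame_ms / 1000.0))
    n_frames = len(signal) // samples_per_frame
    return signal[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
```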

Two important and widely used techniques for speech recognition are presented in this section due to their relevance to the field. *Linear Predictive Coding*, or *LPC*, is a predictive technique in which a linear combination of some *K* coefficients and the last *K* samples of the signal predicts the value of the next sample; the *K* coefficients will represent the distinctive features. The following section explains the LPC method in detail.

#### **3.1 Linear Predictive Coding (LPC)**


LPC is one of the most advanced analytical techniques used in the estimation of patterns. It is based on the idea that the present sample can be predicted from a linear combination of some past samples, generating a spectral description of short segments of the signal by considering the signal *s*[*n*] to be the response of an all-pole filter to an excitation *u*[*n*].

Fig. 1. LPC model of speech.

Figure 1 shows the model on which LPC is based, where the excitation *u*[*n*] is the pattern waiting to be recognized. The transfer function of the filter is described as [3]:

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum\_{k=1}^{p} a\_k z^{-k}} = \frac{G}{A(z)}, \tag{1}$$

where *G* is a gain parameter, *ak* are the coefficients of the filter and *p* determines the order of the filter. In Figure 1, the samples *s*[*n*] are related to the excitation *u*[*n*] by the equation:

$$s[n] = \sum\_{k=1}^{p} a\_k s[n-k] + Gu[n]. \tag{2}$$

Considering that the linear combination of past samples is calculated by using an estimator $\tilde{s}[n]$, which is denoted by:

$$\tilde{s}[n] = \sum\_{k=1}^{p} a\_k s[n-k] \,\tag{3}$$


the error in the prediction is determined by the lack of accuracy with respect to *s*[*n*], which is defined as [20]:

$$e[n] = s[n] - \tilde{s}[n] = s[n] - \sum\_{k=1}^{p} a\_k s[n-k],\tag{4}$$

$$E(z) = S(z)\left( 1 - \sum\_{k=1}^{p} a\_k z^{-k} \right). \tag{5}$$

From equation (5), it is possible to recognize that the prediction-error sequence is produced by an FIR-type filter system, which is defined by:

$$A(z) = 1 - \sum\_{k=1}^{p} a\_k z^{-k} = \frac{E(z)}{S(z)},\tag{6}$$

Equations (2) and (4) show that *e*[*n*] = *Gu*[*n*]. The estimation of the prediction coefficients is obtained by minimizing the prediction error, where *e*[*m*]² denotes the squared prediction error and *E* is the total error over a time interval indexed by *m*. The prediction error in a short time segment is defined as:

$$E = \sum\_{m} e[m]^2 = \sum\_{m} (s[m] - \sum\_{k=1}^{p} a\_k s[m-k])^2,\tag{7}$$

The coefficients {*ak*} that minimize the prediction error *E* on the segment are obtained by setting the partial derivatives of *E* with respect to these coefficients to zero; this means that:

$$\frac{\partial E}{\partial a\_i} = 0, \quad 1 \le i \le p. \tag{8}$$

Through equations (7) and (8) the final equation is obtained:

$$\sum\_{k=1}^{p} a\_k \sum\_{n} s[n-k]s[n-i] = \sum\_{n} s[n]s[n-i], \qquad 1 \le i \le p. \tag{9}$$

This equation is written in terms of least squares and is known as a normal equation. For any definition of the signal *s*[*n*], equation (9) forms a set of *p* equations with *p* unknowns that must be solved for the coefficients {*ak*}, so as to reduce the error *E* of equation (7). The minimum total squared error, denoted by *Ep*, is obtained by expanding equation (7) and substituting equation (9) into the result, that is:

$$Ep = \sum\_{n} s^{2}[n] - \sum\_{k=1}^{p} a\_{k} \sum\_{n} s[n]s[n-k]. \tag{10}$$

A common way of solving this system is the autocorrelation method [8]. For its application, it is assumed that the error of equation (7) is minimized over the infinite interval −∞ < *n* < ∞, so equations (9) and (10) simplify to:

$$\sum\_{k=1}^{p} a\_k R(i-k) = R(i), \qquad 1 \le i \le p, \tag{11}$$

$$Ep = R(0) - \sum\_{k=1}^{p} a\_k R(k),\tag{12}$$

where:


$$R(i) = \sum\_{n=-\infty}^{\infty} s[n]s[n+i], \tag{13}$$

which is the autocorrelation function of the signal *s*[*n*], with *R*(*i*) an even function. The coefficients *R*(*i−k*) generate the auto-correlation matrix, which is a symmetric Toeplitz matrix; i.e., all elements in each diagonal are equal. For practical purposes, the signal *s*[*n*] is analyzed in a finite interval. One popular way of approaching this is to multiply the signal *s*[*n*] by a window function *w*[*n*] in order to obtain the signal *s*'[*n*]:

$$s'[n] = \begin{cases} s[n]w[n], & 0 \le n \le N-1\\ 0, & \text{otherwise,} \end{cases} \tag{14}$$

Using equation (14), the auto-correlation function is given by:

$$R(i) = \sum\_{n=0}^{N-1-i} s^\prime[n]s^\prime[n+i], \qquad i \ge 0. \tag{15}$$

One of the most common ways to find the coefficients {*ak*} is by computational methods, where equation (11) is expanded into a matrix equation of the form:

$$
\begin{bmatrix} R\_0 & R\_1 & R\_2 & \cdots & R\_{p-1} \\ R\_1 & R\_0 & R\_1 & \cdots & R\_{p-2} \\ R\_2 & R\_1 & R\_0 & \cdots & R\_{p-3} \\ \vdots & \vdots & \vdots & & \vdots \\ R\_{p-1} & R\_{p-2} & R\_{p-3} & \cdots & R\_0 \end{bmatrix} \begin{bmatrix} a\_1 \\ a\_2 \\ a\_3 \\ \vdots \\ a\_p \end{bmatrix} = \begin{bmatrix} R\_1 \\ R\_2 \\ R\_3 \\ \vdots \\ R\_p \end{bmatrix} \tag{16}
$$

and it is necessary to use an algorithm to find these coefficients; one of the most commonly used is the Levinson-Durbin recursion, which is described below [21]:

$$\begin{aligned} E^0 &= R(0) \\ a\_0 &= 1 \\ \text{for } i &= 1, 2, \dots, p \\ k\_i &= \frac{\left( R(i) - \sum\_{j=1}^{i-1} a\_j^{(i-1)} R(i-j) \right)}{E^{(i-1)}} \\ a\_i^{(i)} &= k\_i \end{aligned}$$


$$\begin{aligned} \text{if } i &> 1 \text{ then for } j = 1, 2, \dots, i - 1\\ a\_j^{(i)} &= a\_j^{(i-1)} - k\_i a\_{i-j}^{(i-1)}\\ \text{end}\\ E^{(i)} &= (1 - k\_i^2) E^{(i-1)}\\ \text{end}\\ a\_j &= a\_j^{(p)} \quad j = 1, 2, \dots, p. \end{aligned}$$
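As a concrete illustration, the following is a minimal Python sketch of the recursion above, together with the autocorrelation of equation (15); the function names and the use of NumPy are illustrative assumptions, not part of the original system.

```python
import numpy as np

def autocorr(frame, p):
    """R(i) of equation (15), i = 0..p, for one windowed frame s'[n]."""
    N = len(frame)
    return np.array([np.dot(frame[:N - i], frame[i:]) for i in range(p + 1)])

def levinson_durbin(R, p):
    """Levinson-Durbin recursion for the LPC coefficients a_1..a_p (sketch).

    R: autocorrelation values R(0)..R(p). Returns (a, E) where a holds the
    coefficients of equation (16) and E is the final prediction error E^(p).
    """
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = R[0]
    for i in range(1, p + 1):
        # reflection coefficient k_i
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        E *= (1.0 - k * k)
    return a[1:], E
```

For a windowed frame of speech, one would typically call `levinson_durbin(autocorr(frame, p), p)`, with *p* chosen as discussed below (equation (18)).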

An important feature of this algorithm is that, while performing the recursion, an estimate of the mean squared prediction error must be made. This prediction satisfies the system function given in equation (17), which corresponds to the term A(z) of equation (1); namely:

$$A^{(i)}(z) = A^{(i-1)}(z) - k\_i z^{-i} A^{(i-1)}(z^{-1}), \tag{17}$$

A fundamental part of characterizing the signal through prediction coefficients is establishing an adequate number of coefficients *p*, according to the sampling frequency (*fs*) and the resonances per kHz [3], which is:

$$p = 4 + \frac{f\_s}{1000} \tag{18}$$

where the optimal number of LPC coefficients is the one that yields the lowest mean square error possible. Figure 2 shows the calculation of LPC for a voice signal with an 8 kHz sampling rate and the effect of varying the number of coefficients in a segment (frame).
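For instance, at the 8 kHz sampling rate used for Figure 2, equation (18) gives *p* = 4 + 8000/1000 = 12 coefficients.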

Fig. 2. Comparison of the original signal spectrum and LPC envelope with different numbers of coefficients.

#### **3.2 Dynamic Time Warping (DTW)**

Another technique commonly used in speech recognition is dynamic time warping (DTW); it is presented here, again, for its relevance to this field. DTW is widely used in pattern recognition, particularly to handle temporal distortions between vectors, such as variations in writing time, camera speed, the omission of a letter, etc. These temporal variations are not proportional and vary according to each person, object or event, and such situations are not repeatable in any respect. DTW uses *dynamic programming* to find similarities and differences between two or more vectors.

This method considers two sequences of feature vectors defined by *a*(*i*), *i*=1,2,…,*I* and *b*(*j*), *j*=1,2,…,*J*, where, in general, the number of elements differs in each sequence (*I*≠*J*). The aim of DTW is to find an appropriate distance between the two sequences in a two-dimensional plane, where each sequence represents one axis and each point corresponds to the local relationship between the two sequences. The nodes (*i,j*) of the plane are associated with a cost defined by the function *d*(*c*)=*d*(*i,j*)=*a*(*i*)-*b*(*j*), which represents the distance between the elements *a*(*i*) and *b*(*j*).

The collection of points begins at the starting node (*i*0,*j*0) and finishes at the node (*iK*,*jK*), forming an ordered sequence of size *K*, where *K* is the number of nodes along the way. Every path established with these points is associated with a total cost *D*, defined by

$$D = \sum\_{k=0}^{K-1} d(i\_k, j\_k), \tag{19}$$

and the distance between the two sequences is defined as the minimum value of *D* over all possible paths:

$$D(a,b) = \min\_{k}(D). \tag{20}$$

There are normalization and temporal constraints in the search for the minimum distance between the patterns to be compared [22], [1]. These constraints are: endpoint, monotonicity conditions, local continuity, global path and slope weighting.

The endpoint is bounded by the size of the windowing performed on each pattern; in most cases it is empirical and fixed at the extremes, that is:

$$\begin{aligned} i(1) &= 1, \; j(1) = 1\\ i(K) &= I, \; j(K) = J. \end{aligned} \tag{21}$$

Figure 3 shows an example in which one of the sequences is only partially considered, a situation that is not allowed in the search for the minimum cost.

The monotonicity conditions try to maintain the temporal order during time normalization and avoid negative slopes, requiring

$$i\_{k+1} \ge i\_k \tag{22}$$

and


$$j\_{k+1} \ge j\_k \,\tag{23}$$


Fig. 3. Example of a sequence that violates the rule of the endpoint.

Figure 4 shows an example of paths without monotonicity, which are not allowed when searching for the optimal path.

Fig. 4. Example of a path without monotonicity.

The continuity conditions

$$i(k) - i(k-1) \le 1 \quad \text{and} \quad j(k) - j(k-1) \le 1,\tag{24}$$

are defined by maintaining the relationship between two consecutive points of the form:

$$c(k-1) = \begin{cases} (i(k), j(k)-1), \\ (i(k)-1, j(k)-1), \\ (i(k)-1, j(k)), \end{cases} \tag{25}$$

Global limitations define a region of nodes within which the optimal path is sought, based on a parallelogram that offers a feasible region [7], thereby avoiding unnecessary regions in the processing. Figure 5 shows the values of the key points of the parallelogram.

Fig. 5. Global region and determination of slopes.


The optimal path layout defines a measure of dissimilarity between the two sequences of features, whose general form is

$$D(a,b) = \min\_{\mathcal{F}} \left[ \frac{\sum\_{k=1}^{K} d(c(k)) \cdot w(k)}{\sum\_{k=1}^{K} w(k)} \right],\tag{26}$$

where *d*(*c*(*k*)) is the local distance between window *i*(*k*) of the reference vector and window *j*(*k*) of the vector to be recognized, and *w*(*k*) is a weighting function in *k* used to keep the alignment flexible and improve it. The simplified computational algorithm for calculating the DTW distance is shown below [1]:

$$\begin{aligned}
&D(0,0)=0\\
&\text{for } i=1:I\\
&\quad D(i,0)=D(i-1,0)+1\\
&\text{end}\\
&\text{for } j=1:J\\
&\quad D(0,j)=D(0,j-1)+1\\
&\text{end}\\
&\text{for } i=1:I\\
&\quad \text{for } j=1:J\\
&\qquad c1=D(i-1,j-1)+d(i,j \mid i-1,j-1)\\
&\qquad c2=D(i-1,j)+1\\
&\qquad c3=D(i,j-1)+1\\
&\qquad D(i,j)=\min(c1,c2,c3)\\
&\quad \text{end}\\
&\text{end}\\
&D(a,b)=D(I,J)
\end{aligned}$$
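As an illustration, the following Python sketch mirrors the simplified recursion above; the function name `dtw_distance` and the default scalar local distance are illustrative choices, not part of the original implementation.

```python
import numpy as np

def dtw_distance(a, b, d=lambda x, y: abs(x - y)):
    """DTW distance following the simplified recursion above (sketch).

    a, b: the two feature sequences; d: local distance between elements.
    Horizontal and vertical moves carry a unit penalty, as in the pseudocode.
    """
    I, J = len(a), len(b)
    D = np.zeros((I + 1, J + 1))
    for i in range(1, I + 1):
        D[i, 0] = D[i - 1, 0] + 1
    for j in range(1, J + 1):
        D[0, j] = D[0, j - 1] + 1
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            c1 = D[i - 1, j - 1] + d(a[i - 1], b[j - 1])
            c2 = D[i - 1, j] + 1
            c3 = D[i, j - 1] + 1
            D[i, j] = min(c1, c2, c3)
    return D[I, J]
```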


#### **4. Mathematical foundations**

#### **4.1 Fisher's Linear Discriminant Ratio (FLDR)**

*Fisher's Linear Discriminant Ratio* is used as an optimization criterion in several research fields, including speech recognition, handwriting recognition, and others [1]. Consider the following definitions:

#### **Within-class scatter matrix**

$$S\_w = \sum\_{i=1}^{M} P\_i S\_i \tag{27}$$

where *Si* is the covariance matrix for class ω*i*, and *Pi* the a priori probability of class ω*i*. Trace{*Sw*} is a measure of the average variance of the features, or descriptive elements of the class.

#### **Between-class scatter matrix**

$$S\_b = \sum\_{i=1}^{M} P\_i(\mu\_i - \mu\_0)(\mu\_i - \mu\_0)^T \tag{28}$$

where μ*0* is the global mean vector

$$
\mu\_0 = \sum\_{i=1}^{M} P\_i \mu\_i \tag{29}
$$

Trace{*Sb*} is a measure of the average distance of the mean of each individual class from the respective global value.

#### **Mixture scatter matrix**

$$S\_m = E\left[ (x - \mu\_0)(x - \mu\_0)^T \right] \tag{30}$$

*Sm* is the covariance matrix of the feature vector with respect to the global mean, and E[.] is the expected value operator. Based on the definitions just given, the following criteria can be expressed:

$$J\_1 = \frac{\text{trace}\{S\_m\}}{\text{trace}\{S\_w\}}, \qquad J\_2 = \frac{\left| S\_b \right|}{\left| S\_w \right|} = \left| S\_w^{-1} S\_b \right| \tag{31}$$

It can be shown that J1 and J2 take large values when the samples in the *l*-dimensional space are well clustered around their mean within each class, and the clusters of the different classes are well separated. The criteria in equation (31) can be used to guide an optimization process, since they measure the goodness of the data clustering; the data to be clustered could be the set of features representative of the items of a class.
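For illustration, the following sketch computes the scatter matrices and both criteria for labeled feature vectors; the function name, the use of empirical priors and of NumPy are assumptions made only for this example.

```python
import numpy as np

def fldr_criteria(X, y):
    """Scatter matrices and the J1, J2 criteria of equations (27)-(31) (sketch).

    X: array of shape (n_samples, l) with the feature vectors; y: class labels.
    Returns (J1, J2); both grow when classes are compact and far apart.
    """
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                         # empirical P_i
    mu0 = X.mean(axis=0)                             # global mean, equation (29)
    l = X.shape[1]
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)    # equation (27)
        Sb += p * np.outer(mu_c - mu0, mu_c - mu0)       # equation (28)
    Sm = Sw + Sb                                     # mixture scatter, equation (30)
    J1 = np.trace(Sm) / np.trace(Sw)
    J2 = np.linalg.det(np.linalg.solve(Sw, Sb))      # |Sw^{-1} Sb|, equation (31)
    return J1, J2
```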

Figure 6 shows an example in which the FLDR is evaluated using equation (31); the respective FLDR values are displayed. Notice that the more the blue clusters are separated from the red cluster, the bigger the FLDR value is.

Fig. 6. Example of the FLDR values for two clusters.

#### **4.2 Parseval's theorem and the Fourier Transform**

Parseval's theorem states, in an elegant manner, that the energy of a discrete signal in the time domain can be calculated in the frequency domain by a simple relation [2], [3]:

$$\sum\_{i=0}^{N-1} \mathbf{x}[i]^2 = \frac{2}{N} \sum\_{k=0}^{N/2} \left\| \mathbf{X}[k] \right\|^2 \tag{32}$$

where

x[i] is the *i*-th sample of the discrete signal,

X[k] is the *k*-th sample of the Fourier transform of x[i], and

N is the number of samples of the discrete signal.


For a discrete signal x[i], the Fourier transform can be computed using the well-known *Discrete Fourier Transform* via its efficient implementation, the FFT [2], [3]:

$$X[k] = \sum\_{n=0}^{N-1} x[n]e^{-2i\pi k \frac{n}{N}} \quad k = 0, 1, \ldots, N-1 \tag{33}$$

The implication of Parseval's theorem is that an algorithm can search for specific energetic properties of a signal in the frequency domain off-line, and then use the information obtained off-line to configure a bank of digital filters that looks for the same energetic properties in the time domain on-line, in real time. The *link* between both domains is the energetic content of the signal.
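A minimal sketch of this idea is given below: the sub-band energies of a frame are read directly from its FFT, and the last two lines check Parseval's relation in its standard two-sided form. The function `band_energies` and the band-list format are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

def band_energies(frame, fs, bands):
    """Energy of each frequency sub-band of one frame, read from its spectrum.

    frame: 1-D array of N time-domain samples; fs: sampling frequency in Hz;
    bands: list of (f_low, f_high) tuples in Hz, e.g. candidate filter-bank bands.
    """
    N = len(frame)
    X = np.fft.rfft(frame)                         # X[k] for k = 0 .. N/2
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    power = np.abs(X) ** 2
    return [power[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in bands]

# Quick check of Parseval's relation (two-sided form of equation (32)):
x = np.random.randn(80)                            # e.g. one 10 ms frame at 8 kHz
assert np.isclose(np.sum(x ** 2), np.sum(np.abs(np.fft.fft(x)) ** 2) / len(x))
```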


#### **5.2 Backpropagation Neural Networks**

Supervised learning can be thought of as taking your salespeople off the floor and placing them in an environment where they can focus on improving their skills without distractions. After a suitable training period, they are sent out to apply their new-found knowledge and skills. In an intelligent system context, this means that we would gather data from situations that the systems have experienced. We could then augment this data with information about the desired system response to build a training data set. Once we have this database we can use it to modify the behavior of our system.

Backpropagation is the most popular neural network architecture for *supervised learning*. It features a *feed-forward* connection topology, meaning that data flow through the network in a single direction, and uses a technique called the *backward propagation* of errors to adjust the connection weights (Rumelhart, Hinton, and Williams, 1986, in [23]). In addition to a layer of input and output units, a backpropagation network can have one or more layers of hidden units, which receive inputs only from other units and not from the external environment. A backpropagation network with a single hidden layer of processing units can learn to model any continuous function when given enough units in the hidden layer. The primary applications of backpropagation networks are prediction and classification.

Figure 7 shows the diagram of a backpropagation neural network and illustrates the three major steps in the training process.

Fig. 7. Topology of a backpropagation neural network.

**First**, input data is presented to the units of the input layer on the left, and it flows through the network until it reaches the output units on the right. This is called the forward pass.

**Second**, the activations or values of the output units represent the actual or predicted output of the network, because this is supervised learning.
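As a sketch of the forward pass and weight adjustment described above, the following assumes a single hidden layer, sigmoid units and a squared-error loss; the names, shapes and learning rate are illustrative and not those of the network used in this chapter.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used by the hidden and output units."""
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, W1, W2, lr=0.1):
    """One supervised training step for a single-hidden-layer network (sketch).

    x: input vector; target: desired output vector; W1, W2: weight matrices
    of the hidden and output layers (biases omitted for brevity); lr: learning rate.
    """
    # forward pass: data flows from the input layer to the output layer
    h = sigmoid(W1 @ x)
    out = sigmoid(W2 @ h)
    # backward pass: propagate the output error and adjust the weights
    delta_out = (out - target) * out * (1.0 - out)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return out
```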

