**Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform**

Alfredo Victor Mantilla Caeiros and Hector Manuel Pérez Meana

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49943

## **1. Introduction**

196 Modern Speech Recognition Approaches with Case Studies


People who suffer from diseases such as throat cancer may require surgical removal of the larynx and vocal cords, and then need rehabilitation in order to reintegrate into their individual, social, family and work activities. To accomplish this, different methods have been proposed, such as esophageal speech, tracheoesophageal prostheses, and the Artificial Larynx Transducer (ALT), also known as the "electronic larynx" [1, 2].

The ALT, which has the shape of a handheld device, introduces an excitation into the vocal tract by applying a vibration against the external walls of the neck. The excitation is then modulated by the movement of the oral cavity to produce the speech sound. The transducer is held against the speaker's neck, and in some cases against the speaker's cheek. The ALT is widely recommended by voice rehabilitation physicians because it is easy to use, even for new patients, although the voice produced by these transducers is unnatural and of low quality, and is further distorted by the background noise the ALT itself produces. Thus, the ALT results in a considerable degradation of the quality and intelligibility of speech, a problem for which an optimal solution has not yet been found [2].

Esophageal speech, on the other hand, is produced by compressing the air contained in the vocal tract, from the stomach to the mouth through the esophagus. This air is swallowed, and it produces a vibration of the upper esophageal muscle as it passes through the esophageal-larynx segment, producing the speech. The generated sound is similar to a burp: the tone is commonly very low and the timbre is generally harsh. As in ALT-produced speech, the voiced segments of esophageal speech are the most affected parts of a word or phrase, resulting in unnatural speech. Thus many efforts have been carried out to improve its quality and intelligibility.

© 2012 Mantilla Caeiros and Pérez Meana, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Several approaches have been proposed to improve the quality and intelligibility of alaryngeal speech, esophageal as well as ALT produced speech [2, 3].


This chapter presents an alaryngeal speech enhancement system, which uses several speech recognition methods such as voiced and unvoiced segment detection, a feature extraction method and pattern recognition algorithms.

The content of this chapter is as follows:

1. Acquisition and preprocessing of esophageal speech: This section explains the acquisition and preprocessing of the speech signal, including filtering, segmentation and windowing.
2. Voiced/unvoiced segment detection: This section discusses several methods for classifying voiced and unvoiced segments, such as:
	- Pitch detection
	- Zero crossing
	- Formant analysis
3. Feature extraction: The performance of any speech recognition algorithm strongly depends on the accuracy of the feature extraction method. This section presents some important feature extraction methods such as Linear Predictive Coding (LPC) and the cepstral coefficients, as well as a feature extraction method based on an inner ear model, which takes into account the fundamental concepts of critical bands using a wavelet function. The latter method emulates the basilar membrane operation through a multiresolution analysis similar to that performed by a wavelet transform.
4. Classifier: The parameter vector obtained in the feature extraction stage is supplied to a classifier. The classification stage consists of neural networks, which identify the voiced segments present in the segment under analysis.
5. Voice synthesis: The detected voiced segments are replaced by voiced segments of a normal speaker and concatenated with the unvoiced and silent segments to produce the restored speech.
6. Results: Finally, using objective and subjective evaluation methods, it is shown that the proposed system provides a fairly good improvement of the quality and intelligibility of alaryngeal speech signals.

## **2. Methods**

Figure 1 shows a block diagram of the proposed system. It is based on the replacement of the voiced segments of alaryngeal speech by their equivalent normal speech voiced segments, while keeping the unvoiced and silence segments unchanged. The main reason is that the voiced segments have a more significant impact on speech quality and intelligibility than the unvoiced segments.

**Figure 1.** Block Diagram

The following explains the stages of the system.

## **2.1. Data acquisition**

The first stage consists of recording a speech file from an esophageal speaker. The audio data is stored as a WAV file. All files are monaural and digitized in PCM (Pulse Code Modulation) format at a sampling rate of 8000 Hz with a resolution of 8 bits.

## **2.2. Preprocessing**

The digital signal is low-pass filtered to reduce the background noise. This stage is implemented with a 200th-order digital FIR filter with a cut-off frequency of 900 Hz. A common practice in speech recognition is the use of a pre-emphasis filter to amplify the higher-frequency components of the signal, with the purpose of emulating the additional sensitivity of the human ear to high frequencies. Generally, a high-pass filter characterized by a slope of 20 dB per decade is used [4].
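As an illustration, the preprocessing stage can be sketched as follows. The windowed-sinc design method, the function names, and the pre-emphasis coefficient of 0.97 are assumptions made for this sketch; the chapter specifies only the filter order, the 900 Hz cutoff, and the roughly 20 dB/decade pre-emphasis slope.

```python
import numpy as np

FS = 8000  # sampling rate used in the chapter

def lowpass_fir(cutoff_hz, order, fs=FS):
    """Windowed-sinc FIR low-pass design (a sketch; the chapter does not
    state its design method)."""
    n = np.arange(order + 1) - order / 2
    h = (2 * cutoff_hz / fs) * np.sinc(2 * cutoff_hz / fs * n)
    h *= np.hamming(order + 1)
    return h / h.sum()  # normalize to unit DC gain

def preemphasis(x, alpha=0.97):
    """First-order high-pass y[n] = x[n] - alpha*x[n-1]; alpha = 0.97 is a
    common choice, assumed here."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Usage sketch on a stand-in signal (1 s of noise):
x = np.random.randn(FS)
h = lowpass_fir(900, 200)          # 200th-order FIR, 900 Hz cutoff
y = preemphasis(np.convolve(x, h, mode="same"))
```

The order of the two filters here (low-pass first, then pre-emphasis) is one reasonable reading of the text, not something the chapter states explicitly.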

## **2.3. Segmentation and frame windowing**

The filtered signal is divided into 100 ms segments (800 samples), and each segment is in turn subdivided into 10 ms blocks, on which the subsequent processing is performed. The size of these blocks is determined by the quasi-periodic and quasi-stationary character of speech over such an interval.

A Hamming window is applied to the segmented signal so that the samples at the edges of each segment receive less weight than the central samples. The window length is chosen to be larger than the frame interval, preventing the loss of information that could take place during the transitions from one frame to the next.
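A minimal framing-and-windowing sketch follows. Because the chapter says only that the window is longer than the 10 ms frame without giving its length, a 15 ms window (1.5x the frame) is assumed here; the function name is illustrative.

```python
import numpy as np

FS = 8000
FRAME = 80      # 10 ms blocks at 8 kHz
SEGMENT = 800   # 100 ms segments

def frame_signal(x, frame_len=FRAME, window_len=None):
    """Split x into 10 ms frames advanced by frame_len samples, applying a
    Hamming window slightly longer than the frame so adjacent windows
    overlap (window_len = 1.5x the frame is an assumption)."""
    if window_len is None:
        window_len = frame_len * 3 // 2   # assumed 15 ms window
    win = np.hamming(window_len)
    frames = []
    for start in range(0, len(x) - window_len + 1, frame_len):
        frames.append(x[start:start + window_len] * win)
    return np.array(frames)

frames = frame_signal(np.random.randn(SEGMENT))
```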

## **2.4. Voiced/unvoiced segment detection**

A voiced (sonorous) segment is characterized by periodic or quasi-periodic behavior in time and a fine harmonic frequency structure produced by the vibration of the vocal cords, as well as a high energy concentration due to the little obstruction that the air meets on its way through the vocal tract. The vowels and some consonants present such behavior.


Several approaches have been proposed to detect the voiced segments of speech signals. However, a single decision criterion is not enough to determine whether a speech segment is voiced or unvoiced; thus most algorithms in the speech processing area combine more than one criterion. The proposed speech restoration method uses the combination of the energy average, zero crossings, and formant analysis of the speech signal for voiced/unvoiced segment classification.

## *2.4.1. Energy average*

A first criterion weighs the average power of each frame by comparing it with that of its surroundings. An interval of 100 milliseconds in the neighborhood of the current frame is used. As part of the system's initial configuration, two thresholds are fixed. If the quotient between the frame's average power and that of the frame's surroundings is smaller than the lower threshold, the frame is labeled as unvoiced; if the quotient is larger than the higher threshold, the frame is taken as voiced. For those cases in which the average power ratio lies between the two thresholds, the energy criterion alone is not enough to determine the signal's nature.
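The energy criterion can be sketched as follows. The chapter fixes its two thresholds empirically without quoting values, so the `low` and `high` defaults below are illustrative, as are the function name and the three-way label encoding.

```python
import numpy as np

def energy_label(frames, low=0.5, high=2.0):
    """Energy-ratio criterion sketch: compare each 10 ms frame's average
    power with the mean power of its ~100 ms neighborhood. Thresholds
    `low`/`high` are illustrative placeholders.
    Returns +1 (voiced), -1 (unvoiced) or 0 (undecided) per frame."""
    power = np.mean(frames ** 2, axis=1)
    labels = np.zeros(len(power), dtype=int)
    for i in range(len(power)):
        lo, hi = max(0, i - 5), min(len(power), i + 5)  # ~10 frames = 100 ms
        ratio = power[i] / (np.mean(power[lo:hi]) + 1e-12)
        if ratio < low:
            labels[i] = -1
        elif ratio > high:
            labels[i] = 1
    return labels
```

A frame much louder than its neighborhood is labeled voiced; a uniform signal stays undecided, deferring to the other two criteria.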

## *2.4.2. Zero crossing*

The second criterion is based on the signal periodicity, measured by the number of zero crossings in each frame. Two thresholds are used, based on the observation that in a noise-free 10 ms speech segment a voiced segment has about 12 zero crossings, while an unvoiced segment has about 50 [5, 6]. These values are not fixed and must be adjusted according to the sampling frequency used. In the proposed algorithm, for a sampling frequency of 8 kHz, the maximum number of zero crossings that can be detected in 10 ms is approximately 40. Thus an upper threshold of 30 was chosen for voiced/unvoiced classification.
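A sketch of the zero-crossing criterion with the upper threshold of 30 from the text (function names are mine; treating exact zeros as positive is an implementation choice):

```python
import numpy as np

def zero_crossings(frame):
    """Count sign changes between consecutive samples of a frame."""
    s = np.sign(frame)
    s[s == 0] = 1          # treat exact zeros as positive (a choice)
    return int(np.sum(s[:-1] != s[1:]))

def is_voiced_zc(frame, threshold=30):
    """Upper threshold of 30 crossings per 10 ms frame, as in the text."""
    return zero_crossings(frame) < threshold

# A 100 Hz tone sampled at 8 kHz crosses zero rarely within 10 ms:
t = np.arange(80) / 8000.0
low_tone = np.sin(2 * np.pi * 100 * t)
```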

## *2.4.3. Formant analysis*

The third criterion is based on the amplitude of the formants, which represent the resonance frequencies of the vocal tract. Formants are the envelope peaks of the speech signal's power spectral density. The frequencies at which the first formants occur are of great importance in speech recognition [4].

The formants are obtained from the roots of the polynomial generated by the linear prediction coefficients (LPC) that represent the vocal tract filter. Once the formants, whose frequencies are defined by the angles of the roots closest to the unit circle, are obtained, they are sorted in ascending order and the first three are chosen as parameters of the speech segment. These formants are then stored in the system so that they can be used to make the voiced/unvoiced decision. Using the normalized Fast Fourier Transform (FFT), the amplitude at each formant frequency can be obtained.

To decide whether a segment is voiced or not, the formant amplitudes are normalized over each 100 millisecond segment. The algorithm finds the maximum value of each formant among the 10 values stored for the segment, and each value is then divided by the corresponding estimated maximum, as shown in (1).

$$\begin{aligned} AF\_1 &= \left[ \frac{AF\_{1-1}}{AF\_{1\text{Max}}} \frac{AF\_{1-2}}{AF\_{1\text{Max}}} \dots \frac{AF\_{1-10}}{AF\_{1\text{Max}}} \right] \\ AF\_2 &= \left[ \frac{AF\_{2-1}}{AF\_{2\text{Max}}} \frac{AF\_{2-2}}{AF\_{2\text{Max}}} \dots \frac{AF\_{2-10}}{AF\_{2\text{Max}}} \right] \\ AF\_3 &= \left[ \frac{AF\_{3-1}}{AF\_{3\text{Max}}} \frac{AF\_{3-2}}{AF\_{3\text{Max}}} \dots \frac{AF\_{3-10}}{AF\_{3\text{Max}}} \right] \end{aligned} \tag{1}$$

The local normalization process is justified for esophageal speakers because of their loss of energy as they speak. Once the normalized values are obtained, the decision is made using an experimental threshold value equal to 0.25. This can be seen as a logic mask in the algorithm: normalized values greater than 0.25 are set to one, and otherwise to zero, as shown in (2).

$$AF\_{x-N} = \begin{cases} 0 & \dfrac{AF\_{x-N}}{AF\_{x\text{Max}}} < 0.25 \\[6pt] 1 & \dfrac{AF\_{x-N}}{AF\_{x\text{Max}}} \ge 0.25 \end{cases} \tag{2}$$

Next, a logical AND operation is applied across the three formant arrays using the values obtained after the threshold operation. Only the segments in which all three formants have values above 0.25 are considered voiced segments.
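The normalization, thresholding and AND steps of Eqs. (1) and (2) can be sketched as follows; the 3x10 array layout (three formants, ten 10 ms blocks per 100 ms segment) follows the text, while the function name is illustrative.

```python
import numpy as np

def voiced_mask(AF, thr=0.25):
    """Sketch of Eqs. (1)-(2). AF is a 3 x 10 array of formant amplitudes:
    each row (one formant across the ten blocks) is normalized by its
    maximum (Eq. 1), thresholded at 0.25 (Eq. 2), and the three rows are
    combined with a logical AND."""
    norm = AF / AF.max(axis=1, keepdims=True)     # Eq. (1)
    mask = norm > thr                             # Eq. (2)
    return np.logical_and.reduce(mask, axis=0)    # AND across 3 formants
```

A block is kept as voiced only when all three normalized formant amplitudes exceed the threshold.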

Finally, using the three criteria mentioned above, a window is applied to the original signal which is equal to one if the segment is classified as voiced by all three methods, and equal to zero otherwise, so that only the voiced segments of the original signal are retained.

#### **2.5. Feature vector extraction**


The performance of any speech recognition algorithm strongly depends on the accuracy of the feature extraction method. This fact has motivated the development of several efficient algorithms to estimate a set of parameters that allows a robust characterization of the speech signal. Some of these methods are the Linear Prediction Coefficients (LPC), formant frequencies analysis, and the Mel Frequency Cepstral Coefficients (MFCC), among others [5, 6]. This section discusses these methods and proposes one based on the wavelet transform.

#### *2.5.1. Linear Prediction Coefficients (LPCs)*

The LPC method is based on the fact that the signal can be approximated by a weighted sum of preceding samples [7]. This approximation is given by:

$$s'\_n = \sum\_{k=1}^p a\_k s\_{n-k} \tag{3}$$

Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform 203

and imaginary part *<sup>p</sup>*

(6)

(8)

(9)

 (10)

.

(7)

denominator of (5) to zero and solving it to find its roots. The S plane conversion is done by substituting z by *skT e* , where sk is the pole in the s plane. The resultant roots are generally

Formants frequencies are obtained from the polynomial roots generated by the linear prediction coefficients. The formant frequency is defined by the angle of the roots closer to the unitary circle. A root, with an angle close to zero, indicates the existence of a formant near the origin. A root whose angle is in close proximity to π indicates that the formant is located near the maximum frequency, in this case 4000Hz. Since the frequency dominion is symmetric with respect to the vertical axis, the roots located in the inferior semi plane of z

> *<sup>p</sup> p p r j*

The roots (r) which are located in the superior semi plane near the unitary circle can be

*p p*

By using the arctangent function is possible to obtain the roots angle. Doing this, the roots

*p p <sup>f</sup> for* 

Once the formants are obtained, they are organized in ascending order, and the first three

The cepstral coefficients estimation is another widely used feature extraction method in speech recognition problems. These coefficients form a very good features vector for the

development of speech recognition algorithms, sometimes better than the LPC ones.

Cepstrum is defined as the inverse Fourier transform of spectrum module logarithm ( 9)

<sup>1</sup> *ct F* log S

 <sup>1</sup> *ct F* log E log H 

*for*

 

*r for*

0 0.01

*p*

2 *s*

0.01

conjugated complex pairs.

plane can be ignored.

obtained using (7).

Let rp as a linear prediction coefficient root with real part *<sup>p</sup>*

are chosen as parameters of the speech segment.

Developing de above equation it obtains:

*2.5.3. Mel Frequency Cepstral Coefficients (MFCC)* 

*p*

are mapped into the frequency dominion by using (8) to get the formants.

*r*

where *ak* (1 ≤ k ≤ p) is a set of real constants, known as the predictor coefficients, that must be calculated, and *p* is the predictor order. The linear prediction problem consists in finding the predictor coefficients *ak* that minimize the error between the real value of the signal and its approximation.

To minimize the total quadratic error, it is necessary to calculate the autocorrelation coefficients. This leads to a matrix equation with several recursive solutions, of which the most commonly used is the Levinson recursion.

The developed algorithm takes each segment of 10 milliseconds and calculates its linear prediction coefficients. The number of predictor coefficients is obtained by substituting the sampling frequency value (*fs*) in (4).

$$p = 4 + \frac{f\_s}{1000} = 4 + \frac{8000}{1000} = 12\tag{4}$$

The minimal error sequence can be interpreted as the output of the filter H(z) when it is excited by the signal *sn*. H(z) is usually known as the inverse filter. The approximate transfer function can be obtained if it is assumed that the transfer function S(z) of the signal is modeled as an all-pole filter of the form (5).

$$\hat{S}\left(z\right) = \frac{A}{H\left(z\right)} = \frac{A}{1 - \sum\_{k=1}^{p} a\_k z^{-k}}\tag{5}$$

The LPC coefficients correspond to the poles of $\hat{S}(z)$. Therefore, the LPC analysis aims to calculate the filter properties of the vocal tract that produces the sonorous signal.
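The LPC computation of Eqs. (3)-(5) can be sketched with the autocorrelation method and the Levinson recursion mentioned above; the function name and sign convention (returning the inverse-filter polynomial with a leading 1) are implementation choices for this sketch.

```python
import numpy as np

def lpc_coefficients(frame, p=12):
    """Autocorrelation-method LPC via the Levinson recursion.
    p = 4 + fs/1000 = 12 for fs = 8 kHz, as in Eq. (4). Returns the
    inverse-filter polynomial a with a[0] = 1 (the predictor
    coefficients a_k of Eq. (3) are -a[1..p]) and the prediction
    error power err."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = np.dot(a[:i], r[i:0:-1])            # r[i], r[i-1], ..., r[1]
        k = -acc / err                            # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a, err

# Usage sketch: a decaying exponential obeys s_n = 0.5 s_{n-1} exactly,
# so a first-order predictor should recover a_1 = 0.5 (i.e. a[1] = -0.5).
x = 0.5 ** np.arange(50)
a, err = lpc_coefficients(x, p=2)
```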

#### *2.5.2. Formant frequencies analysis*

Formants are the envelope peaks of the speech signal spectrum that represent the resonance frequency of the vocal tract.

If the spectrum of a speech signal can be approximated only by its poles, then the formants can be obtained from the poles of $\hat{S}(z)$. The poles of $\hat{S}(z)$ can be calculated by setting the denominator of (5) to zero and solving for its roots. The conversion to the s plane is done by substituting *z* by $e^{s\_k T}$, where *sk* is the pole in the s plane. The resulting roots are generally complex conjugate pairs.

Formant frequencies are obtained from the roots of the polynomial generated by the linear prediction coefficients. Each formant frequency is defined by the angle of a root close to the unit circle. A root with an angle close to zero indicates the existence of a formant near the origin, while a root whose angle is close to π indicates a formant located near the maximum frequency, in this case 4000 Hz. Since the roots occur in complex conjugate pairs, symmetric with respect to the real axis, the roots located in the lower half of the z plane can be ignored.

Let $r\_p$ be a root of the linear prediction polynomial, with real part $\phi\_p$ and imaginary part $\theta\_p$:

$$r\_p = \phi\_p + \mathrm{j}\theta\_p \tag{6}$$

The roots located in the upper half plane near the unit circle can be obtained using (7).

$$r\_p = \begin{cases} 0 & \text{for} \quad \theta\_p \le 0.01\\ r\_p & \text{for} \quad \theta\_p > 0.01 \end{cases} \tag{7}$$

Using the arctangent function, it is possible to obtain the angle $\vartheta\_p$ of each root. The roots are then mapped into the frequency domain using (8) to obtain the formants.

$$for\_p = \vartheta\_p \frac{f\_s}{2\pi} \tag{8}$$

Once the formants are obtained, they are organized in ascending order, and the first three are chosen as parameters of the speech segment.
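The root-to-formant mapping of Eqs. (6)-(8) can be sketched as follows; using the imaginary-part threshold of Eq. (7) to discard lower-half-plane roots is taken from the text, while the function name is illustrative.

```python
import numpy as np

def formants_from_lpc(a, fs=8000):
    """Formant frequencies from the roots of the LPC inverse-filter
    polynomial: keep upper-half-plane roots (Eq. (7) threshold), take
    each root's angle with arctan2, map it to Hz with f = angle*fs/(2*pi)
    (Eq. (8)), and return the first three in ascending order."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0.01]            # Eq. (7)
    angles = np.arctan2(np.imag(roots), np.real(roots))
    freqs = np.sort(angles * fs / (2.0 * np.pi))    # Eq. (8)
    return freqs[:3]

# Usage sketch: a conjugate pole pair at angle pi/4 and radius 0.95
# corresponds to a single resonance at 1000 Hz for fs = 8 kHz.
r = 0.95 * np.exp(1j * np.pi / 4)
a = np.real(np.poly([r, np.conj(r)]))
f = formants_from_lpc(a)
```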

#### *2.5.3. Mel Frequency Cepstral Coefficients (MFCC)*


The estimation of the cepstral coefficients is another widely used feature extraction method in speech recognition problems. These coefficients form a very good feature vector for the development of speech recognition algorithms, sometimes better than the LPC one.

The cepstrum is defined as the inverse Fourier transform of the logarithm of the spectrum magnitude, as given in (9):

$$c(t) = F^{-1}\left[\log\left|S(\omega)\right|\right] \tag{9}$$

Expanding the above equation, one obtains:

$$c(t) = F^{-1}\left[\log\left|E(\omega)\right| + \log\left|H(\omega)\right|\right] \tag{10}$$

The above equation indicates that the cepstrum of a signal is the sum of the cepstrum of the excitation source and that of the vocal tract filter. The vocal tract information varies slowly, and it appears in the first cepstral coefficients. For speech recognition applications, the vocal tract information is more important than the excitation source. The cepstral coefficients can be estimated from the LPC coefficients by applying the following recursion:

$$\begin{aligned} c\_0 &= \ln \sigma^2\\ c\_m &= a\_m + \sum\_{k=1}^{m-1} \left(\frac{k}{m}\right) c\_k a\_{m-k} \quad &1 \le m \le p\\ c\_m &= \sum\_{k=m-p}^{m-1} \left(\frac{k}{m}\right) c\_k a\_{m-k} \quad &m > p \end{aligned} \tag{11}$$
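A direct implementation of this recursion can be sketched as follows; the function name and argument layout are mine, with `a` holding the predictor coefficients in the convention of Eq. (3) and `err` the prediction error power.

```python
import numpy as np

def lpc_to_cepstrum(a, err, n_ceps=None):
    """Sketch of the LPC-to-cepstrum recursion of Eq. (11).
    a: predictor coefficients a_1..a_p; err: prediction error power
    sigma^2; returns c_0..c_{n_ceps}."""
    p = len(a)
    n_ceps = p if n_ceps is None else n_ceps
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(err)
    for m in range(1, n_ceps + 1):
        # sum runs over k = max(1, m-p)..m-1 so that a_{m-k} exists
        acc = sum((k / m) * c[k] * a[m - k - 1]
                  for k in range(max(1, m - p), m))
        c[m] = (a[m - 1] if m <= p else 0.0) + acc
    return c

c = lpc_to_cepstrum(np.array([0.5, 0.0]), 1.0, n_ceps=3)
```

As a sanity check, for the one-pole model H(z) = 1/(1 - 0.5 z^-1) the cepstrum is known in closed form to be c_m = 0.5^m / m, which the recursion reproduces.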

Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform 205

 

(12)

*<sup>t</sup> ψ t te πt t* (13)

(15)

(16)

and . Thus to

(14)

<sup>1</sup> <sup>1</sup> cos 2 / 0

The above equation indicates that the cepstrum of a signal is the sum of the cepstrum of the excitation source and that of the vocal tract filter. The vocal tract information varies slowly, so it appears in the first cepstral coefficients; for speech recognition applications the vocal tract information is more important than the excitation source. The cepstral coefficients can be estimated from the LPC coefficients by applying the following recursion:

$$c_0 = \ln \sigma^2, \qquad c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p \tag{11}$$

where *cm* is the *m*-th LPC-cepstral coefficient, *ak* is the *k*-th LPC coefficient, *σ²* is the prediction gain, *p* is the LPC order and *m* is the cepstral index.

Usually the number of cepstral coefficients is set equal to the number of LPC coefficients to avoid noise. A representation derived from the cepstral coefficients are the Mel Frequency Cepstral Coefficients (MFCC), whose fundamental difference from the cepstrum coefficients is that the frequency bands are positioned according to a logarithmic scale, known as the Mel scale, which approximates the frequency response of the human auditory system more closely than the linear resolution of the Fast Fourier Transform (FFT).
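As a sketch of how the recursion in (11) can be implemented (the function name and interface are illustrative, not from the chapter; a unit prediction gain makes $c_0 = 0$):

```python
import numpy as np

def lpc_to_cepstrum(a, gain=1.0):
    """LPC-cepstral coefficients via the recursion in (11).
    a: LPC coefficients a_1..a_p; returns c_0..c_p."""
    p = len(a)
    c = np.zeros(p + 1)
    c[0] = np.log(gain ** 2)  # c_0 carries the prediction gain
    for m in range(1, p + 1):
        # c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - 1 - k] for k in range(1, m))
    return c
```

As stated above, this yields the same number of cepstral coefficients as LPC coefficients.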

#### *2.5.4. Feature extraction method based on Wavelet Transform*

The most widely used feature extraction methods, such as those described above, are based on modeling the form in which the speech signal is produced. However, if the speech signals are processed taking into account the form in which they are perceived by the human ear, similar or even better results may be obtained. Thus an ear-model-based feature extraction method represents an attractive alternative, since this approach characterizes the speech signal in the form in which it is perceived [8]. This section proposes a feature extraction method based on an inner ear model, which takes into account the fundamental concepts of critical bands.

In the inner ear, the basilar membrane carries out a time-frequency decomposition of the audible signal through a multiresolution analysis similar to that performed by a wavelet transform. Thus a feature extraction method that emulates the basilar membrane operation should carry out a similar frequency decomposition, as proposed in the inner ear model developed by Zhang et al. [9]. In this model the dynamics of the basilar membrane section with characteristic frequency *fc* can be modeled by a gamma-tone filter, which consists of a gamma distribution multiplied by a pure tone of frequency *fc*. The shape parameter of the gamma distribution, *α*, is related to the filter order, while the scale parameter, *θ*, is related to the period of occurrence of the events under analysis when they follow a Poisson distribution. Thus the gamma-tone filter representing the impulse response of the basilar membrane is given by (12)


$$\varphi_\theta^\alpha(t) = \frac{1}{(\alpha-1)!\,\theta^\alpha}\; t^{\alpha-1} e^{-t/\theta} \cos\left(2\pi t/\theta\right), \quad t > 0 \tag{12}$$

Equation (12) defines a family of gamma-tone filters characterized by *α* and *θ*. Thus to emulate the basilar membrane behavior, it is necessary to look for the most suitable filter bank which, according to the basilar membrane model given by Zhang [9], can be obtained by setting *θ=1* and *α=3*, since these values in (12) result in the best approximation to the inner ear dynamics. From (12) we have:

$$
\psi(t) = \frac{1}{2}t^2 e^{-t} \cos\left(2\pi t\right) \quad t > 0 \tag{13}
$$

Taking the Fourier transform of (13)


$$\Psi(\omega) = \frac{1}{2}\left\{\frac{1}{\left[1 + j\left(\omega - 2\pi\right)\right]^3} + \frac{1}{\left[1 + j\left(\omega + 2\pi\right)\right]^3}\right\} \tag{14}$$

It can be shown that *ψ (t)* presents the expected attributes of a mother wavelet since it satisfies the admissibility condition given by (15)

$$\int_{-\infty}^{\infty} \frac{\left| \Psi(\omega) \right|^2}{\left| \omega \right|}\, d\omega < \infty \tag{15}$$

This means that *ψ(t)* can be used to analyze and then reconstruct a signal without loss of information [10]. That is, the functions given by (13) constitute an unconditional basis in *L*2(**R**), and then we can estimate the expansion coefficients of an audio signal *f(t)* by using the scalar product between *f(t)* and the function *ψ(t)* with translation *τ* and scaling factor *s* as follows:

$$C\left(\tau, s\right) = \frac{1}{\sqrt{s}} \int_0^{\infty} f(t)\, \psi\!\left(\frac{t-\tau}{s}\right)dt \tag{16}$$

A sampled version of (16) must be specified because we require characterizing discrete time speech signals. To this end, a sampling of the scale parameter, s, involving the psychoacoustical phenomenon known as critical bandwidths will be used [11].

The critical bands theory models the basilar membrane operation as a filter bank in which the bandwidth of each filter increases as its central frequency increases. This requirement can be satisfied using the Bark frequency scale, a logarithmic scale in which the frequency resolution of any section of the basilar membrane is exactly one Bark, regardless of its characteristic frequency. Because the Bark scale is characterized by a biological parameter, there is no exact expression for it; as a result, several different proposals are available in the literature. Among them, the statistical fitting provided by Schroeder et al. [11] appears to be a suitable choice. Using this approach, the relation between the linear frequency *f* in Hz and the Bark frequency *Z* is given by

$$Z = 7\ln\left(\frac{f}{650} + \sqrt{\left(\frac{f}{650}\right)^2 + 1}\right) \tag{17}$$



Using (17) the *j-th* scaling factor *sj* given by the inverse of the *j-th* central frequency in Hz, *fc,* corresponding to the *j-th* band in the Bark frequency scale becomes

$$s_j = \frac{e^{j/7}}{325\left(e^{2j/7} - 1\right)}, \quad j = 1, 2, 3, \dots \tag{18}$$

The inclusion of the Bark frequency in the scaling factor estimation, together with the relation between (17) and the dynamics of the basilar membrane, allows a frequency decomposition similar to that carried out by the human ear. Since the scaling factor given by (18) satisfies the Littlewood-Paley theorem (19)

$$\lim\_{j \to \pm \infty} \frac{s\_{j+1}}{s\_j} = \lim\_{j \to \pm \infty} \frac{e^{(j+1)/7} \left(e^{2j/7} - 1\right)}{e^{j/7} \left(e^{2(j+1)/7} - 1\right)} = e^{-1/7} \neq 1\tag{19}$$

there is no information loss during the sampling process. Finally, the number of subbands is related to the sampling frequency as follows

$$j\_{\max} = \text{int}\left[\mathcal{T}\ln\left(\frac{f\_s}{1300} + \sqrt{\left(\frac{f\_s}{1300}\right)^2 + 1}\right)\right] \tag{20}$$

Therefore, for a sampling frequency equal to 8 kHz the number of subbands becomes 17. Finally, the translation axis is naturally sampled because the input data is a discrete time signal, and then the *j-th* decomposition signal can be estimated as follows

$$C_j(m) = \sum_{n=-\infty}^{\infty} f\left(n\right) \psi_j\left(n - m\right) \tag{21}$$

where

$$\psi_j(n) = \frac{1}{2} \left(\frac{nT}{s_j}\right)^2 e^{-\left(\frac{nT}{s_j}\right)} \cos\left(\frac{2\pi nT}{s_j}\right), \quad n > 0 \tag{22}$$

In (22) T denotes the sampling period. The expansion coefficients *Cj* obtained for each subband are used to estimate the feature vector to be used during the training and recognition tasks.
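A minimal sketch of this decomposition, combining (18), (20), (21) and (22); the function names, the wavelet truncation length and the use of `np.correlate` for the sum in (21) are implementation choices, not from the chapter:

```python
import numpy as np

FS = 8000            # sampling frequency in Hz
T = 1.0 / FS         # sampling period of (22)

def num_subbands(fs):
    """j_max from (20)."""
    x = fs / 1300.0
    return int(7.0 * np.log(x + np.sqrt(x * x + 1.0)))

def scale_factor(j):
    """s_j from (18): the inverse of the j-th Bark-band center frequency."""
    return np.exp(j / 7.0) / (325.0 * (np.exp(2.0 * j / 7.0) - 1.0))

def psi(j, length=256):
    """Discrete wavelet psi_j(n) from (22), truncated to `length` samples."""
    n = np.arange(1, length + 1)
    u = n * T / scale_factor(j)
    return 0.5 * u ** 2 * np.exp(-u) * np.cos(2.0 * np.pi * u)

def decompose(f):
    """Expansion coefficients C_j(m) from (21) for each subband (a correlation)."""
    return [np.correlate(f, psi(j), mode="same") for j in range(1, num_subbands(FS) + 1)]
```

For `FS = 8000`, `num_subbands` returns 17, so `decompose` yields the 17 subband signals used below.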

Using (21), the feature vector used for voiced segment identification consists of the following parameters:

a. The energy of the *m-th* speech signal frame, $\overline{x^2}(n - mN)$, where $1 \le n \le N$ and *N* is the number of samples in the *m-th* frame.
b. The energy contained in each one of the 17 wavelet decomposition levels of the *m-th* speech frame, $\overline{c_j^2}(m)$, where $1 \le j \le 17$.
c. The difference between the energy of the previous and current frames, given by (23)

$$d_x(m) = \overline{x^2}(n - mN) - \overline{x^2}(n - (m - 1)N) \tag{23}$$

d. The difference between the energy contained in each one of the 17 wavelet decomposition levels of current and previous frames given by (24)

$$\bar{v}_j(m) = \overline{c_j^2}(m) - \overline{c_j^2}(m-1) \tag{24}$$

where *m* is the frame number. Then the feature vector derived using the proposed approach becomes

$$\mathbf{X}(m) = \left[ \overline{\mathbf{x}^2}(n - mN), \overline{c\_1^2}(m), \overline{c\_2^2}(m), \dots, \overline{c\_{17}^2}(m), d\_x(m), \overline{v}\_1(m), \overline{v}\_2(m), \dots, \overline{v}\_{17}(m) \right] \tag{25}$$

The last eighteen members of the feature vector capture the spectral dynamics of the speech signal, concatenating the variation from the past feature vector to the current one.
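The 36-dimensional feature vector of (25) can then be assembled per frame; a sketch (the mean-square averaging used for the energies and the function interface are assumptions):

```python
import numpy as np

def feature_vector(frame, prev_frame, coeffs, prev_coeffs):
    """Feature vector X(m) of (25) for the m-th frame.
    frame/prev_frame: N-sample frames; coeffs/prev_coeffs: the 17 subband
    coefficient arrays of the wavelet decomposition for each frame."""
    e = np.mean(frame ** 2)                                     # frame energy (item a)
    c = np.array([np.mean(cj ** 2) for cj in coeffs])           # 17 subband energies (item b)
    d_x = e - np.mean(prev_frame ** 2)                          # energy difference (23)
    v = c - np.array([np.mean(cj ** 2) for cj in prev_coeffs])  # subband differences (24)
    return np.concatenate(([e], c, [d_x], v))                   # 1 + 17 + 1 + 17 = 36 entries
```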

#### **2.6. Classification stage**


The classification stage consists of one neural network, which identifies the vowel, in cascade with a parallel array of 5 neural networks, which identify the alaryngeal speech segment to be replaced by its equivalent normal speech segment, as shown in Figure 2. At this point, the estimated feature vector, given by (25), is fed into the first ANN (Figure 2) to estimate the vowel present in the segment under analysis. Once the vowel is identified, the same feature vector is fed into the five ANN structures of the second stage, along with the output of the first ANN, to identify the vowel-consonant combination contained in the voiced segment under analysis. The output of the enabled ANN corresponds to the codebook index of the identified segment. Thus the first ANN output is used to enable the ANN corresponding to the detected vowel, disabling the other four, while the enabled second-stage ANN identifies the vowel-consonant or vowel-vowel combination. The ANN in the first stage has 10 hidden neurons, while the ANNs in the second stage have 25.

The ANN training process is carried out in two steps. First, the ANN used to identify the vowel contained in the speech segment is trained in a supervised manner using the backpropagation algorithm. After convergence is achieved, the enabled ANN in the second stage, which identifies the vowel-consonant or vowel-vowel combination, is also trained in a supervised manner using the backpropagation algorithm, while the coefficient vectors of the other 4 ANNs are kept constant. In all cases 650 different alaryngeal voiced segments with a convergence factor equal to 0.009 are used, achieving a global mean square error of 0.1 after 400,000 iterations.
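A structural sketch of this cascade follows; the weights are random stand-ins for the backpropagation-trained coefficients, and the codebook size of 64 is an assumed value, not stated in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(n_in, n_hidden, n_out):
    """One-hidden-layer network as (W1, b1, W2, b2)."""
    return (rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden),
            rng.standard_normal((n_out, n_hidden)), np.zeros(n_out))

def forward(net, x):
    W1, b1, W2, b2 = net
    return W2 @ np.tanh(W1 @ x + b1) + b2

vowel_net = make_net(36, 10, 5)                          # first stage: 10 hidden neurons
segment_nets = [make_net(36, 25, 64) for _ in range(5)]  # second stage: 25 hidden neurons each

def classify(x):
    """First ANN picks the vowel; its output enables one second-stage ANN,
    which returns the codebook index of the identified segment."""
    vowel = int(np.argmax(forward(vowel_net, x)))
    index = int(np.argmax(forward(segment_nets[vowel], x)))
    return vowel, index
```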



**Figure 2.** Pattern recognition stage. The first ANN identifies the vowel present in the segment and the other 5 ANNs identify the consonant-vowel combination.

#### **2.7. Synthesis stage**

This stage provides the restored speech signal. According to Figure 1, if a silence or unvoiced segment is detected, the switch is enabled and the segment is concatenated with the previous one to produce the output signal. If voice activity is detected, the speech segment is analyzed using the energy analysis, the zero crossings number and the formant analysis explained in section 2.4. If a voiced segment is detected, it is identified using pattern recognition techniques (ANN). Then the alaryngeal voiced segment is replaced by the equivalent normal speech voiced segment, contained in the codebook, which is finally concatenated with the previous segments to synthesize the restored speech signal.
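The concatenation logic just described can be sketched as follows (the segment labeling and the `classify` interface are hypothetical placeholders for the detection and ANN stages):

```python
import numpy as np

def synthesize(segments, codebook, classify):
    """Replace voiced alaryngeal segments by their normal-speech codebook
    equivalents; concatenate silence/unvoiced segments unchanged.
    segments: list of (samples, kind), kind in {"silence", "unvoiced", "voiced"}."""
    out = []
    for samples, kind in segments:
        if kind == "voiced":
            out.append(codebook[classify(samples)])  # normal-speech replacement
        else:
            out.append(samples)                      # pass through and concatenate
    return np.concatenate(out)
```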

## **3. Results**

Figure 3 shows the plots of mono-aural recordings of the Spanish words "abeja" (a), "adicto" (b) and "cupo" (c), pronounced by an esophageal speaker with a sampling frequency of 8 kHz, including the detected voiced segments. Figure 3 shows that a correct detection is achieved using the combination of several features, in this case the zero crossings, formant analysis and average energy.


**Figure 3.** Detected voiced/unvoiced segments of esophageal speech signal of Spanish words "abeja" (a), "adicto" (b) and "cupo" (c).

Figure 4 shows the produced esophageal speech signal corresponding to the Spanish word "cachucha" (cap) together with the restored signal obtained using the proposed system. The corresponding spectrograms of both signals are shown in Figure 5.

To evaluate the actual performance of the proposed system, two different criteria were used: the mean Bark spectral distortion (MBSD) and the mean opinion score (MOS). The Bark spectrum *L(f)* reflects the ear's nonlinear transformation of frequency and amplitude, together with the important aspects of its frequency and spectral integration properties in response to complex sounds. Using the Bark spectrum, an objective measure of the distortion can be defined using the overall distortion as the mean Euclidean distance

between the spectral vectors of the normal speech, *Ln(k,i)*, and the processed ones, *Lp(k,i)*, taken over successive frames as follows.



**Figure 4.** Waveforms corresponding to the Spanish word "Cachucha" (cap): a) produced esophageal speech, b) restored speech.

**Figure 5.** Spectrograms corresponding to the Spanish word "Cachucha" (cap): a) normal speech, b) produced esophageal speech, c) restored speech.

$$MBSD = \frac{\sum\_{k=1}^{N} \sum\_{i=1}^{M} \left[ L\_n(k, i) - L\_p(k, i) \right]^2}{\sum\_{k=1}^{N} \sum\_{i=1}^{M} L\_n^2(k, i)} \tag{26}$$

where *Ln(k,i)* is the Bark spectrum of the *k*-th segment of the original signal, *Lp(k,i)* is the Bark spectrum of the processed signal and *M* is the number of critical bands. Figures 6 and 7 show the Bark spectral traces of the produced esophageal speech and the enhanced signals, corresponding respectively to the Spanish words "hola" (hello) and "mochila" (bag). Here the MBSD during voiced segments was equal to 0.2954 and 0.4213 for "hola" and "mochila", respectively, while during unvoiced segments the MBSD was 0.6815 and 0.7829 for "hola" and "mochila", respectively. The distortion decreases during the voiced periods, as suggested by (26). Evaluation results using the Bark spectral distortion measure show that a good enhancement can be achieved using the proposed method.
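The distortion measure of (26) reduces to a few lines; a sketch, assuming the Bark spectra have already been computed frame by frame:

```python
import numpy as np

def mbsd(Ln, Lp):
    """Normalized Bark spectral distortion of (26).
    Ln, Lp: arrays of shape (N frames, M critical bands) holding the Bark
    spectra of the reference and processed signals."""
    Ln = np.asarray(Ln, dtype=float)
    Lp = np.asarray(Lp, dtype=float)
    return np.sum((Ln - Lp) ** 2) / np.sum(Ln ** 2)
```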


**Figure 6.** Bark spectral trace of normal, Ln(k), and enhanced, Lp(k), speech signals of the Spanish word "hola".

**Figure 7.** Bark spectral trace of normal, Ln(k), and enhanced, Lp(k), speech signals of the Spanish word "mochila".

A subjective evaluation was also performed using the mean opinion score (MOS), in which the proposed system was evaluated by 200 normal-speaking persons and 200 alaryngeal ones (Table 1 and Table 2) from the point of view of intelligibility and speech quality, where 5 is the highest score and 1 is the lowest one. In both cases the speech intelligibility and quality evaluations without enhancement are shown for comparison. These evaluation results show that the proposed system improves on the performance of [2], which reports a MOS of 2.91 when the enhancement system is used and 2.3 without enhancement. These results also show that, although the improvement is perceived by both the alaryngeal and normal speakers, the improvement is larger in the opinion of alaryngeal speakers. Thus the proposed system is expected to have quite good acceptance among alaryngeal speakers, because it allows synthesizing several kinds of male and female speech signals.



Finally, about 95% of alaryngeal persons participating in the subjective evaluation preferred the use of the proposed system during conversation. Subjective evaluation shows that quite a good performance enhancement can be obtained using the proposed system.




|  | Quality (normal listener) | Intelligibility (normal listener) | Quality (alaryngeal listener) | Intelligibility (alaryngeal listener) |
|---|---|---|---|---|
| MOS | 2.30 | 2.61 | 2.46 | 2.80 |
| Var | 0.086 | 0.12 | 0.085 | 0.11 |

**Table 1.** Subjective evaluation of esophageal speech without enhancement.

|  | Quality (normal listener) | Intelligibility (normal listener) | Quality (alaryngeal listener) | Intelligibility (alaryngeal listener) |
|---|---|---|---|---|
| MOS | 2.91 | 2.74 | 3.42 | 3.01 |
| Var | 0.17 | 0.102 | 0.16 | 0.103 |

**Table 2.** Subjective evaluation of the proposed alaryngeal speech enhancement system.

The performance of the voiced segment classification stage was evaluated using 450 different alaryngeal voiced segments. The system failed to classify 22 segments correctly, which represents a misclassification rate of about 5% using the neural network as identification method, while a misclassification rate of about 7% was obtained using Hidden Markov Models (HMM). The comparison results are given in Table 3.

| Identification method | Normal speech | Alaryngeal speech |
|---|---|---|
| ANN | 98% | 95% |
| HMM | 97% | 93% |

**Table 3.** Recognition performance of two different identification methods using the feature extraction method based on the wavelet transform.

The behavior of the proposed feature extraction method was also compared with the performance of several other wavelet functions. Comparison results, shown in Table 4, indicate that the proposed method has better performance than the other wavelet-based feature extraction methods.

|  | Proposed method | Daub 4 wavelet | Haar wavelet | Mexican hat wavelet | Morlet wavelet |
|---|---|---|---|---|---|
| Recognition rate | 95% | 75% | 40% | 79% | 89% |

**Table 4.** Performance of different wavelet-based feature extraction methods when an ANN is used as identification method.

## **4. Conclusions**



This chapter proposed an alaryngeal speech restoration system, suitable for esophageal and ALT-produced speech, based on a pattern recognition approach in which the voiced segments are replaced by equivalent segments of normal speech contained in a codebook. Evaluation results show a correct detection of voiced segments, verified by comparing their spectrograms with those of normal speech signals. Objective and subjective evaluation results show that the proposed system provides a good improvement in the intelligibility and quality of esophageal speech, making it an attractive alternative for enhancing alaryngeal speech signals. The chapter also presents a flexible structure that allows the proposed system to enhance both esophageal and artificial-larynx-produced speech without further modifications. The proposed system could be used to enhance alaryngeal speech in several practical situations, such as telephone and teleconference systems, thus improving the voice and the quality of life of alaryngeal people.
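The codebook replacement strategy described above can be sketched as a nearest-neighbour lookup: the feature vector of each detected voiced segment is matched against the feature vectors of normal-speech segments stored in a codebook, and the closest entry is substituted into the output. The sketch below is illustrative only; the feature vectors, codebook layout, and distance measure are assumptions, not the authors' implementation (which uses wavelet-based features).

```python
# Hypothetical sketch of codebook-based voiced-segment replacement.
# In the chapter, the feature vectors come from the wavelet-based
# extraction stage; here they are plain lists of floats for illustration.
import math

def nearest_codebook_entry(features, codebook):
    """Return the normal-speech codebook entry whose feature vector is
    closest (Euclidean distance) to the alaryngeal segment's features."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(codebook, key=lambda entry: dist(entry["features"], features))

# Toy codebook: each entry pairs a feature vector with a normal-speech segment
codebook = [
    {"features": [0.1, 0.9], "segment": "normal_/a/"},
    {"features": [0.8, 0.2], "segment": "normal_/i/"},
]
best = nearest_codebook_entry([0.15, 0.85], codebook)
print(best["segment"])  # prints "normal_/a/"
```

The matched normal-speech segment would then be spliced into the output stream in place of the distorted alaryngeal segment, which is what yields the quality and intelligibility gains reported above.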

## **Author details**

Alfredo Victor Mantilla Caeiros *Tecnológico de Monterrey, Campus Ciudad de Mexico, México* 

Hector Manuel Pérez Meana *Instituto Politécnico Nacional, México* 

## **5. References**


[3] D. Cole, S. Sridharan and M. Geva (1997), "Application of noise reduction techniques for alaryngeal speech enhancement", Proceedings of IEEE TENCON, Speech and Image Processing for Computing and Telecommunications, pp. 491-494.

**Chapter 10** 

**Cochlear Implant Stimulation Rates and Speech Perception** 

Komal Arora, Richard Dowell and Pam Dawson

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49992

© 2012 Arora et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**1. Introduction** 

For individuals with severe to profound hearing losses, due to disease or damage to the inner ear, acoustic stimulation (via hearing aids) may not provide sufficient information for adequate speech perception. In such cases, direct electrical stimulation of the auditory nerve by surgically implanted electrodes has been beneficial in restoring useful hearing. This chapter will provide a general overview regarding sound perception through electrical stimulation using multi-channel cochlear implants.

**1.1. Cochlear implant system** 

**Figure 1.** Cochlear implant system (Cowan, 2007)