**Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition**

Longbiao Wang, Kyohei Odani, Atsuhiko Kai, Norihide Kitaoka and Seiichi Nakagawa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/48430

## **1. Introduction**


In a distant-talking environment, channel distortion drastically degrades speech recognition performance because of the mismatch between the training and testing environments. Current approaches to making automatic speech recognition (ASR) robust to reverberation and noise can be classified into speech signal processing [1, 4, 5, 14], robust feature extraction [10, 20], and model adaptation [3, 25].

In this chapter, we focus on speech signal processing in the distant-talking environment. Because both the speech signal and the reverberation are nonstationary signals, dereverberation, that is, recovering clean speech from the convolution of a nonstationary speech signal with an impulse response, is very difficult. Several studies have focused on mitigating this problem [8, 9, 11, 12]. [1] explored a speech dereverberation technique whose principle was the recovery of the envelope modulations of the original (anechoic) speech, applying a technique originally developed to treat background noise [11] to the dereverberation problem. [7] proposed a novel approach for multi-microphone speech dereverberation based on the construction of the null subspace of the data matrix in the presence of colored noise, employing generalized singular-value decomposition or generalized eigenvalue decomposition of the respective correlation matrices. A reverberation compensation method for speaker recognition using spectral subtraction, in which the late reverberation is treated as additive noise, was proposed in [16, 17]. However, the drawback of this approach is that the optimum parameters for spectral subtraction are empirically estimated from a development dataset, and the late reverberation cannot be subtracted well since it is not modeled precisely. [18] proposed a dereverberation method utilizing multi-step forward linear prediction, estimating the linear prediction coefficients in the time domain and suppressing the amplitude of late reflections through spectral subtraction in the spectral domain.

©2012 Wang et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


**Figure 1.** Schematic diagram of the blind dereverberation method: multi-channel reverberant speech → DFT → late reverberation reduction based on spectral subtraction (using the estimated spectra of the impulse responses) → IDFT → early reverberation normalization.

In this chapter, we propose a robust distant-talking speech recognition method based on spectral subtraction (SS) employing the multi-channel least mean square (MCLMS) algorithm. Speech captured by distant-talking microphones is distorted by reverberation. With a long impulse response, the spectrum of the distorted speech is approximated by convolving the spectrum of clean speech with the spectrum of the impulse response, as explained in the next section. This enables us to treat the late reverberation as additive noise, so a noise reduction technique based on spectral subtraction can easily be applied to compensate for it. By excluding phase information from the dereverberation operation, reverberation reduction in the power spectral domain provides robustness against certain errors that the conventional, sensitive inverse filtering method cannot achieve [18]. The compensation parameter (that is, the spectrum of the impulse response) for spectral subtraction is required. An adaptive MCLMS algorithm was proposed to blindly identify the channel impulse response in the time domain [12–14]. In this chapter, we extend this method to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain. The early reverberation is normalized by CMN [6]. Power SS is the most commonly used SS method, but a previous study has shown that generalized SS (GSS) with a lower exponent parameter is more effective than power SS for noise reduction [26]. In this chapter, both power SS and GSS are employed to suppress late reverberation. A diagram of the proposed method is shown in Fig. 1.

In this chapter, we also investigate the robustness of the power SS-based dereverberation under various reverberant conditions for large vocabulary continuous speech recognition (LVCSR). We analyze the factors affecting compensation parameter estimation for dereverberation based on power SS (the number of reverberation windows, the number of channels, the utterance length, and the distance between the sound source and the microphone) in a simulated reverberant environment.

The remainder of this chapter is organized as follows. Section 2 outlines blind dereverberation based on spectral subtraction. Section 3 describes a multi-channel method based on the LMS algorithm for estimating the power spectrum of the impulse response (that is, the compensation parameter for spectral subtraction). Section 4 presents experimental results for hands-free speech recognition in both simulated and real reverberant environments. Finally, Section 5 summarizes the chapter.

## **2. Outline of blind dereverberation**

### **2.1. Dereverberation based on power SS**

If speech *s*[*t*] is corrupted by convolutional noise *h*[*t*] and additive noise *n*[*t*], the observed speech *x*[*t*] becomes

$$x[t] = h[t] * s[t] + n[t], \tag{1}$$

where \* denotes the convolution operation. In this chapter, additive noise is ignored for simplification, so Eq. (1) becomes *x*[*t*] = *h*[*t*] ∗ *s*[*t*].
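As a quick worked illustration of this model, the following sketch (with synthetic placeholder signals, not data from this chapter) builds a toy reverberant observation by direct convolution:

```python
# Toy illustration of Eq. (1) with the additive-noise term dropped:
# x[t] = h[t] * s[t]. All signals are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                           # stand-in for clean speech s[t]
h = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
h[0] = 1.0                                               # direct path

x = np.convolve(s, h)                                    # reverberant observation x[t]
print(len(s), len(h), len(x))                            # 16000 4000 19999
```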


To analyze the effect of the impulse response, $h[t]$ can be separated into two parts, $h_{early}[t]$ and $h_{late}[t]$, as [16, 17]

$$h_{early}[t] = \begin{cases} h[t] & t < T \\ 0 & \text{otherwise} \end{cases}, \qquad h_{late}[t] = \begin{cases} h[t+T] & t \geq 0 \\ 0 & \text{otherwise} \end{cases}, \tag{2}$$

where *T* is the length of the spectral analysis window and $h[t] = h_{early}[t] + \delta(t-T) * h_{late}[t]$, with $\delta(\cdot)$ a Dirac delta function (that is, a unit impulse function). Eq. (1) can then be rewritten as

$$x[t] = s[t] * h_{early}[t] + s[t-T] * h_{late}[t], \tag{3}$$

where the early effect is distortion within a frame (analysis window), and the late effect comes from previous multiple frames.
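The split in Eq. (2) is easy to state in code. A minimal sketch, assuming a toy impulse response and an arbitrary window length T:

```python
# Splitting a toy impulse response h into h_early and h_late at the window
# length T, per Eq. (2); values are placeholders.
import numpy as np

T = 256                                        # spectral analysis window length
h = np.random.default_rng(0).standard_normal(2048)

h_early = np.zeros_like(h); h_early[:T] = h[:T]          # h[t] for t < T
h_late = np.zeros_like(h); h_late[:len(h) - T] = h[T:]   # h[t + T] for t >= 0

# Consistency check: h[t] = h_early[t] + (delta(t - T) * h_late)[t]
recon = h_early.copy()
recon[T:] += h_late[:len(h) - T]
assert np.allclose(recon, h)
```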

When the length of the impulse response is much shorter than the analysis window size *T* used for the short-time Fourier transform (STFT), the STFT of the distorted speech equals the STFT of the clean speech multiplied by the STFT of the impulse response $h[t]$ (in this case, $h[t] = h_{early}[t]$). However, when the length of the impulse response is much longer than the analysis window size, the STFT of the distorted speech is usually approximated by

$$\begin{split} X(f,\omega) &\approx S(f,\omega) \ast H(\omega) \\ &= S(f,\omega)H(0,\omega) + \sum\_{d=1}^{D-1} S(f-d,\omega)H(d,\omega), \end{split} \tag{4}$$

where *f* is the frame index, $H(\omega)$ is the STFT of the impulse response, $S(f, \omega)$ is the STFT of the clean speech *s*, and $H(d, \omega)$ denotes the part of $H(\omega)$ corresponding to frame delay *d*. That is to say, with a long impulse response, the channel distortion is no longer multiplicative in the linear spectral domain; rather, it is convolutional [25].
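The approximation in Eq. (4) can be checked numerically. In the rough sketch below, $H(d,\omega)$ is taken as the DFT of the *d*-th window-length segment of the impulse response and the frames do not overlap; both choices are simplifications for illustration, not the chapter's estimator.

```python
# Rough numerical sketch of Eq. (4) with non-overlapping frames. H(d, w) is
# approximated by the DFT of the d-th window-length segment of h (a
# simplification); X_approx accumulates sum_d S(f-d, w) H(d, w).
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                           # toy clean signal
h = rng.standard_normal(2048) * np.exp(-np.arange(2048) / 500.0)

win = 256
w = np.hamming(win)

def stft(sig):
    n = len(sig) // win
    return np.stack([np.fft.rfft(w * sig[i*win:(i+1)*win]) for i in range(n)])

S = stft(s)                                              # S(f, w)
X = stft(np.convolve(s, h)[:len(s)])                     # X(f, w), reverberant

D = len(h) // win                                        # number of frame delays
H = np.stack([np.fft.rfft(h[d*win:(d+1)*win], win) for d in range(D)])

X_approx = np.zeros_like(S)
for d in range(D):                                       # right-hand side of Eq. (4)
    X_approx[d:] += S[:len(S) - d] * H[d]
```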

[17] proposed a far-field speaker recognition method based on spectral subtraction. In this method, the early term of Eq. (3) was compensated by conventional CMN, whereas the late term of Eq. (3) was treated as additive noise, and a noise reduction technique based on spectral subtraction was applied as

$$|\hat{S}(f,\omega)| = \max(|X(f,\omega)| - \alpha \cdot g(\omega)|X(f-1,\omega)|,\ \beta \cdot |X(f,\omega)|), \tag{5}$$

where *α* is the noise overestimation factor, *β* is the spectral floor parameter that avoids negative or underflow values, and $g(\omega)$ is a frequency-dependent value determined on a development set; it was fixed as $|1 - 0.9e^{j\omega}|$ in [17]. However, the drawback of this approach is that the optimum parameters *α* and *β* for the spectral subtraction are empirically estimated on a development dataset, and the STFT of the late effect of the impulse response, the second term on the right-hand side of Eq. (4), is not subtracted straightforwardly since the late reverberation is not modelled precisely.

In this chapter, we propose a dereverberation method based on spectral subtraction that estimates the STFT of the clean speech $\hat{S}(f, \omega)$ based on Eq. (4); the spectrum of the impulse response used for the spectral subtraction is blindly estimated using the method described in Section 3. Assuming for simplicity that the phases of different frames are uncorrelated, the power spectrum of Eq. (4) can be approximated as

$$|X(f,\omega)|^2 \approx |S(f,\omega)|^2 |H(0,\omega)|^2 + \sum\_{d=1}^{D-1} |S(f-d,\omega)|^2 |H(d,\omega)|^2. \tag{6}$$


The estimated power spectrum of the clean speech may not be very accurate owing to the estimation error of the impulse response, especially of its early part. In addition, an unreliable estimate of the clean power spectrum in a previous frame causes further estimation error in the current frame. In this chapter, the late reverberation is therefore reduced based on power SS, while the early reverberation is normalized by CMN at the feature extraction stage. A diagram of the proposed method is shown in Fig. 1. A spectral floor is used to prevent the estimate obtained by reducing the late reverberation from becoming negative; the estimated power spectrum $|\hat{X}(f, \omega)|^2$ obtained by reducing the late reverberation then becomes

$$|\hat{X}(f,\omega)|^2 \approx \max\left\{|X(f,\omega)|^2 - \alpha \cdot \frac{\sum_{d=1}^{D-1} |\hat{X}(f-d,\omega)|^2 |\hat{H}(d,\omega)|^2}{|\hat{H}(0,\omega)|^2},\ \beta \cdot |X(f,\omega)|^2\right\}, \tag{7}$$

where $|\hat{X}(f, \omega)|^2 = |\hat{S}(f, \omega)|^2 |\hat{H}(0, \omega)|^2$, $|\hat{S}(f, \omega)|^2$ is the power spectrum of the estimated clean speech, and $\hat{H}(d, \omega)$ is the estimated STFT of the impulse response. To estimate the power spectra of the impulse responses, we extend the multi-channel LMS algorithm for identifying the impulse responses in the time domain [14] to the frequency domain in Section 3.2.

### **2.2. Dereverberation based on GSS**

Previous studies have shown that GSS with an arbitrary exponent parameter is more effective than power SS for noise reduction [26]. In this chapter, we extend GSS to suppress late reverberation. Instead of the power SS-based dereverberation given in Eq. (7), GSS-based dereverberation is modified as

$$|\hat{X}(f,\omega)|^{2n} \approx \max\left\{|X(f,\omega)|^{2n} - \alpha \cdot \frac{\sum_{d=1}^{D-1} |\hat{X}(f-d,\omega)|^{2n} |\hat{H}(d,\omega)|^{2n}}{|\hat{H}(0,\omega)|^{2n}},\ \beta \cdot |X(f,\omega)|^{2n}\right\}, \tag{8}$$

where *n* is the exponent parameter. For power SS, the exponent parameter *n* is equal to 1. In this chapter, the exponent parameter *n* is set to 0.1 as this value yielded the best results [26].

The methods given in Eq. (7) and Eq. (8) are referred to as the *power SS-based* and *GSS-based dereverberation methods*, respectively.
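A minimal sketch of the two rules follows: with exponent $n = 1$ it implements the power SS-based dereverberation of Eq. (7), and with $n = 0.1$ the GSS-based variant of Eq. (8). The estimated magnitudes $|\hat{H}(d,\omega)|$ are assumed to be given (they are blindly estimated in Section 3); the function name, array layout, and epsilon guard are choices of this sketch.

```python
# Sketch of the recursive spectral subtraction of Eq. (8); n=1.0 gives the
# power-SS rule of Eq. (7). X_mag: (frames, bins) observed STFT magnitudes;
# H_mag: (D, bins) estimated |H(d, w)|. Illustrative only.
import numpy as np

def dereverb_gss(X_mag, H_mag, n=0.1, alpha=1.0, beta=0.15, eps=1e-12):
    X2n = X_mag ** (2 * n)                         # |X(f, w)|^(2n)
    H2n = H_mag ** (2 * n)                         # |H(d, w)|^(2n)
    D = H2n.shape[0]
    Xhat = np.empty_like(X2n)                      # |Xhat(f, w)|^(2n)
    for f in range(X2n.shape[0]):
        late = np.zeros(X2n.shape[1])
        for d in range(1, min(D, f + 1)):          # previously estimated frames
            late += Xhat[f - d] * H2n[d]
        sub = X2n[f] - alpha * late / (H2n[0] + eps)
        Xhat[f] = np.maximum(sub, beta * X2n[f])   # spectral floor
    return Xhat ** (1.0 / (2 * n))                 # back to magnitude

# Usage: S_mag = dereverb_gss(np.abs(X_stft), H_est_mag); reuse the phase of
# X_stft for the inverse DFT, since phase is excluded from dereverberation.
```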

### **2.3. Dereverberation and denoising based on GSS**

The precision of the impulse response estimation is drastically degraded when additive noise is present. We therefore present a dereverberation and denoising method based on GSS. A diagram of the processing method is shown in Fig. 2. First, the spectrum of the additive noise is estimated and noise reduction is performed. Then, the reverberation is suppressed using the estimated spectra of the impulse responses. When additive noise is present, the power spectrum of Eq. (6) becomes

$$|X(f,\omega)|^2 \approx |S(f,\omega)|^2 |H(0,\omega)|^2 + \sum_{d=1}^{D-1} |S(f-d,\omega)|^2 |H(d,\omega)|^2 + |N(f,\omega)|^2, \tag{9}$$

where $N(f, \omega)$ is the spectrum of the noise $n(t)$.

**Figure 2.** Schematic diagram of the SS-based dereverberation and denoising method.

To suppress the noise and reverberation simultaneously, Eq. (8) is modified as


$$|\hat{X}(f,\omega)|^{2n} \approx \max\left\{|\hat{X}_N(f,\omega)|^{2n} - \alpha_1 \cdot \frac{\sum_{d=1}^{D-1} |\hat{X}(f-d,\omega)|^{2n} |\hat{H}(d,\omega)|^{2n}}{|\hat{H}(0,\omega)|^{2n}},\ \beta_1 \cdot |\hat{X}_N(f,\omega)|^{2n}\right\}, \tag{10}$$

$$|\hat{X}_N(f,\omega)|^{2n} \approx \max\{|X(f,\omega)|^{2n} - \alpha_2 \cdot |\hat{N}(\omega)|^{2n},\ \beta_2 \cdot |X(f,\omega)|^{2n}\}, \tag{11}$$

where $\hat{N}(\omega)$ is the mean of the noise spectrum $N(f, \omega)$, and $\hat{X}_N(f, \omega)$ is the spectrum obtained by subtracting the estimated mean noise spectrum $\hat{N}(\omega)$ from the spectrum of the observed speech<sup>1</sup>. In this chapter, we set the parameter $\beta_1$ equal to $\beta_2$.

<sup>1</sup> In this study, stationary noise is assumed.
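A minimal sketch of the denoising stage of Eq. (11) follows; the mean noise spectrum is assumed to be estimated beforehand, e.g., from non-speech frames under the stationary-noise assumption, and the names are illustrative:

```python
# Sketch of Eq. (11): generalized spectral subtraction of the (stationary)
# mean noise spectrum. The denoised |X_N| then replaces |X| in the
# late-reverberation subtraction of Eq. (10) (cf. the dereverb_gss sketch above).
import numpy as np

def denoise_gss(X_mag, N_mean_mag, n=0.1, alpha2=1.0, beta2=0.15):
    X2n = X_mag ** (2 * n)
    N2n = N_mean_mag ** (2 * n)        # mean noise spectrum, e.g. from non-speech frames
    XN2n = np.maximum(X2n - alpha2 * N2n, beta2 * X2n)
    return XN2n ** (1.0 / (2 * n))
```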

## **3. Compensation parameter estimation for spectral subtraction by multi-channel LMS algorithm**

### **3.1. Blind channel identification in time domain**

#### *3.1.1. Identifiability and principle*

An adaptive multi-channel LMS algorithm for blind Single-Input Multiple-Output (SIMO) system identification in the time domain was proposed in [13, 14].


Before introducing the MCLMS algorithm for blind channel identification, we state when a SIMO system is *blindly identifiable*. A multi-channel FIR (Finite Impulse Response) system can be blindly identified primarily because of the channel diversity. As an extreme counter-example, if all channels of a SIMO system are identical, the system reduces to a Single-Input Single-Output (SISO) system, becoming unidentifiable. In addition, the source signal needs to have sufficient modes to make the channels fully excited. The following two assumptions are made to guarantee an identifiable system:

1. The polynomials formed from $h_n$, $n = 1, 2, \cdots, N$, where $h_n$ is the *n*-th impulse response and $N$ is the channel number, are co-prime<sup>2</sup>, i.e., the channel transfer functions $H_n(z)$ do not share any common zeros;
2. The autocorrelation matrix $\mathbf{R}_{ss} = E\{s(k)s^T(k)\}$ of the input signal is of full rank (such that the SIMO system can be fully excited).

<sup>2</sup> In mathematics, the integers *a* and *b* are said to be co-prime if they have no common factor other than 1, or equivalently, if their greatest common divisor is 1.
In the following, these two conditions are assumed to hold, so that we will be dealing with a blindly identifiable FIR SIMO system.

In the absence of additive noise, we can take advantage of the fact that

$$x_i * h_j = s * h_i * h_j = x_j * h_i, \quad i, j = 1, 2, \cdots, N,\ i \neq j, \tag{12}$$


and have the following relation at time *t*:

$$\mathbf{x}\_i^T(t)\mathbf{h}\_j(t) = \mathbf{x}\_j^T(t)\mathbf{h}\_i(t), \quad i, j = 1, 2, \cdots, N, i \neq j,\tag{13}$$

where **h***i*(*t*) is the *i-th* impulse response at time *t* and

$$\mathbf{x}_n(t) = \begin{bmatrix} x_n(t) & x_n(t-1) & \cdots & x_n(t-L+1) \end{bmatrix}^T, \quad n = 1, 2, \cdots, N, \tag{14}$$

where $\mathbf{x}_n(t)$ is the speech signal received from the *n*-th channel at time *t* and *L* is the number of taps of the impulse response. Multiplying Eq. (13) by $\mathbf{x}_n(t)$ and taking the expectation yields

$$\mathbf{R}_{x_i x_i}(t+1)\mathbf{h}_j(t) = \mathbf{R}_{x_i x_j}(t+1)\mathbf{h}_i(t), \quad i, j = 1, 2, \cdots, N,\ i \neq j, \tag{15}$$

where $\mathbf{R}_{x_i x_j}(t+1) = E\{\mathbf{x}_i(t+1)\mathbf{x}_j^T(t+1)\}$. Eq. (15) comprises $N(N-1)$ distinct equations. By summing up the $N-1$ cross relations associated with one particular channel $\mathbf{h}_j(t)$, we get

$$\sum_{i=1, i \neq j}^{N} \mathbf{R}_{x_i x_i}(t+1)\mathbf{h}_j(t) = \sum_{i=1, i \neq j}^{N} \mathbf{R}_{x_i x_j}(t+1)\mathbf{h}_i(t), \quad j = 1, 2, \cdots, N. \tag{16}$$

Over all channels, we then have a total of *N* equations. In matrix form, this set of equations is written as:

$$\mathbf{R}\_{x+}(t+1)\mathbf{h}(t) = 0,\tag{17}$$

where



$$\mathbf{R}_{x+}(t+1) = \begin{bmatrix} \sum_{n \neq 1}\mathbf{R}_{x_n x_n}(t+1) & -\mathbf{R}_{x_2 x_1}(t+1) & \cdots & -\mathbf{R}_{x_N x_1}(t+1) \\ -\mathbf{R}_{x_1 x_2}(t+1) & \sum_{n \neq 2}\mathbf{R}_{x_n x_n}(t+1) & \cdots & -\mathbf{R}_{x_N x_2}(t+1) \\ \vdots & \vdots & \ddots & \vdots \\ -\mathbf{R}_{x_1 x_N}(t+1) & -\mathbf{R}_{x_2 x_N}(t+1) & \cdots & \sum_{n \neq N}\mathbf{R}_{x_n x_n}(t+1) \end{bmatrix}, \tag{18}$$


$$\mathbf{h}(t) = \begin{bmatrix} \mathbf{h}\_1(t)^T & \mathbf{h}\_2(t)^T & \cdots & \mathbf{h}\_N(t)^T \end{bmatrix}^T,\tag{19}$$

$$\mathbf{h}\_{\rm n}(t) = \begin{bmatrix} h\_{\rm n}(t,0) & h\_{\rm n}(t,1) & \cdots & h\_{\rm n}(t,L-1) \end{bmatrix}^{T},\tag{20}$$

where *hn*(*t*, *l*) is the *l-th* tap of the *n-th* impulse response at time *t*. If the SIMO system is blindly identifiable, the matrix **R***x*+ is rank deficient by 1 (in the absence of noise) and the channel impulse responses can be uniquely determined.

When the estimate of the channel impulse responses deviates from the true value, an error vector at time $t+1$ is produced:

$$\mathbf{e}(t+1) = \tilde{\mathbf{R}}\_{\mathbf{x}+}(t+1)\hat{\mathbf{h}}(t),\tag{21}$$

where $\tilde{\mathbf{R}}_{x_i x_j}(t+1) = \mathbf{x}_i(t+1)\mathbf{x}_j^T(t+1)$, $i, j = 1, 2, \cdots, N$, and $\hat{\mathbf{h}}(t)$ is the estimated model filter at time *t*. Here we put a tilde on $\tilde{\mathbf{R}}_{x_i x_j}$ to distinguish this instantaneous value from its mathematical expectation $\mathbf{R}_{x_i x_j}$; accordingly, $\tilde{\mathbf{R}}_{x+}(t+1)$ is the instantaneous counterpart of Eq. (18):

$$\tilde{\mathbf{R}}_{x+}(t+1) = \begin{bmatrix} \sum_{n \neq 1}\tilde{\mathbf{R}}_{x_n x_n}(t+1) & -\tilde{\mathbf{R}}_{x_2 x_1}(t+1) & \cdots & -\tilde{\mathbf{R}}_{x_N x_1}(t+1) \\ -\tilde{\mathbf{R}}_{x_1 x_2}(t+1) & \sum_{n \neq 2}\tilde{\mathbf{R}}_{x_n x_n}(t+1) & \cdots & -\tilde{\mathbf{R}}_{x_N x_2}(t+1) \\ \vdots & \vdots & \ddots & \vdots \\ -\tilde{\mathbf{R}}_{x_1 x_N}(t+1) & -\tilde{\mathbf{R}}_{x_2 x_N}(t+1) & \cdots & \sum_{n \neq N}\tilde{\mathbf{R}}_{x_n x_n}(t+1) \end{bmatrix}. \tag{22}$$

This error can be used to define a cost function at time *t* + 1



$$J(t+1) = \|\mathbf{e}(t+1)\|^2 = \mathbf{e}(t+1)^T \mathbf{e}(t+1). \tag{23}$$

By minimizing the cost function *J* of Eq. (23), the impulse response is blindly derived. There are various methods to minimize the cost function *J*, for example, the constrained Multi-Channel LMS (MCLMS) algorithm, the constrained Multi-Channel Newton (MCN) algorithm, and the Variable Step-Size Unconstrained MCLMS (VSS-UMCLMS) algorithm [12, 14]. Among these methods, the VSS-UMCLMS achieves a good balance between complexity and convergence speed [14]. Moreover, the VSS-UMCLMS is more practical and much easier to use since the step size does not have to be specified in advance. Therefore, in this chapter, we apply the VSS-UMCLMS algorithm to identify the multi-channel impulse responses.

#### *3.1.2. Variable step-size unconstrained multi-channel LMS algorithm in time domain*

As the adaptation proceeds, the cost function $J(t+1)$ at time $t+1$ diminishes; its gradient with respect to $\hat{\mathbf{h}}(t)$ can be approximated as

$$\Delta J(t+1) \approx \frac{2\tilde{\mathbf{R}}_{x+}(t+1)\hat{\mathbf{h}}(t)}{\|\hat{\mathbf{h}}(t)\|^2} \tag{24}$$


and the model filter **h**ˆ(*t* + 1) at time *t* + 1 is

$$
\hat{\mathbf{h}}(t+1) = \hat{\mathbf{h}}(t) - 2\mu \tilde{\mathbf{R}}\_{\mathbf{x}+}(t+1)\hat{\mathbf{h}}(t), \tag{25}
$$

which is theoretically equivalent to the adaptive algorithm proposed by [2], although the cost functions are defined in different ways in these two adaptive blind SIMO identification algorithms. In Eq. (25), *μ* is the step size of the multi-channel LMS.

With such a simplified adaptive algorithm, the primary concern is whether it would converge to the trivial all-zero estimate. Fortunately this will not happen as long as the initial estimate **h**ˆ(0) is not orthogonal to the true channel impulse response vector **h** [2].

Finally, an optimal step size for the unconstrained MCLMS at time *t* + 1 is obtained by

$$\mu_{opt}(t+1) = \frac{\hat{\mathbf{h}}^T(t)\Delta J(t+1)}{\|\Delta J(t+1)\|^2}. \tag{26}$$


The details of the VSS-UMCLMS were described in [14].
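The sketch below illustrates one adaptation step in the spirit of Eqs. (21)-(26): the error vector is built from instantaneous cross-correlations, the normalized gradient of Eq. (24) is combined with the step size of Eq. (26), and, purely for the stability of this toy demo, the estimate is renormalized each step. It is a loose illustration, not the reference implementation of [14].

```python
# One illustrative VSS-UMCLMS-style step. h_hat, frames: (N, L) arrays,
# where frames[n] = x_n(t+1) = [x_n(t+1), ..., x_n(t-L+2)].
import numpy as np

def vss_umclms_step(h_hat, frames):
    N, L = h_hat.shape
    e = np.zeros((N, L))                          # e(t+1) = R~_x+(t+1) h_hat, Eq. (21)
    for j in range(N):
        for i in range(N):
            if i == j:
                continue
            # (R~_{x_i x_i} h_j - R~_{x_i x_j} h_i), with R~_{x_i x_j} = x_i x_j^T
            e[j] += frames[i] * (frames[i] @ h_hat[j]) \
                  - frames[i] * (frames[j] @ h_hat[i])
    grad = 2.0 * e / np.sum(h_hat ** 2)           # Eq. (24)
    g = grad.ravel()
    mu = (h_hat.ravel() @ g) / (g @ g + 1e-12)    # Eq. (26)
    return h_hat - mu * grad                      # cf. Eq. (25)

# Toy SIMO demo: one source observed through two random 4-tap channels.
rng = np.random.default_rng(0)
src = rng.standard_normal(5000)
h_true = rng.standard_normal((2, 4))
y = np.stack([np.convolve(src, h_true[n])[:5000] for n in range(2)])

h_hat = rng.standard_normal((2, 4))
h_hat /= np.linalg.norm(h_hat)
for t in range(4, 5000):
    h_hat = vss_umclms_step(h_hat, y[:, t:t-4:-1])
    h_hat /= np.linalg.norm(h_hat)                # demo-only renormalization
# h_hat should ideally approach h_true up to a common scale (sign) factor.
```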

### **3.2. Extending VSS-UMCLMS algorithm to compensation parameter estimation for spectral subtraction**

To blindly estimate the compensation parameter (that is, the spectrum of the impulse response), we extend the MCLMS algorithm described in Section 3.1 from the time domain to the frequency domain in this section.

The spectrum of the distorted signal is the convolution of the spectrum of the clean speech and that of the impulse response, as shown in Eq. (4). The spectrum of the impulse response depends on the frequency *ω*; the variable *ω* is omitted below for simplicity. Thus, in the absence of additive noise, the spectra of the distorted signals have the following relation at frame *f* in the frequency domain:

$$\mathbf{X}_i^T(f)\mathbf{H}_j(f) = \mathbf{X}_j^T(f)\mathbf{H}_i(f), \quad i, j = 1, 2, \ldots, N,\ i \neq j, \tag{27}$$

where $\mathbf{X}_n(f) = [X_n(f)\ X_n(f-1)\ \ldots\ X_n(f-D+1)]^T$ is a *D*-dimensional vector of spectra of the distorted speech received from the *n*-th channel at frame *f*, $X_n(f)$ is the spectrum of the distorted speech received from the *n*-th channel at frame *f* for frequency *ω*, $\mathbf{H}_n(f) = [H_n(f,0)\ H_n(f,1)\ \ldots\ H_n(f,d)\ \ldots\ H_n(f,D-1)]^T$, $d = 0, 1, \ldots, D-1$, is a *D*-dimensional vector of spectra of the impulse response, and $H_n(f,d)$ is the spectrum of the impulse response for frequency *ω* at frame *f* corresponding to frame delay *d* (that is, at frame $f+d$).

Using Eq. (27) in place of Eq. (13), the spectra of the impulse responses can be blindly estimated by the VSS-UMCLMS mentioned in Section 3.1.2.
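Conceptually, the update is applied independently per frequency bin, with the tap vector of the last *D* frame spectra playing the role of $\mathbf{x}_n(t)$. The sketch below uses a fixed small step for clarity and a standard complex (Wirtinger) gradient of the cross-relation cost; the variable step size carries over as in Section 3.1.2, and all names are illustrative.

```python
# Per-bin LMS step on the cross-relations of Eq. (27). H, X: (N, D) complex
# arrays holding H_n(f) and X_n(f) for one frequency bin. Illustrative only.
import numpy as np

def mclms_step_freq(H, X, mu=1e-3):
    N, D = H.shape
    g = np.zeros_like(H)
    for j in range(N):
        for i in range(N):
            if i == j:
                continue
            err = X[i] @ H[j] - X[j] @ H[i]   # scalar cross-relation error
            g[j] += np.conj(X[i]) * err       # gradient wrt conj(H_j)
    return H - 2 * mu * g

# Per utterance: for each bin, slide f over frames, set X[n] = [X_n(f), ...,
# X_n(f - D + 1)], and iterate; |H_n(d, w)| then feeds Eq. (7)/(8).
```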

## **4. Experiments**

### **4.1. Experimental setup**

The proposed dereverberation method based on spectral subtraction is evaluated on an isolated word recognition task in a simulated reverberant environment, and a large vocabulary continuous speech recognition task in both a simulated reverberant environment and a real reverberant environment, respectively.

**Figure 3.** Illustration of microphone array.




**(a) RWCP database**

| Array no. | Array type | Room | Angle | RT60 (s) |
|---|---|---|---|---|
| 1 | linear | Echo room (panel) | 150° | 0.30 |
| 2 | circle | Echo room (cylinder) | 30° | 0.38 |
| 3 | linear | Tatami-floored room (S) | 120° | 0.47 |
| 4 | circle | Tatami-floored room (S) | 120° | 0.47 |
| 5 | circle | Tatami-floored room (L) | 90° | 0.60 |
| 6 | circle | Tatami-floored room (L) | 130° | 0.60 |
| 7 | linear | Conference room | 50° | 0.78 |
| 8 | linear | Echo room (panel) | 70° | 1.30 |

**(b) CENSREC-4 database**

| Array no. | Room | Room size | RT60 (s) |
|---|---|---|---|
| 9 | Office | 9.0 m × 6.0 m | 0.25 |
| 10 | Japanese style room | 3.5 m × 2.5 m | 0.40 |
| 11 | Lounge | 11.5 m × 27.0 m | 0.50 |
| 12 | Japanese style bath | 1.5 m × 1.0 m | 0.60 |
| 13 | Living room | 7.0 m × 3.0 m | 0.65 |
| 14 | Meeting room | 7.0 m × 8.5 m | 0.65 |
| 15 | Elevator hall | 11.5 m × 6.5 m | 0.75 |

RT60 (s): reverberation time in the room; S: small; L: large.

**Table 1.** Details of recording conditions for impulse response measurement

#### *4.1.1. Experimental setup for isolated word recognition task*

Multi-channel distorted speech signals simulated by convolving multi-channel impulse responses with clean speech were used to create artificial reverberant speech. Six kinds of multi-channel impulse responses measured in various acoustical reverberant environments were selected from the Real World Computing Partnership (RWCP) sound scene database [23]. Table 1 lists the details of the recording conditions (impulse responses with array no. 3-8 in the RWCP database were used in the isolated word recognition task). The illustration of the microphone array is shown in Fig. 3. A four-channel circle or linear microphone array was taken from a circle + linear microphone array (30 channels). The four-channel circle type microphone array had a diameter of 30 cm, and the four microphones were located at equal 90° intervals. The four microphones of the linear microphone array were located at 11.32 cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. The sampling frequency was 48 kHz.

For clean speech, 20 male speakers each uttered 100 isolated words with a close microphone. The 100 isolated words were phonetically balanced common isolated words selected from the Tohoku University and Panasonic isolated spoken word database [21]. The average duration of the utterances was about 0.6 s. The sampling frequency was 12 kHz. The impulse responses sampled at 48 kHz were downsampled to 12 kHz so that they could be convolved with the clean speech. The frame length was 21.3 ms, and the frame shift was 8 ms with a 256-point Hamming window. Then, 116 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs [22]) were trained using 27,992 utterances read by 175 male speakers from the Japanese Newspaper Article Sentences (JNAS) corpus [15]. Each continuous-density HMM had five states, four of which had probability density functions (pdfs) of output probability. Each pdf consisted of four Gaussians with full-covariance matrices. The acoustic model was common to the baseline and proposed methods, and it was trained in a clean condition. The feature space comprised 10 mel-frequency cepstral coefficients. First- and second-order derivatives of the cepstra plus first and second derivatives of the power component were also included (32 feature parameters in total).

The number of reverberant windows *D* in Eq. (4) was set to eight, which was empirically determined. In general, the window size *D* is proportional to RT60. However, the window size *D* is also affected by the reverberation property, for example, the ratio of the power of the late reverberation to that of the early reverberation. In our preliminary experiment with partial test data, the performance of our proposed method with a window size *D* = 2 to 16 significantly outperformed the baseline, and the window size *D* = 8 achieved the best result. Automatic estimation of the optimum window size *D* is left as future work. The length of the Hamming window for the discrete Fourier transform was 256 points (21.3 ms), and the rate of overlap was 1/2. An illustration of the analysis window is shown in Fig. 4. For the proposed dereverberation based on spectral subtraction, the previous clean power spectra estimated with a skip window were used to estimate the current clean power spectrum<sup>3</sup>. The spectrum of the impulse response $\hat{H}(d, \omega)$ was estimated using the corresponding utterance to be recognized, with an average duration of about 0.6 s. No special parameters such as over-subtraction parameters were used in spectral subtraction (*α* = 1), except that the subtracted value was controlled so that it did not become negative (*β* = 0.15). The speech recognition performance for clean isolated words was 96.0%.

**Figure 4.** Illustration of the analysis window for spectral subtraction.

<sup>3</sup> Eq. (27) is true when using a skip window and the spectrum of the impulse response can be blindly estimated.


**Table 2.** Conditions for recording in real environment.

subtracted value was controlled so that it did not become negative (*β* = 0.15). The speech recognition performance for clean isolated words was 96.0%.
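Eq. (4) itself appears earlier in the chapter and is not reproduced in this excerpt; the sketch below encodes one natural reading of the recursion described above, in which the late reverberation in each analysis window is predicted from the *D* previously estimated clean power spectra weighted by the estimated impulse-response spectra, then subtracted with α = 1 and floored at β = 0.15. It is a sketch of that reading, not the chapter's exact formulation.

```python
import numpy as np

def power_ss_dereverb(X_pow, H_pow, alpha=1.0, beta=0.15):
    """Power-SS dereverberation recursion (assumed form of Eq. (4)/(7)).

    X_pow : (T, F) power spectrogram of reverberant speech
    H_pow : (D, F) estimated power spectra of the impulse response,
            H_pow[d] corresponding to a delay of d+1 analysis windows
    """
    D = H_pow.shape[0]
    S_pow = np.empty_like(X_pow)
    for t in range(X_pow.shape[0]):
        # Late reverberation predicted from previously estimated clean spectra.
        late = sum(H_pow[d] * S_pow[t - d - 1]
                   for d in range(D) if t - d - 1 >= 0)
        # Subtract, flooring at beta * |X|^2 so the result never goes negative.
        S_pow[t] = np.maximum(X_pow[t] - alpha * late, beta * X_pow[t])
    return S_pow
```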

#### *4.1.2. Experimental setup for LVCSR task*


In this study, both artificial reverberant speech and real reverberant speech were used to evaluate the proposed method.

For artificial reverberant speech, multi-channel distorted speech signals simulated by convolving multi-channel impulse responses with clean speech were used. Fifteen kinds of multi-channel impulse responses measured in various reverberant acoustical environments were selected from the Real World Computing Partnership (RWCP) sound scene database [23] and the CENSREC-4 database [24]. Table 1 lists the details of the 15 recording conditions. An illustration of the microphone array is shown in Fig. 3. For the RWCP database, a 2–8 channel circle or linear microphone array was taken from a circle + linear microphone array (30 channels); the circle-type microphone array had a diameter of 30 cm, and the microphones of the linear microphone array were located at 2.83 cm intervals. Impulse responses were measured at several positions 2 m from the microphone array. For the CENSREC-4 database, 2 or 4 channel microphones were taken from a linear microphone array (7 channels), with adjacent microphones located at 2.125 cm intervals; impulse responses were measured at several positions 0.5 m from the microphone array. The Japanese Newspaper Article Sentences (JNAS) corpus [15] was used as clean speech. One hundred utterances from the JNAS database convolved with the multi-channel impulse responses shown in Table 1 were used as test data. The average duration of the utterances was about 5.8 s.
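The simulation step is straightforward resampling and convolution; a minimal sketch under the stated sampling conditions follows, with file names purely illustrative placeholders.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly, fftconvolve

# File names are illustrative placeholders, not names from the databases.
h, fs_h = sf.read("impulse_response_48k.wav")   # (N, channels), measured at 48 kHz
s, fs_s = sf.read("clean_utterance.wav")        # clean JNAS speech

# Downsample the impulse responses to the speech sampling rate.
h = resample_poly(h, up=fs_s, down=fs_h, axis=0)

# Convolve clean speech with each channel's impulse response.
x = np.stack([fftconvolve(s, h[:, c]) for c in range(h.shape[1])], axis=1)
```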

For reverberant speech in a real environment, we recorded multi-channel speech degraded simultaneously by background noise and reverberation. Table 2 gives the conditions and content of the recordings. One hundred utterances from the JNAS corpus, uttered by five male speakers seated on the chairs labeled A to E in Fig. 5, were recorded by a multi-channel recording device. The heights of the microphone array and the utterance position of each speaker were about 0.8 m and 1.0 m, respectively. An electric fan with high air volume located behind the speaker in position A was used as background noise. The average SNR of the speech was about 18 dB. We used a microphone array with 9 channels (Fig. 5) and a pin microphone to record speech in the distant-talking environment and close-talking environment, respectively.

**Figure 5.** Illustration of recording settings and microphone array in real environment.

**Table 2.** Conditions for recording in real environment.

| Item | Condition |
|---|---|
| Recording device | TD-BD-16ADUSB |
| Microphones | 9-channel microphone array and a pin microphone |
| Speakers | five males (positions A to E) |
| Utterances | 100 JNAS utterances (about 20 utterances per speaker) |
| Background noise | electric fan with high air volume |
| Average SNR | about 18 dB |




**Table 3.** Conditions for large vocabulary continuous speech recognition.

| Condition | Value |
|---|---|
| Sampling frequency | 16 kHz |
| Frame length | 25 ms |
| Frame shift | 10 ms |
| Feature space | 25 dimensions with CMN (12 MFCCs + Δ + Δpower) |
| Acoustic model | left-to-right triphone HMMs, 5 states, 3 output probability distributions |

**Table 4.** Conditions for SS-based denoising and dereverberation. "DN": denoising. "DR": dereverberation.

| Parameter | Power SS | GSS |
|---|---|---|
| Analysis window | Hamming, length 32 ms, shift 16 ms | Hamming, length 32 ms, shift 16 ms |
| Noise overestimation factor *α* | *α*2 = 3.0 (DN), *α*1 = 1.0 (DR) | *α*1 = *α*2 = 0.1 |
| Spectral floor parameter *β* | *β*1 = *β*2 = 0.15 | *β*1 = *β*2 = 0.15 |

Table 3 gives the conditions for speech recognition. The acoustic models were trained on the ASJ speech databases of phonetically balanced sentences (ASJ-PB) and the JNAS corpus; in total, around 20K sentences (clean speech) uttered by 132 speakers were used for each gender. Table 4 gives the conditions for SS-based denoising and dereverberation; the parameters shown in Table 4 were determined empirically. For the SS-based dereverberation method without background noise, the parameter *α* was equal to *α*1 and *β* was equal to *β*1. The number of reverberant windows *D* was set to 6 (192 ms). An illustration of the analysis window is shown in Fig. 4. The open-source LVCSR decoder "Julius" [19], which is based on word trigrams and triphone context-dependent HMMs, was used.

#### **4.2. Experimental results**

#### *4.2.1. Isolated word recognition results*

Table 5 shows the isolated word recognition results in a simulated reverberant environment. "Distorted speech #" in Table 5 corresponds to "array no" in Table 1. Delay-and-sum beamforming [27] was performed for all methods in this chapter, and the conventional CMN combined with delay-and-sum beamforming was used as the baseline.


**Table 5.** Isolated word recognition results (%). Delay-and-sum beamforming was performed for all methods.

| Distorted speech # | CMN | Power SS-based dereverberation |
|---|---|---|
| 1 | 69.4 | 76.0 |
| 2 | 73.2 | 80.6 |
| 3 | 71.4 | 80.3 |
| 4 | 71.8 | 78.6 |
| 5 | 67.7 | 74.4 |
| 6 | 63.1 | 71.2 |
| Ave. | 69.4 | 76.9 |




The power SS-based dereverberation method by Eq. (7) improved speech recognition significantly compared with CMN for all severe reverberant conditions. The reason was that the proposed method compensated for both the late and early reverberation. The proposed method achieved an average relative error reduction rate of 24.5% in relation to conventional CMN with beamforming.
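As a worked check of this figure against the averages in Table 5, with word error taken as 100 minus accuracy:

$$
\mathrm{RERR} = \frac{(100 - 69.4) - (100 - 76.9)}{100 - 69.4} = \frac{30.6 - 23.1}{30.6} \approx 24.5\%.
$$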

#### *4.2.2. LVCSR results*

#### **(a) Effect factor analysis of power SS-based dereverberation in the simulated reverberant environment**

Unless otherwise stated, four microphones were used in this section to estimate the spectra of the impulse responses. Delay-and-sum beamforming (BF) was performed on the 4-channel dereverberant speech signals; for the proposed method, each speech channel was compensated by the corresponding estimated impulse response. Preliminary experimental results for isolated word recognition showed that the power SS-based dereverberation method significantly improved the speech recognition performance compared with traditional CMN with beamforming. In this section, we evaluated the power SS-based dereverberation method for LVCSR and analyzed the effect factors (number of reverberation windows D in Eq. (7), number of channels, and length of utterance) for compensation parameter estimation based on power SS using the RWCP database. The word accuracy rate for LVCSR with clean speech was 92.6%.
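Delay-and-sum beamforming itself reduces to aligning and averaging the channels. The chapter follows [27], whose exact steering implementation is not reproduced in this excerpt; the following time-domain sketch, which estimates inter-channel delays by cross-correlation against a reference channel, is therefore an illustrative assumption rather than the chapter's implementation.

```python
import numpy as np

def delay_and_sum(x):
    """Time-domain delay-and-sum beamforming.

    x : (T, C) multi-channel signal (e.g. the dereverberated channels)
    Returns the (T,) beamformed signal.
    """
    T, C = x.shape
    ref = x[:, 0]
    out = np.zeros(T)
    for c in range(C):
        # Delay of channel c relative to the reference, from the
        # peak of the full cross-correlation (zero lag at index T-1).
        corr = np.correlate(x[:, c], ref, mode="full")
        delay = np.argmax(corr) - (T - 1)
        # Align the channel; np.roll wraps at the edges, which is
        # acceptable for a sketch but not for production use.
        out += np.roll(x[:, c], -delay)
    return out / C
```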



The effect of the number of reverberation windows on speech recognition is shown in Fig. 6, and the detailed results for different numbers of reverberation windows D and reverberant environments (that is, different reverberation times) are given in Table 6. The results in Fig. 6 and Table 6 were obtained without delay-and-sum beamforming. They show that the optimal number of reverberation windows D depends on the reverberation time. The best average result over all reverberant speech was obtained when D equals 6, and the speech recognition performance with the number of reverberation windows between 4 and 10 did not vary greatly and was significantly better than the baseline.

**Figure 6.** Effect of the number of reverberation windows D on power SS-based dereverberation for speech recognition.


**Table 6.** Detailed results for different numbers of reverberation windows D and reverberant environments (%). The results in bold indicate the best result for each array.

| Array no. | D = 2 | D = 4 | D = 6 | D = 8 | D = 10 |
|---|---|---|---|---|---|
| 1 | **81.45** | 80.43 | 79.94 | 79.67 | 79.98 |
| 2 | 43.89 | 55.71 | **57.69** | 54.06 | 51.98 |
| 3 | 23.40 | 32.02 | **33.46** | 33.29 | 32.81 |
| 4 | 28.77 | 38.42 | 39.69 | **39.88** | 38.92 |
| 5 | 22.89 | 30.26 | 33.34 | **33.59** | 31.71 |
| 6 | 21.01 | 27.46 | **31.79** | 31.32 | 28.97 |
| 7 | 15.89 | 20.55 | 23.32 | **23.92** | 22.54 |
| 8 | 14.26 | 17.94 | **21.41** | 21.12 | 20.24 |
| Ave. | 31.44 | 37.85 | 40.08 | 39.61 | 38.39 |






We analyzed the influence of the number of channels on parameter estimation and delay-and-sum beamforming. Besides four channels, two and eight channels were also used to estimate the compensation parameter and perform beamforming; the channel numbers listed in Table 7, corresponding to Fig. 3(a), were used. The results are shown in Fig. 7. The speech recognition performance of the SS-based dereverberation method without beamforming was hardly affected by the number of channels; that is, the compensation parameter estimation is robust to the number of channels. Combined with beamforming, the more channels were used, the better the speech recognition performance.

**Table 7.** Channel numbers corresponding to Fig. 3(a) used for dereverberation and denoising (RWCP database).

| | Linear array | Circle array |
|---|---|---|
| 2 channels | 17, 29 | 1, 9 |
| 4 channels | 17, 21, 25, 29 | 1, 5, 9, 13 |
| 8 channels | 17, 19, 21, 23, 25, 27, 29, 30 | 1, 3, 5, 7, 9, 11, 13, 15, 17 |

**Figure 7.** Effect of the number of channels on power SS-based dereverberation for speech recognition.

Thus far, the whole utterance has been used to estimate the compensation parameter. The effect of the length of utterance used for parameter estimation was investigated, with the results shown in Fig. 8. The longer the utterance used, the better the speech recognition performance. No deterioration in speech recognition was observed when the length of utterance used for parameter estimation was greater than 1 s, and the SS-based dereverberation method outperformed the baseline even when only 0.1 s of the utterance was used to estimate the compensation parameter.

**Figure 8.** Effect of length of utterance used for parameter estimation on power SS-based dereverberation for speech recognition.

We also compared the power SS-based dereverberation method on LVCSR in different simulated reverberant environments; the experimental results are shown in Fig. 9. Naturally, the speech recognition rate deteriorated as the reverberation time increased. Using the SS-based dereverberation method, the reduction in the speech recognition rate was smaller than with conventional CMN, especially for impulse responses with a long reverberation time. For the RWCP database, the SS-based dereverberation method achieved a relative word recognition error reduction rate of 19.2% compared to CMN with delay-and-sum beamforming. We also conducted an LVCSR experiment with SS-based dereverberation under different reverberant conditions (CENSREC-4), with reverberation times between 0.25 and 0.75 s and a microphone-to-source distance of 0.5 m. A similar trend to the above results was observed. Therefore, the SS-based dereverberation method is robust to various reverberant conditions for both isolated word recognition and LVCSR. The reason is that the SS-based dereverberation method can compensate for late reverberation through SS using an estimated power spectrum of the impulse response.

**Figure 9.** Word accuracy for LVCSR in different simulated reverberant environments: (a) RWCP database; (b) CENSREC-4 database.






#### **(b) Results of GSS-based method in the simulated reverberant environment**

In this section, reverberation and noise suppression using only two speech channels is described. In both the power SS-based and GSS-based dereverberation methods, speech signals from two microphones were used to blindly estimate the compensation parameters for the power SS and GSS (that is, the spectra of the channel impulse responses); reverberation was then suppressed by SS, and the spectrum of the dereverberant speech was converted back into the time domain. Finally, delay-and-sum beamforming was performed on the two-channel dereverberant speech.
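Eqs. (8), (10), and (11) are given earlier in the chapter and not reproduced in this excerpt. As a sketch of the idea, the generalized spectral subtraction of [26] performs the subtraction in an exponent domain rather than the power domain, so the GSS-based dereverberation plausibly takes a form such as the following, with exponent 2γ (2γ = 2 recovering power SS); the exact exponent and parameterization of the chapter's Eq. (8) are assumptions here:

$$
|\hat S(t,\omega)|^{2\gamma} = \max\!\Big( |X(t,\omega)|^{2\gamma} - \alpha \sum_{d=1}^{D} |\hat H(d,\omega)|^{2\gamma}\, |\hat S(t-d,\omega)|^{2\gamma},\; \beta\, |X(t,\omega)|^{2\gamma} \Big).
$$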

The results of the power SS-based method and the GSS-based method without background noise are compared in Table 8. "Distorted speech #" in Table 8 corresponds to "array no" in Table 1. The speech recognition performance was drastically degraded under reverberant conditions because the conventional CMN did not suppress the late reverberation. Delay-and-sum beamforming with CMN (41.91%) could not markedly improve the speech recognition performance because of the small number of microphones and the small distance between the microphone pair. In contrast, the power SS-based dereverberation using Eq. (7) markedly improved the speech recognition performance, and the GSS-based dereverberation using Eq. (8) improved it significantly compared with both the power SS-based dereverberation and CMN for all reverberant conditions. The GSS-based method achieved an average relative word error reduction rate of 31.4% compared to the conventional CMN and 9.8% compared to the power SS-based method.

**Table 8.** Comparison of word accuracy for LVCSR with the power SS-based method and the GSS-based method in the simulated reverberant environment (%). Delay-and-sum beamforming was performed for all methods.

| Distorted speech # | CMN only | Power SS-based method | GSS-based method |
|---|---|---|---|
| 1 | 44.35 | 63.34 | 65.95 |
| 2 | 27.59 | 40.79 | 49.16 |
| 3 | 25.61 | 42.55 | 49.29 |
| 4 | 73.90 | 79.26 | 80.77 |
| 5 | 27.06 | 42.28 | 45.38 |
| 6 | 29.62 | 50.78 | 56.13 |
| 7 | 65.24 | 71.67 | 74.35 |
| Ave. | 41.91 | 55.81 | 60.15 |

Table 9 shows the speech recognition results for the power SS and GSS-based denoising and dereverberation methods for the simulated noisy and reverberant speech. "Distorted speech #", "DN" and "DNR" in Table 9 denote the "array #" in Table 1, "denoising", and "denoising and dereverberation", respectively. The speech recognition performance of conventional CMN was drastically degraded owing to the noisy and reverberant conditions and the fact that CMN did not suppress the late reverberation. The power SS-based DN improved the speech recognition performance significantly compared to CMN for all reverberant conditions. The GSS-based DN using Eq. (11), however, did not improve the speech recognition performance compared to the power SS-based DN. On the other hand, the power SS-based DNR achieved a marked improvement in the speech recognition performance compared with that of CMN, and the GSS-based DNR using Eq. (10) improved the speech recognition performance significantly compared to both the CMN method and the power SS-based DNR for almost all reverberant conditions.

**Table 9.** Word accuracy for LVCSR with the simulated noisy reverberant speech (%). Delay-and-sum beamforming was performed for all methods.

| Distorted speech # | CMN only | Power SS DN | Power SS DNR | GSS DN | GSS DNR |
|---|---|---|---|---|---|
| 1 | 28.2 | 37.4 | 48.8 | 30.3 | 48.3 |
| 2 | 16.0 | 25.9 | 33.5 | 18.8 | 36.3 |
| 3 | 9.5 | 21.3 | 31.3 | 13.9 | 32.8 |
| 4 | 55.8 | 72.2 | 69.9 | 60.4 | 68.2 |
| 5 | 17.2 | 24.4 | 32.0 | 20.9 | 37.7 |
| 6 | 26.1 | 32.8 | 45.3 | 30.0 | 51.7 |
| 7 | 54.4 | 64.6 | 66.5 | 57.7 | 68.8 |
| Average | 29.6 | 39.8 | 46.7 | 33.1 | 49.1 |



#### **(c) Results in the real noisy reverberant environment**

Table 10 shows the speech recognition results for the real noisy reverberant speech under the same conditions as the simulated noisy reverberant speech. The word accuracy rate for close-talking speech recorded in the real environment was 88.3%. We investigated the best channel combination in the real environment; the best speech recognition performance was obtained when channels 6, 7, 8, and 9 described in Fig. 5 were used, so this channel combination was used in this study. Power SS-based DN and GSS-based DN achieved a smaller improvement in recognition performance than in the simulated noisy reverberant environment because the type of background noise in the real environment was different from that in the simulated environment. On the other hand, the power SS-based DNR markedly improved the speech recognition performance compared to CMN, and the GSS-based DNR improved it significantly compared to both the CMN method and the power SS-based DNR for almost all speakers. The GSS-based DNR achieved average relative word error reduction rates of 39.1% and 11.5% compared to conventional CMN and the power SS-based DNR, respectively. These results show that our proposed method is also effective in a real environment under the same denoising and dereverberation conditions as the simulated noisy reverberant environment.

**Table 10.** Word accuracy for LVCSR with the real noisy reverberant speech (%). Delay-and-sum beamforming was performed for all methods.

| Speaker / Position | CMN only | Power SS DN | Power SS DNR | GSS DN | GSS DNR |
|---|---|---|---|---|---|
| A | 60.2 | 67.7 | 78.9 | 64.7 | 79.5 |
| B | 75.6 | 72.2 | 78.5 | 72.5 | 83.2 |
| C | 67.4 | 63.2 | 69.4 | 66.7 | 77.5 |
| D | 59.1 | 53.9 | 74.9 | 60.8 | 78.7 |
| E | 42.9 | 51.0 | 62.8 | 50.0 | 61.7 |
| Average | 60.9 | 61.6 | 73.1 | 62.9 | 76.2 |





## **5. Conclusion**

In this chapter, we proposed a blind dereverberation method based on spectral subtraction for hands-free speech recognition. We treated the late reverberation as additive noise, and a noise reduction technique based on spectral subtraction was applied to compensate for it, while the early reverberation was normalized by CMN. The time-domain MCLMS algorithm was extended to blindly estimate the spectrum of the impulse response for spectral subtraction in the frequency domain. We evaluated the proposed methods on an isolated word recognition task and an LVCSR task. The proposed spectral subtraction based on multi-channel LMS significantly outperformed the conventional CMN. For the isolated word recognition task, a relative error reduction rate of 24.5% in relation to the conventional CMN was achieved. For the LVCSR task without background noise, the proposed method achieved an average relative word error reduction rate of 31.5% compared to conventional CMN in the simulated reverberant environment. We also presented a denoising and dereverberation method based on spectral subtraction and evaluated it in both the simulated and the real noisy reverberant environments. The GSS-based method achieved average relative word error reduction rates of 39.1% and 11.5% compared to conventional CMN and the power SS-based method, respectively. These results show that the proposed method is also effective in a real noisy reverberant environment.
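For reference, the core of the cross-relation-based multi-channel LMS identification referred to here is developed in [12, 13] and in the earlier sections of the chapter; the sketch below illustrates only that core, driving the cross-relation error x_i * h_j − x_j * h_i toward zero for every channel pair. Step-size selection, normalization details, and the chapter's frequency-domain extension are simplified away, so this is an illustrative sketch rather than the exact algorithm.

```python
import numpy as np

def mclms_blind_id(x, L, mu=0.01, n_iter=1):
    """Cross-relation-based multi-channel LMS (a sketch after [12, 13]).

    x : (T, C) multi-channel observation of a single source
    L : assumed impulse-response length in samples
    Returns (L, C) estimated impulse responses with unit overall norm.
    """
    T, C = x.shape
    h = np.random.randn(L, C)
    h /= np.linalg.norm(h)                    # avoid the trivial zero solution
    for _ in range(n_iter):
        for n in range(L, T):
            seg = x[n - L:n][::-1]            # latest L samples, time-reversed
            grad = np.zeros_like(h)
            for i in range(C):
                for j in range(i + 1, C):
                    # Cross-relation: x_i * h_j should equal x_j * h_i.
                    e = seg[:, i] @ h[:, j] - seg[:, j] @ h[:, i]
                    grad[:, j] += e * seg[:, i]
                    grad[:, i] -= e * seg[:, j]
            h -= mu * grad
            h /= np.linalg.norm(h)            # re-impose the unit-norm constraint
    return h
```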

In this chapter, we also investigated the effect factors (numbers of reverberation windows and channels, and length of utterance) for compensation parameter estimation. We reached the following conclusions: 1) the speech recognition performance with the number of reverberation windows between 4 and 10 did not vary greatly and was significantly better than the baseline; 2) the compensation parameter estimation was robust to the number of channels; and 3) speech recognition did not degrade when the length of utterance used for parameter estimation was longer than 1 *s*. We also compared the SS-based dereverberation method on LVCSR in different simulated reverberant environments, and a similar trend was observed.

## **Author details**




Longbiao Wang, Kyohei Odani and Atsuhiko Kai *Shizuoka University, Japan*

Norihide Kitaoka *Nagoya University, Japan*

Seiichi Nakagawa *Toyohashi University of Technology, Japan*

## **6. References**

[1] Avendano, C. & Hermansky, H. (1996). Study on the dereverberation of speech based on temporal envelope filtering. *Proceedings of ICSLP-1996*, pp. 889-892, Philadelphia, USA, October 1996.

[2] Chen, H., Cao, X. & Zhu, J. (2002). Convergence of stochastic-approximation-based algorithms for blind channel identification. *IEEE Trans. Information Theory*, Vol. 48, 2002, pp. 1214-1225.

[3] Couvreur, L. & Couvreur, C. (2004). Blind model selection for automatic speech recognition in reverberant environments. *Journal of VLSI Signal Processing*, Vol. 36, No. 2-3, February/March 2004, pp. 189-203.

[4] Delcroix, M., Hikichi, T. & Miyoshi, M. (2006). On a blind speech dereverberation algorithm using multi-channel linear prediction. *IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences*, Vol. E89-A, No. 10, October 2006, pp. 2837-2846.

[5] Delcroix, M., Hikichi, T. & Miyoshi, M. (2007). Precise dereverberation using multi-channel linear prediction. *IEEE Transactions on Audio, Speech, and Language Processing*, Vol. 15, No. 2, February 2007, pp. 430-440.

[6] Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. *IEEE Trans. Acoustics, Speech and Signal Processing*, Vol. 29, No. 2, 1981, pp. 254-272.

[7] Gannot, S. & Moonen, M. (2003). Subspace methods for multimicrophone speech dereverberation. *EURASIP Journal on Applied Signal Processing*, October 2003, pp. 1074-1090.

[8] Gillespie, B. W., Malvar, H. S. & Florencio, D. A. F. (2001). Speech dereverberation via maximum-kurtosis subband adaptive filtering. *Proceedings of ICASSP-2001*, Vol. 6, pp. 3701-3704, Salt Lake City, USA, May 2001.

[9] Habets, E. A. P. (2004). Single-channel speech dereverberation based on spectral subtraction. *Proceedings of the 15th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC-2004)*, pp. 250-254, Veldhoven, Netherlands, November 2004.

[10] Hermansky, H. & Morgan, N. (1994). RASTA processing of speech. *IEEE Transactions on Speech and Audio Processing*, Vol. 2, No. 4, October 1994, pp. 578-589.

[11] Hermansky, H., Wan, E. A. & Avendano, C. (1995). Speech enhancement based on temporal processing. *Proceedings of ICASSP-1995*, pp. 405-408, Detroit, USA, May 1995.

[12] Huang, Y. & Benesty, J. (2002). Adaptive multichannel least mean square and Newton algorithms for blind channel identification. *Signal Processing*, Vol. 82, No. 8, August 2002, pp. 1127-1138.

[13] Huang, Y., Benesty, J. & Chen, J. (2005). Optimal step size of the adaptive multi-channel LMS algorithm for blind SIMO identification. *IEEE Signal Processing Letters*, Vol. 12, No. 3, March 2005, pp. 173-176.

[14] Huang, Y., Benesty, J. & Chen, J. (2006). *Acoustic MIMO Signal Processing*, Springer-Verlag, ISBN 978-3-540-37630-9, Berlin, Germany.

[15] Itou, K., Yamamoto, M., Takeda, K., Takezawa, T., Matsuoka, T., Kobayashi, T., Shikano, K. & Itahashi, S. (1999). JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research. *Journal of the Acoustical Society of Japan (E)*, Vol. 20, No. 3, May 1999, pp. 199-206.

[16] Jin, Q., Pan, Y. & Schultz, T. (2006). Far-field speaker recognition. *Proceedings of ICASSP-2006*, pp. 1133-1136, Toulouse, France, May 2006.

[17] Jin, Q., Schultz, T. & Waibel, A. (2007). Far-field speaker recognition. *IEEE Transactions on Audio, Speech, and Language Processing*, Vol. 15, No. 7, September 2007, pp. 2023-2032.

[18] Kinoshita, K., Delcroix, M., Nakatani, T. & Miyoshi, M. (2009). Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction. *IEEE Transactions on Audio, Speech, and Language Processing*, Vol. 17, No. 4, May 2009, pp. 534-545.

[19] Lee, A., Kawahara, T. & Shikano, K. (2001). Julius—an open source real-time large vocabulary recognition engine. *Proceedings of European Conference on Speech Communication and Technology*, September 2001, pp. 1691-1694.

[20] Maganti, H. & Matassoni, M. (2010). An auditory modulation spectral feature for reverberant speech recognition. *Proceedings of INTERSPEECH-2010*, pp. 570-573, Makuhari, Japan, September 2010.

[21] Makino, S., Niyada, K., Mafune, Y. & Kido, K. (1992). Tohoku University and Panasonic isolated spoken word database. *Journal of the Acoustical Society of Japan*, Vol. 48, No. 12, December 1992, pp. 899-905 (in Japanese).

[22] Nakagawa, S., Hanai, K., Yamamoto, K. & Minematsu, N. (1999). Comparison of syllable-based HMMs and triphone-based HMMs in Japanese speech recognition. *Proceedings of International Workshop on Automatic Speech Recognition and Understanding*, 1999, pp. 393-396.

[23] Nakamura, S., Hiyane, K., Asano, F. & Nishiura, T. (2000). Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. *Proceedings of LREC-2000*, pp. 965-971.

[24] Nakayama, M., Nishiura, T., Denda, Y., Kitaoka, N., Yamamoto, K., Yamada, T., Tsuge, S., Miyajima, C., Fujimoto, M., Takiguchi, T., Tamura, S., Ogawa, T., Matsuda, S., Kuroiwa, S., Takeda, K. & Nakamura, S. (2008). CENSREC-4: Development of evaluation framework for distant-talking speech recognition under reverberant environments. *Proceedings of INTERSPEECH-2008*, pp. 968-971, Brisbane, Australia, September 2008.

[25] Raut, C., Nishimoto, T. & Sagayama, S. (2006). Adaptation for long convolutional distortion by maximum likelihood based state filtering approach. *Proceedings of ICASSP-2006*, pp. 937-940, Toulouse, France, May 2006.

[26] Sim, B. L., Tong, Y. C. & Chang, J. S. (1998). A parametric formulation of the generalized spectral subtraction method. *IEEE Transactions on Speech and Audio Processing*, Vol. 6, No. 4, July 1998, pp. 328-337.

[27] Van Veen, B. & Buckley, K. (1988). Beamforming: A versatile approach to spatial filtering. *IEEE ASSP Magazine*, Vol. 5, No. 2, April 1988, pp. 4-24.

**Chapter 8** 

© 2012 Nakayama et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


**Improvement on Sound Quality of the Body Conducted Speech from Optical Fiber Bragg Grating Microphone**


Masashi Nakayama, Shunsuke Ishimitsu and Seiji Nakagawa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47844


## **1. Introduction**

Speech communication can be impaired by the wide range of noise conditions present in air. Researchers in the field of speech applications have been investigating how to improve the performance of signal extraction and recognition under such conditions. However, it is not yet possible to measure clear speech in environments with low Signal-to-Noise Ratios (SNR) of about 0 dB or less (H. Hirsch and D. Pearce, 2000). Standard evaluation frameworks, such as CENSREC (N. Kitaoka et al., 2006) and AURORA (H. Hirsch and D. Pearce, 2000), are typically used to evaluate speech recognition performance in noisy environments; they have shown that recognition rates are approximately 50–80% under the influence of noise, demonstrating the difficulty of achieving high percentages. Against this background, many signal extraction and retrieval methods have been proposed in previous research. One approach to signal extraction is body-conducted speech (BCS), which is little influenced by airborne noise but lacks frequency components above 2 kHz. The authors have been investigating the use of body-conducted speech, generally called bone-conducted speech, where the signal is conducted through the skin and bone of the human body (S. Ishimitsu, 2008) (M. Nakayama et al., 2011). Conventional retrieval methods for the sound quality of body-conducted speech are the Modulation Transfer Function (MTF), Linear Predictive Coefficients (LPC), direct filtering and the use of a throat microphone (T. Tamiya, and T. Shimamura, 2006) (T. T. Vu et al., 2006) (Z. Liu et al., 2004) (S. Dupont, et al., 2004); however, these need normal speech or parameters measured simultaneously with the body-conducted speech, and such parameters cannot be measured in noisy environments. As state-of-the-art research, the field has expanded to speech communication


between a patient and an operator in a Magnetic Resonance Imaging (MRI) room, which has a noisy sound environment with a strong magnetic field (A. Moelker et al., 2005). Conventional microphones, such as accelerometers composed of magnetic materials, are not allowed in this environment, which requires a special microphone made of non-magnetic material.


For this environment, the authors proposed a speech communication system that uses a BCS microphone with an optical fiber Bragg grating (OFBG microphone) (M. Nakayama et al., 2011). It is composed only of non-magnetic materials, is suitable for the environment, and should provide clear signals using our retrieval method. Previous research using an OFBG microphone demonstrated the effectiveness and performance of signal extraction in an MRI room, and its speech recognition performance was evaluated using an acoustic model constructed with unspecified normal speech (M. Nakayama et al., 2011). It was concluded that an OFBG microphone can produce a clear signal, with improved performance against an acoustic model made from unspecified speakers' speech. The original signal of the OFBG microphone enabled conversation; however, some stress was felt because the signal was low in sound quality. Therefore, one of the research aims is to improve the quality with our retrieval method, which uses differential acceleration and noise reduction methods.

In this chapter, experiments and discussions are presented for body-conducted speech measured with an accelerometer and an OFBG microphone, as a state-of-the-art topic in the research field of signal extraction under noisy environments. In particular, we investigate the evaluation of the microphones, signal retrieval with the proposed method, and the application of the method to sentence-long signals for estimating and recovering sound quality.

## **2. Speech and body-conducted speech**

## **2.1. Conventional body-conducted speech microphone**

Speech as air-conducted sound is easily affected by surrounding noise. In contrast, body-conducted speech is solid-propagated sound and thus less affected by noise. A word was uttered by a 20-year-old male in a quiet room. Table 1 details the recording environments for the microphone and accelerometer employed in this research. Speech was measured 30 cm from the mouth using a microphone, and body-conducted speech was extracted from the upper lip using the accelerometer, a conventional BCS microphone, shown in Figure 1. This microphone position is the one commonly used for the speech input of a car navigation system, and the upper lip, as a signal-extraction position, provides the best cepstral coefficients as feature parameters for speech recognition (S. Ishimitsu et al., 2004). Figures 2 and 3 show the uttered word "Asahi" in a quiet room, taken from the JEIDA database, which contains 100 local place names (S. Itahashi, 1991). The speech shows clear frequency characteristics, whereas the body-conducted speech lacks high-frequency components above 2 kHz, so recognition performance is reduced when the signal is used directly.

**Table 1.** Recording environments for microphone and accelerometer.

| Item | Condition |
|---|---|
| Recorder | TEAC RD-200T |
| Microphone | Ono Sokki MI-1431 |
| Microphone amplifier | Ono Sokki SR-2200 |
| Microphone position | 30 cm (between mouth and microphone) |
| Accelerometer | Ono Sokki NP-2110 |
| Accelerometer amplifier | Ono Sokki PS-602 |
| Accelerometer position | Upper lip |

**Figure 1.** Accelerometer.

**Figure 2.** Speech from microphone in quiet.

**Figure 3.** BCS from accelerometer in quiet.



#### **2.2. Optical Fiber Bragg Grating microphone**

To extend testing to scenarios in which noise is generated together with a strong magnetic field, such as communication between a patient and an operator in an MRI room, an OFBG microphone is employed to record body-conducted speech, because it can measure a clearer signal than an accelerometer and can be used in an environment with a strong magnetic field. The effectiveness of the microphone was examined in an MRI room in which a magnetic field is produced by an open-type magnetic resonance imaging system. Tables 2 and 3 detail the recording environments for the OFBG microphone, which is shown in Figure 4. The noise level could not be measured at the recording point (such as the mouth of the speaker) because the sound-level meter, being composed of magnetic materials, was not permitted in the room. Therefore, the noise level was measured at the entrance of the room and may consequently be higher than the noise level at the signal recording point; the measured level is given in Table 2. Owing to patient discomfort during the recordings, only 20 words and 5 sentences were recorded in the room; the recording scene is shown in Figure 5. Figure 6 shows the body-conducted speech recorded with the OFBG microphone in the room while the MRI was activated. Compared with the conventional BCS measured by the accelerometer, the signal is clearer because frequency components above 2 kHz are present.

**Table 2.** Recording environment 1 for OFBG microphone.

**Table 3.** Recording environment 2 for OFBG microphone.

**Figure 4.** OFBG microphone.

**Figure 5.** Signal recording in an MRI room.

**Figure 6.** BCS from OFBG microphone.

**Figure 4.** OFBG microphone


**Figure 5.** Signal recording in an MRI room

**Figure 6.** BCS from OFBG microphone

| Item | Condition |
|---|---|
| MRI model | HITACHI AIRIS II |
| Environment | MRI (OFF): 61.6 dB SPL; MRI (ON): 81.1 dB SPL |
| Speakers | two males (22 and 23 years old), two females (23 and 24 years old) |
| Vocabulary | twenty words × two sets: JEIDA 100 local place names; five sentences × three sets: ATR database sentences |

**Table 2.** Recording environment 1 for OFBG microphone

| Device name | Type name |
|---|---|
| Recorder | TEAC LX-10 |
| Optical-electronic conversion device | Optoacoustics EOU200 |
| Pickup | Optoacoustics Optimic4130 |

**Table 3.** Recording environment 2 for OFBG microphone

## **3. Speech recognition with OFBG microphone**

The quality of the signal recorded with the OFBG microphone is higher than that of BCS recorded with an accelerometer. Generally, the quality of speech sound is evaluated by a mean opinion score from 1 to 5; however, this requires a large amount of evaluation data to achieve adequate significance levels. For this reason, sound quality is evaluated here through speech recognition, using acoustic models estimated from the speech of unspecified speakers, with recognition performance as the measure. In speech recognition, the best candidate is chosen by likelihoods derived from acoustic models and feature parameters such as cepstral parameters, which are calculated from the recorded speech (D. Li and D. O'Shaughnessy, 2003) (L. Rabiner, 1993). The recognition performances and likelihoods are thus statistical results, since human errors and other factors are not considered.

## **3.1. Experimental conditions**

Table 4 shows the experimental conditions for isolated word recognition. The experiment employs Julius, a speech recognition decoder for large-vocabulary continuous speech recognition in Japanese (T. Kawahara et al., 1999) (A. Lee et al., 2001). The decoder requires a dictionary, acoustic models and language models. The dictionary describes the connections of sub-words, such as phonemes and syllables, within each word; the sub-words are modeled by the acoustic models, and the language models give the probability of the present word given the preceding word in corpora. The purpose of the experiment is only to evaluate the clarity of the signals, that is, their closeness to the acoustic models. Since language models are not required for this, Julian version 3.4.2 is used for isolated-word recognition. The experiments use the same acoustic models, estimated by HTK with JNAS, to evaluate the closeness of the signals when the highest recognition performance is achieved (S. Young et al., 2000) (K. Itou et al., 1999).

| Item | Condition |
|---|---|
| Speakers | two males (22 and 23 years old), two females (23 and 24 years old) |
| Number of datasets | 20 words × three sets/person |
| Vocabulary | JEIDA 100 local place names |
| Acoustic model | gender-dependent triphone model |
| Model conditions | 16-mixture Gaussian, clustered 3000 states |
| Training condition | more than 20,000 samples, JNAS with HTK 2.0 |
| Feature vectors | MFCC(12)+ΔMFCC(12)+ΔPow(1) = 25 dim. |
| Recognition system | Julian 3.4.2 |

**Table 4.** Experimental conditions for isolated word recognition
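
The 25-dimensional feature layout of Table 4 can be sketched as follows. This is a hedged approximation using librosa rather than HTK: treating cepstral coefficient 0 as the power term for ΔPow is an assumption, not the exact HTK recipe, and the input file name is a placeholder.

```python
# Hedged sketch of the Table 4 feature layout using librosa instead of HTK.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

c = mfcc[1:13]                            # MFCC(12): coefficients 1..12
dc = librosa.feature.delta(c)             # delta MFCC(12)
dpow = librosa.feature.delta(mfcc[0:1])   # delta Pow(1), c0 as a power proxy

features = np.vstack([c, dc, dpow])       # 12 + 12 + 1 = 25 dimensions
print(features.shape)                     # (25, number_of_frames)
```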

## **3.2. Experimental results**

Table 5 shows the isolated-word recognition results for each data set, and Table 6 gives the average results for each speaker. The results for the OFBG microphone are superior to those for the conventional BCS microphone, with differences in isolated-word recognition rates of about 15% to 35%. These results show the effectiveness of the OFBG microphone when clear signals are measured with it.

| Speaker | MRI off: set 1 | set 2 | set 3 | MRI on: set 1 | set 2 | set 3 |
|---|---|---|---|---|---|---|
| Male 1 | 85% | 80% | 90% | 30% | 40% | 50% |
| Male 2 | 90% | 75% | 85% | 50% | 60% | 60% |
| Female 1 | 35% | 35% | 35% | 20% | 20% | 20% |
| Female 2 | 80% | 70% | 70% | 75% | 70% | 75% |

**Table 5.** Recognition results of isolated word recognition in each data set

| Speaker | MRI off | MRI on |
|---|---|---|
| Male 1 | 85.0% | 40.0% |
| Male 2 | 83.3% | 56.7% |
| Female 1 | 35.0% | 20.0% |
| Female 2 | 73.3% | 73.3% |

**Table 6.** Averages of recognition results
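
Table 6 is simply the per-speaker mean over the three sets of Table 5; a short sanity check, with the values transcribed from the tables above:

```python
# Sanity check: Table 6 as per-speaker means over the three sets of Table 5.
table5 = {  # speaker: ([MRI off, sets 1-3], [MRI on, sets 1-3]) in percent
    "Male 1":   ([85, 80, 90], [30, 40, 50]),
    "Male 2":   ([90, 75, 85], [50, 60, 60]),
    "Female 1": ([35, 35, 35], [20, 20, 20]),
    "Female 2": ([80, 70, 70], [75, 70, 75]),
}
for speaker, (off, on) in table5.items():
    print(f"{speaker}: MRI off {sum(off) / 3:.1f}%, MRI on {sum(on) / 3:.1f}%")
# e.g. Male 2 -> 83.3% / 56.7%, matching Table 6.
```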

## **4. Improvement on sound quality of body-conducted speech in word unit**

The OFBG microphone measures a higher-quality signal than the accelerometer-based BCS microphone. To realize conversation without stress, however, signals with further improved sound quality are required, so another aim of this research is to devise and examine a method for improving sound quality. Much of the related work already introduced in the introduction does not account for the fact that a BCS lacks frequency components of 2 kHz and higher. Mindful of this condition, conventional retrieval methods for BCS that require the corresponding speech and its parameters have been proposed and investigated; however, speech is not easily measured in noisy environments, so a practical retrieval method must perform well using the BCS alone. To realize this idea, a signal retrieval method was devised that needs neither the speech nor other external parameters, exploiting the fact that effective frequency components above 2 kHz do exist in the signal, although at very low gain.

#### **4.1. Differential acceleration**

Formula (1) gives the differential acceleration estimated from the original BCS.

$$x_{\text{differential}}(i) = x(i+1) - x(i) \tag{1}$$


*xdifferential(i)* is the differential acceleration signal, calculated for each frame of a BCS. Because of the low gain of its amplitude, it must be adjusted to a suitable level for listening or further processing. Figure 7 shows the differential acceleration estimated from Figure 6 using Formula (1), with the gain adjusted. The differential acceleration signal appears to consist of speech mixed with stationary noise, which we expected to be able to remove with a noise reduction method, because the signal has a high SNR compared to the original signal. Consequently, a signal estimation method using differential acceleration and a conventional noise reduction method was proposed (M. Nakayama et al., 2011).

**Figure 7.** Differential acceleration from OFBG microphone
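
A minimal sketch of Formula (1) with the gain adjustment described above; the target peak value is an arbitrary illustrative choice, not a parameter from the chapter:

```python
# Minimal sketch of Formula (1): first-order difference of a BCS frame,
# followed by gain normalization for listening or further processing.
import numpy as np

def differential_acceleration(x, target_peak=0.9):
    """First-order difference of a BCS frame, normalized to a usable level."""
    d = x[1:] - x[:-1]                  # x_differential(i) = x(i+1) - x(i)
    peak = np.max(np.abs(d)) + 1e-12
    return d * (target_peak / peak)     # raise the very low gain
```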

#### **4.2. Noise reduction method**

As a first approach to noise reduction, the effectiveness of spectral subtraction for reducing the stationary noise was examined. However, the improvement in the frequency components was inadequate with this approach, since spectral subtraction simply subtracts the noise spectrum. A Wiener-filtering method, in contrast, can estimate the spectral envelope of speech using linear prediction coefficients. We therefore tried to extract a clear signal using the Wiener-filtering method, which can estimate and recover the effective frequency components from noisy speech. Formula (2) shows the equation used for the Wiener-filtering method.

$$H_{\text{Estimate}}(\omega) = \frac{H_{\text{Speech}}(\omega)}{H_{\text{Speech}}(\omega) + H_{\text{Noise}}(\omega)} \tag{2}$$

The estimated spectrum *HEstimate(ω)* converts the differential acceleration signal into a retrieved signal. It is calculated from the speech spectrum *HSpeech(ω)* and the noise spectrum *HNoise(ω)*. In particular, *HSpeech(ω)* is calculated from autocorrelation functions and linear prediction coefficients using the Levinson-Durbin algorithm (J. Durbin, 1960), and *HNoise(ω)* is then estimated using autocorrelation functions.
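
The chapter does not spell the estimator out in full, so the following is one plausible reading of Formula (2), not the authors' implementation: both spectra are taken as LPC envelopes obtained from autocorrelation functions via the Levinson-Durbin recursion, with the noise envelope computed from a noise-only frame (an assumption).

```python
# Hedged sketch of Formula (2): LPC envelopes via autocorrelation and the
# Levinson-Durbin recursion, combined into a Wiener gain.
import numpy as np

def autocorr(x, order):
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return r[:order + 1]

def lpc_levinson(r, order):
    """Levinson-Durbin recursion: r[0..order] -> (LPC coefficients, error)."""
    a = np.zeros(order + 1); a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k
    return a, e

def lpc_envelope(frame, order, n_fft):
    """Magnitude-squared LPC spectral envelope of one frame."""
    a, e = lpc_levinson(autocorr(frame, order), order)
    return e / (np.abs(np.fft.rfft(a, n_fft)) ** 2 + 1e-12)

def wiener_gain(frame, noise_frame, order=1, n_fft=1024):
    h_speech = lpc_envelope(frame, order, n_fft)
    h_noise = lpc_envelope(noise_frame, order, n_fft)
    return h_speech / (h_speech + h_noise)     # Formula (2)
```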

#### **4.3. Evaluations**

Signal retrieval for a signal measured by an OFBG microphone is performed with the same parameters, because the propagation path of body-conducted speech within the human body is not affected by whether the surrounding environment is quiet or noisy. Figure 8 shows the signal retrieved from Figure 7 using the Wiener-filtering method, where the orders of the linear prediction coefficients and autocorrelation functions are 1 and the frame width is 764 samples. The procedure was repeated five times on the signal to remove the stationary noise. In the retrieved signal, high-frequency components of 2 kHz and above were recovered with these settings. The proposed method can therefore also be applied to obtain a clear signal from body-conducted speech measured with the OFBG microphone in a noisy, high-magnetic-field environment.

**Figure 8.** Retrieval signal from OFBG microphone
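
Putting the pieces together, the retrieval loop described above might look like the following sketch, which reuses `differential_acceleration` and `wiener_gain` from the earlier sketches; non-overlapping frames and a noise-only leading frame are simplifying assumptions, not details given in the chapter.

```python
# Sketch of the Sec. 4.3 retrieval loop: differential acceleration followed
# by repeated Wiener filtering (764-sample frames, five passes for the
# word-unit OFBG signal).
import numpy as np

def retrieve(bcs, n_pass=5, frame_len=764):
    x = differential_acceleration(bcs)         # Formula (1)
    for _ in range(n_pass):                    # repeated noise reduction
        noise_frame = x[:frame_len]            # assumed noise-only region
        out = np.zeros_like(x)
        for s in range(0, len(x) - frame_len + 1, frame_len):
            frame = x[s:s + frame_len]
            gain = wiener_gain(frame, noise_frame, order=1, n_fft=frame_len)
            out[s:s + frame_len] = np.fft.irfft(np.fft.rfft(frame) * gain,
                                                frame_len)
        x = out
    return x
```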


## **5. Improvement on sound quality of body-conducted speech in sentence unit**

The effectiveness of signal retrieval for word-unit body-conducted speech measured by an accelerometer and an OFBG microphone was demonstrated in the former sections. Although the effectiveness for the word unit is proven, signals in sentence units must also be examined for practical use, such as conversation in noisy environments; this evaluation is important because it could revolutionize speech communication in such environments. As a first step in signal retrieval for the sentence unit, the same method used for word-unit signals is adopted, because the transfer function between the microphone and the sound source seems to change little between word and sentence units. Body-conducted speech in sentence units, measured directly by an accelerometer and an OFBG microphone, is examined.

## **5.1. Body-conducted speech from an accelerometer**

In the experiments on signal retrieval using an accelerometer, speech and body-conducted speech were measured in a quiet room of our laboratory and in the engine room of the training ship of the Oshima National College of Maritime Technology, a noisy environment with a working main engine and two generators, shown in Figures 9 (a) and (b). The recording environment of Table 1 is used again; however, the speaker differs from the one in the former section. Noise levels in the engine room under the two conditions of anchorage and cruising were 93 and 98 dB SPL, respectively, and the SNR measured at the microphone was –20 and –25 dB, respectively. In this research, signals recorded under the cruising condition are used to estimate retrieved signals.
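
The SNR figures quoted here can be reproduced in the usual way from a noise-only recording and a speech-plus-noise recording at the microphone; a hedged sketch with placeholder file names, assuming mono WAV input:

```python
# SNR = 10*log10(P_signal / P_noise), estimated from two recordings.
import numpy as np
from scipy.io import wavfile

def mean_power(path):
    _, x = wavfile.read(path)
    return np.mean(x.astype(np.float64) ** 2)

p_noisy = mean_power("mic_speech_cruising.wav")   # speech + engine noise
p_noise = mean_power("mic_noise_cruising.wav")    # engine noise only
snr_db = 10.0 * np.log10(max(p_noisy - p_noise, 1e-12) / p_noise)
print(f"estimated SNR: {snr_db:.1f} dB")          # about -25 dB while cruising
```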

A 22-year-old male uttered sentence A01 from the ATR503 sentence database, a sentence commonly used in speech recognition and its applications (M. Abe et al., 1991). The sentence is composed of the following sub-words (morae):

 /a/ /ra/ /yu/ /ru/ /ge/ /N/ /ji/ /tsu/ /wo/ /su/ /be/ /te/ /ji/ /bu/ /N/ /no/ /ho/ /u/ /he/ /ne/ /ji/ /ma/ /ge/ /ta/ /no/ /da/

(a) Main engine of Oshima-maru (b) Signal recording in the engine room

**Figure 9.** The engine room in Oshima-maru


Figures 10 and 11 show sentence-unit speech and body-conducted speech measured by the conventional microphone and the accelerometer in a quiet room while a 22-year-old male uttered the sentence. Although the accelerometer was held with the fingers, the sounds are measured clearly, because it was held firmly against the upper lip with suitable pressure. Figure 12 shows the differential acceleration computed from Figure 11; it becomes a clear signal with little noise because the BCS has a high SNR.

**Figure 10.** Speech of sentence in quiet

**Figure 11.** BCS of sentence in quiet

**Figure 12.** Differential acceleration of sentence in quiet

Figures 13 and 14 show sentence-unit speech and body-conducted speech in the noisy environment. The speech is completely swamped by the intense noise from the engine and generators. The body-conducted speech in Figure 14, on the other hand, is only slightly affected by the noise and can still be measured. Because the signal in Figure 14 has a low SNR, the performance of signal retrieval from the differential acceleration in Figure 15 is expected to be reduced. Nevertheless, Figure 16 shows that signal retrieval from the differential acceleration works well when the method is applied four times, which is sufficient to recover the frequency characteristics. As a result, it is concluded that body-conducted speech can be made as clear as possible without noise disturbance.

**Figure 13.** Speech of sentence in noise environment

**Figure 14.** BCS of sentence in noise environment

**Figure 15.** Differential acceleration of sentence in noise environment

**Figure 16.** Retrieval BCS of sentence in noise environment


#### **5.2. Body-conducted speech from OFBG microphone**

The quality of the signal measured by the OFBG microphone in the noisy environment of an MRI room is investigated here. A speaker uttered sentence A01 during the operation of the MRI device, an 81 dB SPL noise environment; since a sound-level meter was not permitted in the room, this level was measured in front of the gate door. Figure 17 shows the uttered sentence recorded by the OFBG microphone in the MRI room while the MRI equipment was in operation. Since the signal is clear, the frequency characteristics of the signal are expected to be recoverable with the signal retrieval method. Figures 18 and 19 show the differential acceleration and the retrieved signal from the OFBG microphone under the same conditions, with the method applied three times. These figures confirm the improvement in the sound quality of sentence-unit BCS, and it is also concluded that retrieval works best when the SNR of the original BCS is high.
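
Reusing the `retrieve` sketch from Section 4.3, the sentence-unit MRI case would simply change the number of passes; a brief usage sketch with a placeholder file name:

```python
# Usage sketch for the MRI-room sentence (three passes, per the text),
# reusing retrieve() from the Sec. 4.3 sketch. Placeholder file names;
# mono WAV input assumed.
from scipy.io import wavfile

sr, bcs_mri = wavfile.read("ofbg_sentence_mri_on.wav")
retrieved = retrieve(bcs_mri.astype("float64"), n_pass=3, frame_len=764)
wavfile.write("retrieved_sentence.wav", sr, retrieved.astype("float32"))
```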

**Figure 17.** BCS of sentence in MRI room

**Figure 18.** Differential acceleration of sentence in MRI room

**Figure 19.** Retrieval signal of sentence in MRI room

## **6. Conclusions and future works**


This chapter presented improvements in the sound quality of body-conducted speech measured with an accelerometer and an OFBG microphone. The MRI room, in particular, combines heavy noise with a high magnetic field, and this environment does not allow a conventional body-conducted speech microphone such as an accelerometer, which is made from magnetic materials, to be brought in. For conversation and communication between a patient and an operator in the room, an OFBG microphone was proposed, which measures clearer signals than an accelerometer.

The performance of the signal retrieval method on sentence-unit signals from both microphones, the accelerometer and the OFBG microphone, was then evaluated, and its effectiveness was confirmed by time-frequency analysis and speech recognition. Against this background, the estimation of clear sentence-unit body-conducted speech from an OFBG microphone was investigated with our signal retrieval method, which combines differential acceleration and noise reduction. Applying the method to the measured signal recovered its sound quality, as evaluated by time-frequency analysis. The retrieval method can be applied to a signal measured by an OFBG microphone with the same settings, because its conduction path is not affected by noise in the air. The signals were measured in quiet and noisy rooms, specifically an engine room and an MRI room, and clear signals were obtained using the signal retrieval method with the same settings used for the word unit as a first step. To obtain a clearer signal with the method, the pressure with which the microphone is held is important, and the sounds then have a high SNR in the original BCS.


As future work, the signal retrieval method needs to be extended for practical use, and its algorithm needs further improvement.

## **Author details**

Masashi Nakayama *Kagawa National College of Technology, Japan; National Institute of Advanced Industrial Science and Technology (AIST), Japan* 

Shunsuke Ishimitsu *Hiroshima City University, Japan* 

Seiji Nakagawa *National Institute of Advanced Industrial Science and Technology (AIST), Japan* 

## **Acknowledgement**

The authors thank Mr. K. Oda, Mr. H. Nagoshi and his colleagues in the Ishimitsu laboratory of Hiroshima City University, the members of the Living Informatics Research Group, Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST) for their support in the signal recording, and the crew members of the training ship Oshima-maru, Oshima National College of Maritime Technology.

## **7. References**


A. Lee, T. Kawahara, and K. Shikano (2001). Julius - an open source real-time large vocabulary recognition engine, in Proceedings of European Conference on Speech Communication and Technology (EUROSPEECH), pp. 1691-1694

A. Moelker, R. A. J. J. Maas, M. W. Vogel, M. Ouhlous, and P. M. T. Pattynama (2005). Importance of bone-conducted sound transmission on patient hearing in the MR scanner, Journal of Magnetic Resonance Imaging, Vol. 22, Issue 1, pp. 163-169

D. Li, and D. O'Shaughnessy (2003). Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc.

H. Hirsch, and D. Pearce (2000). The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in proceedings of ISCA ITRW ASR2000, pp. 181-188

J. Durbin (1960). The Fitting of Time-Series Models, Review of the International Statistical Institute, Vol. 28, No. 3, pp. 233-244

K. Itou, M. Yamamoto, K. Takeda, T. Takezawa, T. Matsuoka, T. Kobayashi, K. Shikano, and S. Itahashi (1999). JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research, Journal of the Acoustical Society of Japan (E), 20(3), pp. 199-206

L. Rabiner (1993). Fundamentals of Speech Recognition, Prentice Hall

M. Abe, Y. Sagisaka, T. Umeda, and H. Kuwabara (1990). Manual of Japanese Speech Database, ATR

M. Nakayama, S. Ishimitsu, and S. Nakagawa (2011). A study of making clear body-conducted speech using differential acceleration, IEEJ Transactions on Electrical and Electronic Engineering, Vol. 6, Issue 2, pp. 144-150

M. Nakayama, S. Ishimitsu, H. Nagoshi, S. Nakagawa, and K. Fukui (2011). Body-conducted speech microphone using an Optical Fiber Bragg Grating for high magnetic field and noisy environments, in proceedings of Forum Acusticum 2011

N. Kitaoka, T. Yamada, S. Tsuge, C. Miyajima, T. Nishiura, M. Nakayama, Y. Denda, M. Fujimoto, K. Yamamoto, T. Takiguchi, S. Kuroiwa, K. Takeda, and S. Nakamura (2006). CENSREC-1-C: development of evaluation framework for voice activity detection under noisy environment, IPSJ SIG Technical Report, 2006-SLP-63, pp. 1-6

S. Dupont, C. Ris, and D. Bachelart (2004). Combined use of close-talk and throat microphones for improved speech recognition under non-stationary background noise, in proceedings of COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, paper 31

S. Ishimitsu (2008). Construction of a Noise-Robust Body-Conducted Speech Recognition System, in Speech Recognition, IN-TECH

S. Ishimitsu, H. Kitakaze, Y. Tsuchibushi, H. Yanagawa, and M. Fukushima (2004). A noise-robust speech recognition system making use of body-conducted signals, Acoustical Science and Technology, Vol. 25, No. 2, pp. 166-169

S. Ishimitsu, M. Nakayama, and Y. Murakami (2004). Study of Body-Conducted Speech Recognition for Support of Maritime Engine Operation, Journal of the JIME, Vol. 39, No. 4, pp. 35-40 (in Japanese)

S. Itahashi (1991). A noise database and Japanese common speech data corpus, Journal of ASJ, Vol. 47, No. 12, pp. 951-953

S. Young, J. Jansen, J. Odell, and P. Woodland (2000). The HTK Book for V2.0, Cambridge University

T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro, and K. Shikano (1999). Japanese dictation toolkit - 1997 version, Journal of ASJ, Vol. 20, No. 3, pp. 233-239

T. T. Vu, M. Unoki, and M. Akagi (2006). A Study on Restoration of Bone-conducted Speech With LPC Based Model, IEICE Technical Report, SP2005-174, pp. 67-78

T. Tamiya, and T. Shimamura (2006). Improvement of Body-Conducted Speech Quality by Adaptive Filters, IEICE Technical Report, SP2006-191, pp. 41-46

Z. Liu, Z. Zhang, A. Acero, J. Droppo, and X. Huang (2004). Direct Filtering for Air- and Bone-Conductive Microphones, in proceedings of IEEE International Workshop on Multimedia Signal Processing (MMSP'04), pp. 363-366

**Chapter 9**

**Esophageal Speech Enhancement Using a Feature Extraction Method Based on Wavelet Transform**

Alfredo Victor Mantilla Caeiros and Hector Manuel Pérez Meana

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49943

## **1. Introduction**

People who suffer from diseases such as throat cancer may require the surgical removal of the larynx and vocal cords, after which rehabilitation is needed so that they can reintegrate into their individual, social, family and work activities. To accomplish this, different methods have been suggested, such as esophageal speech, tracheoesophageal prosthetics and the Artificial Larynx Transducer (ALT), also known as the "electronic larynx" [1, 2].

The ALT, which has the shape of a handheld device, introduces an excitation into the vocal tract by applying a vibration against the external walls of the neck. The excitation is then modulated by the movement of the oral cavity to produce the speech sound. The transducer is held against the speaker's neck, and in some cases the speaker's cheeks. The ALT is widely recommended by voice rehabilitation physicians because it is very easy to use, even for new patients, although the voice produced by these transducers is unnatural and of low quality, and is further distorted by the background noise the ALT produces. Thus, the ALT results in considerable degradation of the quality and intelligibility of speech, a problem for which an optimal solution has not yet been found [2].

Esophageal speech, on the other hand, is produced by compressing the air contained in the vocal tract, from the stomach to the mouth through the esophagus. This air is swallowed, and it produces a vibration of the upper esophageal muscle as it passes through the esophageal-larynx segment, producing speech. The generated sound is similar to a burp, the tone is commonly very low, and the timbre is generally harsh. As with ALT-produced speech, the voiced segments of esophageal speech are the most affected parts of speech within a word or phrase, resulting in unnatural speech. Many efforts have therefore been carried out to improve its quality and intelligibility.

© 2012 Mantilla Caeiros and Pérez Meana, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.