**Robust Speech Recognition for Adverse Environments**

Chung-Hsien Wu and Chao-Hong Liu

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47843

## **1. Introduction**

Although state-of-the-art speech recognizers can achieve a very high recognition rate for clean speech, recognition performance generally degrades drastically in noisy environments. Noise-robust speech recognition has therefore become an important task for speech recognition in adverse environments. Recent research on noise-robust speech recognition has mostly focused on two directions: (1) removing the noise from the corrupted noisy signal in signal space or feature space, for example by noise filtering, such as spectral subtraction (Boll 1979), Wiener filtering (Macho et al. 2002) and RASTA filtering (Hermansky et al. 1994), or by model-based speech or feature enhancement, such as SPLICE (Deng et al. 2003) and stochastic vector mapping (Wu et al. 2002); (2) compensating for the noise effect in the acoustic models in model space so that the training environment matches the test environment, for example PMC (Wu et al. 2004) or multi-condition/multi-style training (Deng et al. 2000). The noise filtering approaches require some prior information, such as the spectral characteristics of the noise, and their performance degrades when the noisy environment varies drastically or the noise environment is unknown. Furthermore, (Deng et al. 2000; Deng et al. 2003) have shown that denoising or preprocessing is superior to retraining the recognizers under matched noise conditions with no preprocessing.

Stochastic vector mapping (SVM) (Deng et al. 2003; Wu et al. 2002) and sequential noise estimation (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) for noise normalization have been proposed and have achieved significant improvement in noisy speech recognition. However, some drawbacks and limitations remain. First, the performance of sequential noise estimation decreases when the noisy environment varies drastically. Second, the environment mismatch between training data and test data still exists and results in performance degradation. Third, the maximum-likelihood-based stochastic vector mapping (SPLICE) requires annotation of the environment type and stereo training data. However, stereo data are not available for most noisy environments.

© 2012 Wu and Liu, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


To overcome the insufficient tracking ability of the sequential expectation-maximization (EM) algorithm, prior models are introduced in this chapter to provide more information for sequential noise estimation. Furthermore, environment model adaptation is applied to reduce the mismatch between the training data and the test data. Finally, a minimum classification error (MCE)-based approach (Wu et al. 2002) is employed so that stereo training data are not needed, and unsupervised frame-based auto-clustering is adopted to automatically detect the environment type of the training data (Hsieh et al. 2008).

For the recognition of disfluent speech, a number of cues can be observed when edit disfluency occurs in spontaneous speech. These cues can be detected from linguistic features, acoustic features (Shriberg et al. 2000) and integrated knowledge sources (Bear et al. 1992). (Shriberg et al. 2005) outlined the phonetic consequences of disfluency to improve models for disfluency processing in speech applications. Four types of disfluency based on intonation, segment duration and pause duration were presented in (Savova et al. 2003). (Soltau et al. 2005) used a discriminatively trained full-covariance Gaussian system for rich transcription. (Furui et al. 2005) presented approaches to corpus collection, analysis and annotation for conversational speech processing.

(Charniak et al. 2001) proposed an architecture for parsing the transcribed speech using an edit word detector to remove edit words or fillers from the sentence string, and then a standard statistical parser was used to parse the remaining words. The statistical parser and the parameters estimated by boosting were employed to detect and correct the disfluency. (Heeman et al. 1999) presented a statistical language model that is able to identify POS tags, discourse markers, speech repairs and intonational phrases. A noisy channel model was used to model the disfluency in (Johnson et al. 2004). (Snover et al. 2004) combined the lexical information and rules generated from 33 rule templates for disfluency detection. (Hain et al. 2005) presented the techniques in front-end processing, acoustic modeling, language and pronunciation modeling for transcribing the conversational telephone speech automatically. (Liu et al. 2005) compared the HMM, maximum entropy, and conditional random fields for disfluency detection in detail.

In this chapter, an approach to the detection and correction of edit disfluency based on word order information is presented (Yeh et al. 2006). The first process attempts to detect the interruption points (IPs) based on hypothesis testing. Acoustic features including duration, pitch and energy features are adopted in hypothesis testing. To circumvent the problems resulting from disfluency, especially edit disfluency, a reliable and robust language model for correcting speech recognition errors is employed. To handle language-related phenomena in edit disfluency, a cleanup language model characterizing the structure of the cleanup sentences and an alignment model for aligning words between the deletable region and the correction part are proposed for edit disfluency detection and correction.

Furthermore, multilinguality frequently occurs in speech content, and the ability of speech recognition systems to process speech in multiple languages has become increasingly desirable due to the trend of globalization. In general, there are different approaches to achieving multilingual speech recognition. One approach employs an external language identification (LID) system (Wu et al. 2006) to first identify the language of the input utterance; the corresponding monolingual system is then selected to perform the speech recognition (Waibel et al. 2000). The accuracy of the external LID system is the main factor in the overall system performance.

Another approach to multilingual speech recognition is to run all the monolingual recognizers in parallel and select the output generated by the recognizer that obtains the maximum likelihood score. The performance of the multilingual speech recognition then depends on the post-end selection of the maximum likelihood sequence. A popular approach to multilingual speech recognition is the use of a multilingual phone set. The multilingual phones are usually created by merging acoustically similar phones across the target languages in an attempt to obtain a minimal phone set that covers all the sounds existing in all the target languages (Kohler 2001).

In this chapter, an approach to phonetic unit generation for mixed-language or multilingual speech recognition is presented (Huang et al. 2007). The International Phonetic Alphabet (IPA) representation is employed for phonetic unit modeling. Context-dependent triphones for Mandarin and English speech are constructed based on the IPA representation. Acoustic and contextual analysis is used to characterize the properties of the multilingual context-dependent phonetic units. Acoustic likelihood is adopted for the pair-wise similarity estimation of the context-dependent phone models to construct a confusion matrix. The hyperspace analog to language (HAL) model is used for contextual modeling and then for contextual similarity estimation between phone models.

This chapter is organized as follows. Section 2 presents two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping. Section 3 describes an approach to edit disfluency detection and correction for rich transcription. In Section 4, the fusion of acoustic and contextual analysis is described to generate phonetic units for mixed-language or multilingual speech recognition. Finally, conclusions are provided in the last section.

## **2. Speech recognition in noisy environment**


In this section, an approach to feature enhancement for noisy speech recognition is presented. Three prior models are introduced to characterize clean speech, noise and noisy speech, respectively. The framework of the system is shown in Figure 1. Sequential noise estimation is employed for prior model construction based on noise-normalized stochastic vector mapping (NN-SVM). Because the environment types are obtained by auto-clustering the estimated noise data, feature enhancement can work without stereo training data and without manual tagging of the background noise type. Environment model adaptation is also adopted to reduce the mismatch between the training data and the test data.


**Figure 1.** Diagram of training and test phases for noise-robust speech recognition

#### **2.1. NN-SVM for cepstral feature enhancement**

#### *2.1.1. Stochastic Vector Mapping (SVM)*

The SVM-based feature enhancement approach estimates the clean speech feature *x* from the noisy speech feature *y* through an environment-dependent mapping function $F(y; \Theta^{(e)})$, where $\Theta^{(e)}$ denotes the mapping function parameters and *e* denotes the corresponding environment of the noisy speech feature *y*.

Assuming that the training data of the noisy speech *Y* can be partitioned into $N_E$ different noisy environments, the feature vectors of *Y* under an environment *e* can be modeled by a Gaussian mixture model (GMM) with $N_k$ mixtures:

$$p\left(y \mid e; \Omega_e\right) = \sum_{k=1}^{N_k} p\left(k \mid e\right) p\left(y \mid k, e\right) = \sum_{k=1}^{N_k} \omega_k^e \cdot N\left(y; \xi_k^e, R_k^e\right) \tag{1}$$

where $\Omega_e$ represents the environment model. The clean speech feature *x* can be estimated using a stochastic vector mapping function, which is defined as follows:

$$\hat{x} \triangleq F\left(y; \Theta^{(e)}\right) = y + \sum_{e=1}^{N_E} \sum_{k=1}^{N_k} p\left(k \mid y, e\right) r_k^e \tag{2}$$

where the posterior probability $p(k \mid y, e)$ can be estimated using Bayes' rule based on the environment model $\Omega_e$ as follows:


$$p\left(k \mid y, e\right) = \frac{p\left(k \mid e\right) p\left(y \mid k, e\right)}{\sum_{j=1}^{N_k} p\left(j \mid e\right) p\left(y \mid j, e\right)} \tag{3}$$

and $\Theta^{(e)} = \{r_k^e\}_{k=1}^{N_k}$ denotes the mapping function parameters. Generally, $\Theta^{(e)}$ is estimated from a set of training data using the maximum likelihood criterion.
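The mapping in Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the chapter's implementation: it handles a single environment *e* (Eq. (2) additionally sums the correction over environments), assumes diagonal covariances, and all function names and numeric values are hypothetical.

```python
import numpy as np

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)

def posteriors(y, weights, means, variances):
    """p(k | y, e) of Eq. (3) for one environment, computed in the log domain."""
    log_joint = np.log(weights) + log_gauss_diag(y, means, variances)  # shape (N_k,)
    log_joint -= log_joint.max()           # stabilise before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()

def svm_enhance(y, weights, means, variances, r):
    """x_hat = y + sum_k p(k | y, e) r_k for a single environment e."""
    return y + posteriors(y, weights, means, variances) @ r

# Toy environment model: N_k = 2 mixtures in 3 dimensions (made-up numbers).
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
variances = np.ones((2, 3))
r = np.array([[-0.5, -0.5, -0.5], [0.5, 0.5, 0.5]])   # hypothetical bias vectors r_k

y = np.array([1.9, 2.1, 2.0])
post = posteriors(y, weights, means, variances)
x_hat = svm_enhance(y, weights, means, variances, r)
```

Because *y* lies close to the second mixture mean, almost all posterior mass falls on that mixture and the correction is dominated by its bias vector.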

For the estimation of the mapping function parameters $\Theta^{(e)}$, if stereo data, which contain a clean speech signal and the corrupted noisy speech signal derived from the identical clean speech signal, are available, the SPLICE-based approach can be directly adopted. However, stereo data are not easily available in real-life applications. In this chapter, an MCE-based approach is proposed to overcome this limitation. Furthermore, the environment type of the noisy speech data is needed for training the environment model. The noisy speech data are manually classified into $N_E$ noisy environment types. This strategy assigns each noisy speech file to only one environment type and is very time consuming. In practice, each noisy speech file can contain several segments with different types of noisy environment. Since the noisy speech annotation affects the purity of the training data for the environment model, this section introduces a frame-based unsupervised noise clustering approach to construct a more precise categorization of the noisy speech.

#### *2.1.2. Noise-Normalized Stochastic Vector Mapping (NN-SVM)*

In (Boll 1979), the concept of noise normalization is proposed to reduce the effect of background noise in noisy speech for feature enhancement. If the noise feature vector $\tilde{n}$ of each frame can be estimated first, the NN-SVM is obtained from Eq. (2) by replacing $y$ and $\hat{x}$ with $y - \tilde{n}$ and $\hat{x} - \tilde{n}$ as

$$\hat{x} - \tilde{n} \triangleq F\left(y - \tilde{n}; \Theta^{(e)}\right) = y - \tilde{n} + \sum_{e=1}^{N_E} \sum_{k=1}^{N_k} p\left(k \mid y - \tilde{n}, e\right) r_k^e \tag{4}$$

The noise normalization process makes the environment model $\Omega_e$ more noise-tolerant. Clearly, the algorithm used to estimate the noise feature vector $\tilde{n}$ plays an important role in noise-normalized stochastic vector mapping.
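A small sketch may help to show what noise normalization buys. It applies the mapping to the noise-normalized feature and then adds the noise estimate back. As before, it is restricted to a single environment with diagonal covariances, and the function name, parameters and numbers are illustrative assumptions.

```python
import numpy as np

def nn_svm_enhance(y, n_est, weights, means, variances, r):
    """Eq. (4) for one environment: map y - n_est, then add n_est back."""
    z = y - n_est                                         # noise-normalised feature
    log_joint = np.log(weights) - 0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (z - means) ** 2 / variances, axis=1)
    post = np.exp(log_joint - log_joint.max())
    post /= post.sum()                                    # p(k | y - n_est, e)
    return z + post @ r + n_est                           # x_hat

# Toy model trained (hypothetically) on noise-normalised features.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.ones((2, 2))
r = np.array([[0.1, 0.1], [-0.2, -0.2]])                  # hypothetical bias vectors

y = np.array([3.0, 3.5])
n_est = np.array([2.5, 2.5])
x_hat = nn_svm_enhance(y, n_est, weights, means, variances, r)

# Shifting the observation and the noise estimate by the same offset
# leaves y - n_est, and hence the posteriors, unchanged.
offset = np.array([0.3, -0.1])
x_hat_shifted = nn_svm_enhance(y + offset, n_est + offset, weights, means, variances, r)
```

The environment model only ever sees $y - \tilde{n}$, so a change in noise level that is tracked by the noise estimate does not move the feature away from the region the model was trained on.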

#### **2.2. Prior model for sequential noise estimation**

This section employs a frame-based sequential noise estimation algorithm (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) that incorporates the prior models. In the procedure, only the noisy speech feature vector of the current frame is observed. Since the noise and clean speech feature vectors are both unobserved, the relation among clean speech, noise and noisy speech must be established first. The sequential EM algorithm is then introduced for online noise estimation based on this relation, and the prior models are used to provide more information for noise estimation.

#### *2.2.1. The acoustic environment model*

The nonlinear acoustic environment model is introduced first for noise estimation in (Deng et al. 2003). Given the cepstral features of a clean speech *x* , an additive noise *n* and a channel distortion *h* , the approximated nonlinear relation among *x* , *n* , *h* and the corrupted noisy speech *y* in cepstral domain is estimated as:

$$y \approx h + x + g\left(n - h - x\right), \quad g(z) = C \ln\left(I + \exp\left[C^T z\right]\right) \tag{5}$$


where **C** denotes the discrete cosine transform matrix. In order to linearize the nonlinear model, the first-order Taylor series expansion is used around two updated operating points $n_0$ and $\mu_0^x$, denoting the initial noise feature and the mean vector of the prior clean speech model, respectively. By ignoring the channel distortion effect, that is, setting $h = 0$, Eq. (5) is then approximated as:

$$\begin{aligned} y &\approx \mu_0^x + g\left(n_0 - \mu_0^x\right) + G\left(n_0 - \mu_0^x\right)\left(x - \mu_0^x\right) + \left[I - G\left(n_0 - \mu_0^x\right)\right]\left(n - n_0\right) \\ \text{where } G(z) &= -C\,\mathrm{diag}\left(\frac{I}{I + \exp\left[C^T z\right]}\right) C^T. \end{aligned} \tag{6}$$
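The quality of a first-order expansion of Eq. (5) can be checked numerically. The sketch below assumes a square orthonormal DCT-II matrix, sets $h = 0$, and works directly with the Jacobian of $g$, $J(z) = C\,\mathrm{diag}(\sigma(C^T z))\,C^T$ with $\sigma$ the logistic function, rather than with the $G$ notation of the text. How $J$ maps onto $G$ is not asserted here, and all names are hypothetical.

```python
import numpy as np

def dct_matrix(dim):
    """Orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(dim)[:, None]
    i = np.arange(dim)[None, :]
    c = np.sqrt(2.0 / dim) * np.cos(np.pi * (i + 0.5) * k / dim)
    c[0, :] /= np.sqrt(2.0)
    return c

def g(z, c):
    """g(z) = C ln(1 + exp(C^T z)), the nonlinearity in Eq. (5)."""
    return c @ np.log1p(np.exp(c.T @ z))

def jac_g(z, c):
    """Analytic Jacobian dg/dz = C diag(sigmoid(C^T z)) C^T."""
    s = 1.0 / (1.0 + np.exp(-(c.T @ z)))
    return c @ np.diag(s) @ c.T

def y_exact(x, n, c):
    """Eq. (5) with the channel term h set to zero."""
    return x + g(n - x, c)

def y_linear(x, n, x0, n0, c):
    """First-order Taylor expansion of y_exact around (x0, n0)."""
    j = jac_g(n0 - x0, c)
    eye = np.eye(len(x0))
    return x0 + g(n0 - x0, c) + (eye - j) @ (x - x0) + j @ (n - n0)

rng = np.random.default_rng(0)
c = dct_matrix(4)
x0, n0 = rng.normal(size=4), rng.normal(size=4)           # operating points
x = x0 + 1e-4 * rng.normal(size=4)                        # small perturbations
n = n0 + 1e-4 * rng.normal(size=4)
err = np.max(np.abs(y_exact(x, n, c) - y_linear(x, n, x0, n0, c)))
```

The approximation error grows with the squared distance from the operating points, which is one way to see why the operating points are updated during estimation.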

#### *2.2.2. The prior models*

The three prior models $\Phi_n$, $\Phi_x$ and $\Phi_y$, which denote the noise, clean speech and noisy speech models respectively, can provide more information for sequential noise estimation. First, the noise and clean speech prior models are characterized by GMMs as:

$$p\left(n; \Phi_n\right) = \sum_{d=1}^{N_d} w_d^n \cdot N\left(n; \mu_d^n, \Sigma_d^n\right), \quad p\left(x; \Phi_x\right) = \sum_{m=1}^{N_m} w_m^x \cdot N\left(x; \mu_m^x, \Sigma_m^x\right) \tag{7}$$

where pre-training data for noise and clean speech are required to train the model parameters of the two GMMs, $\Phi_n$ and $\Phi_x$.

While the prior noisy speech model is needed in sequential noise estimation, the noisy speech model parameters are derived from the prior clean speech and noise models using the approximated linear model around two operating points $\mu_0^n$ and $\mu_0^x$ as follows:

$$p\left(y; \Phi_y\right) = \sum_{m=1}^{N_m} \sum_{d=1}^{N_d} w_{m,d}^{y} \cdot N\left(y; \mu_{m,d}^{y}, \Sigma_{m,d}^{y}\right) \tag{8}$$

$$\begin{aligned} \mu_{m,d}^{y} &= \mu_0^x + g\left(\mu_0^n - \mu_0^x\right) + G\left(\mu_0^n - \mu_0^x\right)\left(\mu_m^x - \mu_0^x\right) + \left[I - G\left(\mu_0^n - \mu_0^x\right)\right]\left(\mu_d^n - \mu_0^n\right) \\ \Sigma_{m,d}^{y} &= \left[I + G\left(\mu_0^n - \mu_0^x\right)\right] \Sigma_m^x \left[I + G^T\left(\mu_0^n - \mu_0^x\right)\right]^T \\ \mu_0^n &= E\left[\mu_d^n\right], \quad \mu_0^x = E\left[\mu_m^x\right], \quad w_{m,d}^{y} = w_m^x w_d^n \end{aligned} \tag{9}$$

The noisy speech prior model will be employed to search for the most similar clean speech mixture component and noise mixture component in sequential noise estimation.
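The bookkeeping behind Eqs. (8) and (9) can be illustrated for the mixture weights and operating points. In this sketch every clean speech mixture *m* is paired with every noise mixture *d*. Reading $E[\cdot]$ as the mixture-weighted average of the component means is an assumption, and the function name and numbers are hypothetical. The component means and covariances of Eq. (9) are not computed here.

```python
import numpy as np

def combine_priors(w_x, mu_x, w_n, mu_n):
    """Pair every clean-speech mixture m with every noise mixture d."""
    w_y = np.outer(w_x, w_n)      # w^y_{m,d} = w^x_m * w^n_d, shape (N_m, N_d)
    mu0_x = w_x @ mu_x            # operating point: weighted mean of the clean means
    mu0_n = w_n @ mu_n            # operating point: weighted mean of the noise means
    return w_y, mu0_x, mu0_n

# Toy priors: N_m = 3 clean mixtures and N_d = 2 noise mixtures in 2 dimensions.
w_x = np.array([0.2, 0.5, 0.3])
mu_x = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
w_n = np.array([0.7, 0.3])
mu_n = np.array([[-1.0, -1.0], [0.0, -2.0]])

w_y, mu0_x, mu0_n = combine_priors(w_x, mu_x, w_n, mu_n)
```

The noisy speech prior therefore has $N_m \times N_d$ components whose weights sum to one, so each component identifies one clean speech mixture and one noise mixture at the same time.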

#### *2.2.3. Sequential noise estimation*


The sequential EM algorithm is employed for sequential noise estimation. In this section, the prior clean speech, noise and noisy speech models are used to construct a robust noise estimation procedure. Based on the sequential EM algorithm, the estimated noise is obtained as $\mathbf{n}\_{t+1} = \arg\max\_{\mathbf{n}} Q\_{t+1}(\mathbf{n})$. In the E-step of the sequential EM algorithm, an objective function is defined as:

$$Q\_{t+1}(\mathbf{n}) \triangleq E\left[\ln p(y\_1^{t+1}, M\_1^{t+1}, D\_1^{t+1} \mid \mathbf{n}) \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] \tag{10}$$

where $M\_1^{t+1}$ and $D\_1^{t+1}$ denote the mixture index sequences of the clean speech GMM and the noise GMM that generate the noisy speech $y$ from frame 1 to frame $t+1$. The objective function is simplified for the M-step as:

$$\begin{split} Q\_{t+1}(\mathbf{n}) & \triangleq E\left[\ln p(y\_1^{t+1}, M\_1^{t+1}, D\_1^{t+1} \mid \mathbf{n}) \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] \\ &= E\left[\ln p(y\_1^{t+1} \mid M\_1^{t+1}, D\_1^{t+1}, \mathbf{n}) + \ln p(M\_1^{t+1} \mid D\_1^{t+1}, \mathbf{n}) + \ln p(D\_1^{t+1} \mid \mathbf{n}) \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] \\ & \approx E\left[\ln p(y\_1^{t+1} \mid M\_1^{t+1}, D\_1^{t+1}, \mathbf{n}) \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] + \text{Const} \\ &= \sum\_{\tau=1}^{t+1} E\left[\left(\sum\_{m=1}^{N\_m} \sum\_{d=1}^{N\_d} \ln p(y\_\tau \mid m, d, \mathbf{n}) \cdot \delta\_{m\_\tau, m} \cdot \delta\_{d\_\tau, d}\right) \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] + \text{Const} \\ &= \sum\_{\tau=1}^{t+1} \sum\_{m=1}^{N\_m} \sum\_{d=1}^{N\_d} E\left[\delta\_{m\_\tau, m} \delta\_{d\_\tau, d} \mid y\_1^{t+1}, \mathbf{n}\_1^t\right] \ln p(y\_\tau \mid m, d, \mathbf{n}) + \text{Const} \\ &= \sum\_{\tau=1}^{t+1} \sum\_{m=1}^{N\_m} \sum\_{d=1}^{N\_d} \gamma\_\tau(m, d) \cdot \ln p(y\_\tau \mid m, d, \mathbf{n}) + \text{Const} \end{split} \tag{11}$$

where $\delta\_{m\_\tau, m}$ denotes the Kronecker delta function and $\gamma\_\tau(m, d)$ denotes the posterior probability. $\gamma\_\tau(m, d)$ can be estimated according to Bayes' rule as:

$$\begin{split} \gamma\_{\tau}(m,d) &= E\left[\delta\_{m\_{\tau},m}\delta\_{d\_{\tau},d} \mid y\_{1}^{t+1},\mathbf{n}\_{1}^{t}\right] \\ &= \sum\_{M\_{1}^{t+1}}\sum\_{D\_{1}^{t+1}} p\left(M\_{1}^{t+1},D\_{1}^{t+1} \mid y\_{1}^{t+1},\mathbf{n}\_{1}^{t}\right)\delta\_{m\_{\tau},m}\delta\_{d\_{\tau},d} \\ &= p\left(m,d \mid y\_{\tau},\mathbf{n}\_{\tau-1}\right) \\ &= \frac{p\left(y\_{\tau} \mid m,d,\mathbf{n}\_{\tau-1}\right)p\left(m,d \mid \mathbf{n}\_{\tau-1}\right)}{\sum\_{m'=1}^{N\_{m}}\sum\_{d'=1}^{N\_{d}}p\left(y\_{\tau} \mid m',d',\mathbf{n}\_{\tau-1}\right)p\left(m',d' \mid \mathbf{n}\_{\tau-1}\right)} \\ &= \frac{p\left(y\_{\tau} \mid m,d,\mathbf{n}\_{\tau-1}\right)\cdot w\_{m}\cdot w\_{d}}{\sum\_{m'=1}^{N\_{m}}\sum\_{d'=1}^{N\_{d}}p\left(y\_{\tau} \mid m',d',\mathbf{n}\_{\tau-1}\right)\cdot w\_{m'}\cdot w\_{d'}} \end{split} \tag{12}$$


where the likelihood $p\left(y\_{\tau} \mid m, d, \mathbf{n}\_{\tau-1}\right)$ can be approximated using the linearized model as:

$$\begin{aligned} p\left(y\_{\tau} \mid m, d, \mathbf{n}\_{\tau-1}\right) &\sim \mathcal{N}\left[y\_{\tau}; \boldsymbol{\mu}\_{m,d}^{y}\left(\mathbf{n}\_{\tau-1}\right), \boldsymbol{\Sigma}\_{m,d}^{y}\right] \\ \boldsymbol{\mu}\_{m,d}^{y}\left(\mathbf{n}\_{\tau-1}\right) &= \boldsymbol{\mu}\_{0}^{x} + g\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right) + G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\left(\boldsymbol{\mu}\_{m}^{x} - \boldsymbol{\mu}\_{0}^{x}\right) + \left[I - G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right]\left(\mathbf{n}\_{\tau-1} - \mathbf{n}\_{0}\right) \\ \boldsymbol{\Sigma}\_{m,d}^{y} &= \left[I + G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right]\boldsymbol{\Sigma}\_{m}^{x}\left[I + G^{T}\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right]^{T} \end{aligned} \tag{13}$$
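As a rough illustration of the posterior computation, the sketch below evaluates $\gamma\_\tau(m,d)$ for a single frame by Bayes' rule. It assumes one-dimensional features and that the component means $\mu\_{m,d}^{y}(\mathbf{n}\_{\tau-1})$ and variances have already been computed with the linearized model; the names `gaussian_pdf` and `posteriors` and the dictionary layout are hypothetical choices for this example.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a 1-D Gaussian N(y; mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(y, components):
    """Posterior gamma(m, d) of each (clean, noise) component pair for frame y.

    components: dict mapping (m, d) -> (w_m * w_d, mean, var), where the
    mean is assumed to be evaluated at the previous noise estimate.
    """
    joint = {md: w * gaussian_pdf(y, mean, var)
             for md, (w, mean, var) in components.items()}
    total = sum(joint.values())  # denominator of Bayes' rule
    return {md: p / total for md, p in joint.items()}


if __name__ == "__main__":
    comps = {(0, 0): (0.5, 0.0, 1.0), (0, 1): (0.5, 3.0, 1.0)}
    print(posteriors(0.2, comps))
```

In practice the products are usually accumulated in the log domain to avoid underflow; that refinement is omitted here for brevity.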

Also, a forgetting factor $\varepsilon$ is employed to control the effect of the features of the preceding frames:

$$\begin{split} Q\_{t+1}(\mathbf{n}) &= \sum\_{\tau=1}^{t+1} \varepsilon^{t+1-\tau} \sum\_{m=1}^{N\_{m}} \sum\_{d=1}^{N\_{d}} \gamma\_{\tau}(m,d) \cdot \ln p(y\_{\tau} \mid m,d,\mathbf{n}) + \text{Const} \\ \widetilde{Q}\_{t+1}(\mathbf{n}) &= -\sum\_{\tau=1}^{t+1} \varepsilon^{t+1-\tau} \sum\_{m=1}^{N\_{m}} \sum\_{d=1}^{N\_{d}} \gamma\_{\tau}(m,d) \cdot \left[ y\_{\tau} - \boldsymbol{\mu}\_{m,d}^{y} \left( \mathbf{n} \right) \right]^{T} \left( \boldsymbol{\Sigma}\_{m,d}^{y} \right)^{-1} \left[ y\_{\tau} - \boldsymbol{\mu}\_{m,d}^{y} \left( \mathbf{n} \right) \right] \\ &= \varepsilon \widetilde{Q}\_{t}(\mathbf{n}) - R\_{t+1}(\mathbf{n}) \\ R\_{t+1}(\mathbf{n}) &= \sum\_{m=1}^{N\_{m}} \sum\_{d=1}^{N\_{d}} \gamma\_{t+1}(m,d) \cdot \left[ y\_{t+1} - \boldsymbol{\mu}\_{m,d}^{y} \left( \mathbf{n} \right) \right]^{T} \left( \boldsymbol{\Sigma}\_{m,d}^{y} \right)^{-1} \left[ y\_{t+1} - \boldsymbol{\mu}\_{m,d}^{y} \left( \mathbf{n} \right) \right] \end{split} \tag{14}$$

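In the M-step, the iterative stochastic approximation is introduced to derive the solution. Finally, sequential noise estimation is performed as follows:

$$\begin{aligned} \mathbf{n}\_{t+1} &= \mathbf{n}\_{t} + K\_{t+1}^{-1} s\_{t+1} \\ K\_{t+1} &= -\frac{1}{2}\frac{\partial^{2} \widetilde{Q}\_{t+1}}{\partial \mathbf{n}^{2}}\bigg|\_{\mathbf{n}=\mathbf{n}\_{t}} = \varepsilon K\_{t} + \sum\_{m=1}^{N\_{m}} \sum\_{d=1}^{N\_{d}} \gamma\_{t+1}(m,d) \left[I - G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right]^{T} \left(\boldsymbol{\Sigma}\_{m,d}^{y}\right)^{-1} \left[I - G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right] \\ s\_{t+1} &= -\frac{1}{2}\frac{\partial R\_{t+1}}{\partial \mathbf{n}}\bigg|\_{\mathbf{n}=\mathbf{n}\_{t}} = \sum\_{m=1}^{N\_{m}} \sum\_{d=1}^{N\_{d}} \gamma\_{t+1}(m,d) \left[I - G\left(\mathbf{n}\_{0} - \boldsymbol{\mu}\_{0}^{x}\right)\right]^{T} \left(\boldsymbol{\Sigma}\_{m,d}^{y}\right)^{-1} \left[y\_{t+1} - \boldsymbol{\mu}\_{m,d}^{y}\left(\mathbf{n}\_{t}\right)\right] \end{aligned} \tag{15}$$

The prior models are used to search for the most similar noise and clean speech mixture components. Given these two mixture components, the estimate of the posterior probability $\gamma\_{\tau}(m,d)$ becomes more accurate.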
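A recursion of this kind can be sketched for scalar features as follows. This is only an illustration under stated assumptions: the noise estimate is moved by a Newton-like step $K^{-1}s$, where $s$ accumulates the posterior-weighted residuals of the current frame and $K$ is a curvature term discounted by the forgetting factor. The function name `update_noise`, the argument `slope` (standing in for the scalar value of $I - G$) and the dictionary layout are hypothetical.

```python
def update_noise(n_prev, K_prev, y, gammas, means, variances, slope, eps=0.95):
    """One step of a sequential noise update with forgetting factor eps.

    n_prev, K_prev: previous noise estimate and accumulated curvature.
    y: current noisy-speech feature (scalar in this sketch).
    gammas: dict (m, d) -> posterior of the component pair for this frame.
    means: dict (m, d) -> component mean evaluated at n_prev.
    variances: dict (m, d) -> component variance.
    slope: scalar derivative of the component mean w.r.t. the noise
           (assumed identical for all components, as in the linear model).
    Returns the new noise estimate and the updated curvature.
    """
    # Posterior-weighted gradient contribution of the current frame.
    s = sum(g * slope * (y - means[md]) / variances[md]
            for md, g in gammas.items())
    # Discount the old curvature and add the current frame's contribution.
    K = eps * K_prev + sum(g * slope * slope / variances[md]
                           for md, g in gammas.items())
    return n_prev + s / K, K


if __name__ == "__main__":
    gam = {(0, 0): 1.0}
    mu = {(0, 0): 1.0}
    var = {(0, 0): 1.0}
    print(update_noise(0.0, 0.0, 1.5, gam, mu, var, slope=1.0, eps=0.5))
```

A forgetting factor close to one gives a smooth estimate that reacts slowly, while smaller values let the estimate track rapidly changing noise at the cost of higher variance.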

#### **2.3. Environment model adaptation**


Because the prior models are usually not complete enough to represent the universal data, the environment mismatch between the training data and the test data degrades feature enhancement performance. In this section, an environment model adaptation strategy, applied before the test phase, is proposed to deal with this problem. The environment model adaptation procedure contains two parts. The first is model parameter adaptation of the noise prior model $\Phi\_n$ and the noisy speech prior model $\Phi\_y$ in the training phase and the adaptation phase. The second is adaptation of the noise-normalized SVM function $F(y; \Theta^{(e)})$ and the environment model $\Phi\_e$ in the adaptation phase.

#### *2.3.1. Model adaptation on noise and noisy speech prior models*

For noise and noisy speech prior model adaptation, MAP adaptation is applied to the noise prior model $\Phi\_n$ first. The adaptation equations for the noise prior model parameters given $T$ frames of the adaptation noise data $z$, which is estimated using the un-adapted prior models, are defined as:

$$\begin{aligned} \widetilde{w}\_{d} &= \frac{\left(\nu\_{d} - 1\right) + \sum\_{t=1}^{T} s\_{d,t}}{\sum\_{d=1}^{N\_{d}} \left(\nu\_{d} - 1\right) + \sum\_{d=1}^{N\_{d}} \sum\_{t=1}^{T} s\_{d,t}} \\ \widetilde{\mu}\_{d}^{n} &= \frac{\tau\_{d} \rho\_{d} + \sum\_{t=1}^{T} s\_{d,t} \cdot z\_{t}}{\tau\_{d} + \sum\_{t=1}^{T} s\_{d,t}} \\ \widetilde{\Sigma}\_{d}^{n} &= \frac{u\_{d} + \sum\_{t=1}^{T} s\_{d,t} \left(z\_{t} - \widetilde{\mu}\_{d}^{n}\right)\left(z\_{t} - \widetilde{\mu}\_{d}^{n}\right)^{T} + \tau\_{d} \left(\rho\_{d} - \widetilde{\mu}\_{d}^{n}\right)\left(\rho\_{d} - \widetilde{\mu}\_{d}^{n}\right)^{T}}{\left(\alpha\_{d} - p\right) + \sum\_{t=1}^{T} s\_{d,t}} \end{aligned} \tag{16}$$

where $s\_{d,t}$ is the posterior probability of noise mixture component $d$ given $z\_t$, the conjugate prior density of the mixture weights is the Dirichlet distribution with hyper-parameters $\nu\_d$, and the joint conjugate prior density of the mean and variance parameters is the Normal-Wishart distribution with hyper-parameters $\tau\_d$, $\rho\_d$, $\alpha\_d$, and $u\_d$. The two distributions are defined as follows:

$$\begin{split} &g\left(w\_{1},\ldots,w\_{N\_{d}} \mid \nu\_{1},\ldots,\nu\_{N\_{d}}\right) \propto \prod\_{d=1}^{N\_{d}} w\_{d}^{\nu\_{d}-1} \\ &g\left(\mu\_{d}^{n},\Sigma\_{d}^{n} \mid \tau\_{d},\rho\_{d},\alpha\_{d},u\_{d}\right) \propto \left|\Sigma\_{d}^{n}\right|^{(\alpha\_{d}-p)/2} \exp\left[-\frac{\tau\_{d}}{2}\left(\mu\_{d}^{n}-\rho\_{d}\right)^{T} \Sigma\_{d}^{n} \left(\mu\_{d}^{n}-\rho\_{d}\right)\right] \exp\left[-\frac{1}{2}\text{tr}\left(u\_{d}\Sigma\_{d}^{n}\right)\right] \end{split} \tag{17}$$

where $\nu\_d > 0$, $\alpha\_d > p - 1$ and $\tau\_d > 0$. After adaptation of the noise prior model, the noisy speech prior model $\Phi\_y$ is then adapted using the clean speech prior model $\Phi\_x$ and the newly adapted noise prior model $\Phi\_n$ based on Eq. (8).

#### *2.3.2. Model adaptation of noise-normalized SVM (NN-SVM)*

For NN-SVM adaptation, the model parameters $\Phi\_e$ and the mapping function parameters $\Theta^{(e)}$ in $F(y; \Theta^{(e)})$ need to be adapted in the adaptation phase. First, the adaptation of the model parameters $\Phi\_e$ is similar to that of the noise prior model. Second, the adaptation of $\Theta^{(e)} = \{r\_k^{(e)}\}\_{k=1}^{N\_k}$ is an iterative procedure. Because $\{r\_k^{(e)}\}\_{k=1}^{N\_k}$ is not a random variable and does not follow any conjugate prior density, a maximum likelihood (ML)-based adaptation, similar to the correction vector estimation of SPLICE, is employed:

$$\widehat{r}\_{k}^{(e)} = \sum\_{t} p\left(k \mid y\_{t} - \mathbf{n}\_{t}, \Phi\_{e}\right) \left(\widetilde{x}\_{t} - y\_{t}\right) \bigg/ \sum\_{t} p\left(k \mid y\_{t} - \mathbf{n}\_{t}, \Phi\_{e}\right) \tag{18}$$

where the temporary clean speech estimates $\widetilde{x}\_t$ are obtained using the un-adapted noise-normalized stochastic mapping function in Eq. (4).

#### **2.4. Experimental results**

Table 1 shows the experimental results of the proposed approach on the AURORA2 database. The AURORA2 database contains both clean and noisy utterances of the TIDIGITS corpus and is available from ELDA (Evaluations and Language resources Distribution Agency). Results of previous research are included for comparison, and three experiments were conducted under different experimental conditions: no denoising, SPLICE with recursive EM using stereo data (Deng et al. 2003), the proposed approach using manual annotation without adaptation, and the proposed approach using auto-clustered training data without and with adaptation. The overall results show that the proposed approach slightly outperformed the SPLICE-based approach with the recursive EM algorithm, even without stereo training data and manual annotation. Furthermore, under multi-condition training, adaptation yields a 0.11% improvement on Set B (background noise types different from the training data) and a 0.04% improvement on Set C (background noise types and channel characteristics different from the training data), which indicates that the environment model adaptation can slightly reduce the mismatch between the training data and the test data.

| Methods | Training mode | Set A | Set B | Set C | Overall |
|---|---|---|---|---|---|
| No denoising | Multi-condition | 87.82 | 86.27 | 83.78 | 86.39 |
| No denoising | Clean only | 61.34 | 55.75 | 66.14 | 60.06 |
| MCE | Multi-condition | 92.92 | 89.15 | 90.09 | 90.85 |
| MCE | Clean only | 87.82 | 85.34 | 83.77 | 86.02 |
| SPLICE with recursive EM | Multi-condition | 91.49 | 89.16 | 89.62 | 90.18 |
| SPLICE with recursive EM | Clean only | 87.82 | 87.09 | 85.08 | 86.98 |
| Proposed approach (manual tag, no adaptation) | Multi-condition | 91.42 | 89.18 | 89.85 | 90.21 |
| Proposed approach (manual tag, no adaptation) | Clean only | 87.84 | 86.77 | 85.23 | 86.89 |
| Proposed approach (auto-clustering, no adaptation) | Multi-condition | 91.06 | 90.79 | 90.77 | 90.89 |
| Proposed approach (auto-clustering, no adaptation) | Clean only | 87.56 | 87.33 | 86.32 | 87.22 |
| Proposed approach (auto-clustering, with adaptation) | Multi-condition | 91.07 | 90.90 | 90.81 | 90.95 |
| Proposed approach (auto-clustering, with adaptation) | Clean only | 87.55 | 87.44 | 86.38 | 87.27 |

**Table 1.** Experimental results (%) on AURORA2

## **2.5. Conclusions**


In this section, two approaches to cepstral feature enhancement for noisy speech recognition using noise-normalized stochastic vector mapping were presented. The prior models were introduced for precise noise estimation, and environment model adaptation was constructed to reduce the environment mismatch between the training data and the test data. Experimental results demonstrate that the proposed approach can slightly outperform the SPLICE-based approach without stereo data on the AURORA2 database.

## **3. Speech recognition in disfluent environment**

In this section, a novel approach to detecting and correcting edit disfluencies in spontaneous speech is presented. Hypothesis testing using acoustic features is first adopted to detect potential interruption points (IPs) in the input speech. The word order of the utterance is then cleaned up based on the potential IPs using a class-based cleanup language model. The deletable region and the correction are aligned using an alignment model. Finally, a log-linear weighting mechanism is applied to optimize the performance.

## **3.1. Edit disfluency analysis**

In conversational utterances, several problems such as interruption, correction, filled pause, and ungrammatical sentence are detrimental for speech recognition. The definitions of disfluencies have been discussed in SimpleMDE. Edit disfluencies are portions of speech in which a speaker's utterance is not complete and fluent; instead the speaker corrects or alters the utterance, or abandons it entirely and starts over. In general, edit disfluencies can be divided into four categories: repetitions, revisions, restarts and complex disfluencies. Since complex disfluencies consist of multiple or nested edits, it seems reasonable to consider the complex disfluencies as a combination of the other simple disfluencies: repetitions, revisions, and restarts. Edit disfluencies have a complex internal structure, consisting of the deletable region (delreg), interruption point (IP) and correction. Editing terms such as fillers, particles and markers are optional and follow the IP in edit disfluency.


In spontaneous speech, acoustic features such as short pauses (silence and fillers), energy and pitch reset generally appear along with the occurrence of an edit disfluency. Based on these features, we can detect the possible IPs. Furthermore, since IPs generally appear at the boundary of two successive words, we can exclude the unlikely IPs whose positions are within a word. In addition, since the structural patterns of the deletable word sequence and the correction word sequence are very similar, the deletable word sequence in an edit disfluency is replaceable by the correction word sequence.

## **3.2. Framework of edit disfluency transcription system**

The overall transcription task for conversational speech with edit disfluencies in the proposed method is composed of two main mechanisms: the IP detection module and the edit disfluency correction module. The framework is shown in Figure 2. The IP detection module first predicts the potential IPs. The edit disfluency correction module then generates the rich transcription, which contains the interruption information, the text transcription of the speaker's utterances and the cleaned-up text transcription without disfluencies. Figure 3 shows the correction process for edit disfluency.

The speech signal is fed to both the acoustic feature extraction module and the speech recognition engine in the IP detection module. Information about the durations of syllables and silences from speech recognition is provided for acoustic feature extraction. Combined with side information from speech recognition, duration-, pitch-, and energy-related features are extracted and used to model the IPs using a Gaussian mixture model (GMM). In addition, in order to perform hypothesis testing for IP detection, an anti-IP GMM is constructed from the features extracted from the non-IP regions. The hypothesis test checks whether the posterior probability of the acoustic features at a syllable boundary is above a threshold, and thereby determines whether the syllable boundary is an IP. Since an IP is an event that occurs at an inter-word location, detected IPs that do not fall on a word boundary can be removed.
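The decision step described above can be illustrated with a small sketch. It is not the chapter's exact test statistic: as an assumption, it compares the log-likelihoods of a one-dimensional feature (for example, a silence duration in seconds) under an IP GMM and an anti-IP GMM and thresholds their difference, which is one common way to realize such a hypothesis test. All function names and numeric values are hypothetical.

```python
import math


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of a scalar x under a 1-D GMM given as (w, mean, var)."""
    total = sum(w * math.exp(-0.5 * (x - mean) ** 2 / var)
                / math.sqrt(2.0 * math.pi * var)
                for w, mean, var in gmm)
    return math.log(max(total, 1e-300))  # floor guards against log(0)


def is_potential_ip(x, ip_gmm, anti_ip_gmm, threshold=0.0):
    """Accept the boundary as a potential IP if the log-likelihood ratio
    between the IP model and the anti-IP model exceeds the threshold."""
    llr = gmm_log_likelihood(x, ip_gmm) - gmm_log_likelihood(x, anti_ip_gmm)
    return llr > threshold


def keep_word_boundary_ips(candidates, word_boundaries):
    """Drop detected IPs that do not coincide with a word boundary."""
    allowed = set(word_boundaries)
    return [b for b in candidates if b in allowed]


if __name__ == "__main__":
    ip_model = [(1.0, 0.50, 0.04)]    # hypothetical: long pauses around IPs
    anti_model = [(1.0, 0.05, 0.01)]  # hypothetical: short pauses elsewhere
    print(is_potential_ip(0.45, ip_model, anti_model))
    print(keep_word_boundary_ips([1, 3, 4], [1, 4, 6]))
```

Raising the threshold makes the detector more conservative (fewer false alarms, more misses), while lowering it has the opposite effect.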

**Figure 2.** The framework of transcription system for spontaneous speech with edit disfluencies

**Figure 3.** The correction process for the edit disfluency


There are two processing stages in the edit disfluency correction module: cleanup and alignment. As shown in Figure 4, the cleanup process divides the word string into three parts, namely the deletable region (delreg), the editing term, and the correction, according to the locations of the potential IPs detected by the IP detection module. The cleanup process shifts the correction part so that it replaces the deletable region, forming a new cleaned-up transcription. The edit disfluency correction module is composed of an *n*-gram language model and an alignment model. The *n*-gram model regards the cleanup transcriptions as fluent utterances and models their word order information. The alignment model finds the optimal correspondence between the deletable region and the correction in an edit disfluency.
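The cleanup step can be pictured with a toy example. The sketch below assumes that the deletable region and the editing term are already given as index ranges over the recognized word string; in the actual system these are hypothesized from the potential IPs and scored by the language and alignment models, which this example does not attempt to reproduce. The function name `cleanup` and the English sentence are hypothetical.

```python
def cleanup(words, delreg, editing_term):
    """Form a cleaned-up transcription from a disfluent word string.

    words: list of recognized words.
    delreg: (start, end) half-open index range of the deletable region;
            the interruption point lies at index `end`.
    editing_term: (start, end) half-open range of the optional editing
            term, which must begin at the interruption point.
    The correction is the material that follows the editing term; it
    takes the place of the deletable region.
    """
    del_start, del_end = delreg
    term_start, term_end = editing_term
    if term_start != del_end:
        raise ValueError("editing term must start at the interruption point")
    return words[:del_start] + words[term_end:]


if __name__ == "__main__":
    utterance = "I want to go to Taipei uh to Tainan tomorrow".split()
    # delreg = "to Taipei", editing term = "uh", correction = "to Tainan"
    print(" ".join(cleanup(utterance, (4, 6), (6, 7))))
```

When no editing term is present, an empty range such as `(6, 6)` can be passed so that only the deletable region is removed.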


**Figure 4.** The cleanup language model for the edit disfluency

## **3.3. Potential interruption point detection**

For IP detection, instead of detecting exact IP, potential IPs are selected for further processing. Since the IP is the point at which the speaker breaks off the deletable region, some acoustic events will go along with it. For syllabic languages like Chinese, every character is pronounced as a monosyllable, while a word is composed of one to several syllables. The speech input of the syllabic languages with *n* syllables can be described as a sequence,

$$Seq\_{\text{syllable\\_silence}} \equiv syllable\_1, silence\_1, syllable\_2, silence\_2, \ldots, silence\_{n-1}, syllable\_n$$

and then this sequence can be separated into a syllable sequence

$$Seq\_{syllable} \equiv syllable\_1, syllable\_2, \ldots, syllable\_n$$

and a silence sequence

$$Seq\_{silence} \equiv silence\_1, silence\_2, \ldots, silence\_{n-1}.$$

We model the interruption detection problem as choosing between *H*<sub>0</sub>, the hypothesis that the IP is not embedded in the silence, and *H*<sub>1</sub>, the hypothesis that the IP is embedded in the silence. The likelihood ratio test is employed to detect the potential IPs. The function *L*(*Seq*<sub>*syllable\_silence*</sub>) is termed the likelihood ratio since it indicates, for each value of *Seq*<sub>*syllable\_silence*</sub>, the likelihood of *H*<sub>1</sub> versus the likelihood of *H*<sub>0</sub>.


$$L\left(Seq\_{\text{syllable\\_silence}}\right) = \frac{P\left(Seq\_{\text{syllable\\_silence}}; H\_1\right)}{P\left(Seq\_{\text{syllable\\_silence}}; H\_0\right)}\tag{19}$$

By introducing a threshold *θ* to adjust the precision and recall rates, *H*<sub>1</sub>: *L*(*Seq*<sub>*syllable\_silence*</sub>) ≥ *θ* means that the IP is embedded in *silence*<sub>*k*</sub>. Conceptually, *silence*<sub>*k*</sub> is then a potential IP. Under the assumption of independence, the probability of an IP appearing in *silence*<sub>*k*</sub> can be regarded as the product of the probabilities obtained from *silence*<sub>*k*</sub> and from the syllables around it. The probability density functions (PDFs) under each hypothesis are denoted and estimated as

$$\begin{split} P\left(Seq\_{\text{syllable\\_silence}}; H\_1\right) &= P\left(Seq\_{\text{syllable\\_silence}} \mid E\_{ip}\right) \\ &= P\left(Seq\_{silence} \mid E\_{ip}\right) \times P\left(Seq\_{syllable} \mid E\_{ip}\right) \end{split} \tag{20}$$

and


$$\begin{split} P\left(Seq\_{\text{syllable\\_silence}}; H\_0\right) &= P\left(Seq\_{\text{syllable\\_silence}} \mid \neg E\_{ip}\right) \\ &= P\left(Seq\_{silence} \mid \neg E\_{ip}\right) \times P\left(Seq\_{syllable} \mid \neg E\_{ip}\right) \end{split} \tag{21}$$

Where *E*<sub>*ip*</sub> denotes that the IP is embedded in *silence*<sub>*k*</sub> and ¬*E*<sub>*ip*</sub> means that the IP does not appear in *silence*<sub>*k*</sub>, that is,

$$E\_{ip}: \text{Interruption point} \in silence\_k, \qquad \neg E\_{ip}: \text{Interruption point} \notin silence\_k$$

#### *3.3.1. IP detection using posterior probability of silence duration*

Since IPs always appear at inter-syllable positions, the *n*−1 silence positions between *n* syllables are considered as the IP candidates. IP detection thus becomes the problem of verifying whether each of the *n*−1 silence positions is an IP or not. In conversation, speakers may hesitate to find the correct words when disfluency appears, and hesitation is usually realized as a pause. Since the length of silence is very sensitive to disfluency, we use normal distributions to model the posterior probabilities that the IP appears and does not appear in *silence*<sub>*k*</sub>, respectively.

$$P\left(Seq\_{silence} \mid E\_{ip}\right) = \frac{1}{\sqrt{2\pi}\sigma\_{ip}} \exp\left(-\frac{\left(Seq\_{silence} - \mu\_{ip}\right)^2}{2\sigma\_{ip}^2}\right) \tag{22}$$

$$P\left(Seq\_{silence} \mid \neg E\_{ip}\right) = \frac{1}{\sqrt{2\pi}\sigma\_{nip}} \exp\left(-\frac{\left(Seq\_{silence} - \mu\_{nip}\right)^2}{2\sigma\_{nip}^2}\right) \tag{23}$$

Where *μ*<sub>*ip*</sub>, *μ*<sub>*nip*</sub>, *σ*<sup>2</sup><sub>*ip*</sub> and *σ*<sup>2</sup><sub>*nip*</sub> denote the means and variances of the silence duration containing and not containing the IP, respectively.
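As a minimal sketch of how Eqs. (22) and (23) feed the likelihood ratio test, the following Python fragment evaluates only the silence-duration term of the ratio. The means, standard deviations, threshold value and function names are illustrative assumptions, not values estimated from the corpus; the full test would also multiply in the syllable-feature term.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density used for the silence duration model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

def silence_likelihood_ratio(duration, ip_params, nip_params):
    """Ratio P(duration | E_ip) / P(duration | not E_ip)."""
    return gaussian_pdf(duration, *ip_params) / gaussian_pdf(duration, *nip_params)

def is_potential_ip(duration, ip_params, nip_params, threshold=1.0):
    """Accept H1 when the ratio reaches the threshold."""
    return silence_likelihood_ratio(duration, ip_params, nip_params) >= threshold

# Hypothetical (mean, std) pairs in seconds; not values from the chapter.
IP_PARAMS = (0.40, 0.15)
NIP_PARAMS = (0.10, 0.05)

silences = [0.05, 0.12, 0.45]
flags = [is_potential_ip(d, IP_PARAMS, NIP_PARAMS) for d in silences]
```

Lowering the threshold in this sketch accepts more silences as potential IPs, which trades precision for recall.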

#### *3.3.2. Syllable-based acoustic features extraction*

Acoustic features including duration, pitch, and energy for each syllable (Soltau et al. 2005) are adopted for IP detection. A feature vector of the syllables within an observation window around the silence is formed as the input of the GMM. That is, we are interested in the syllables around the silence that may appear as an IP. A window of 2*w* syllables, with *w* syllables before and *w* syllables after *silence*<sub>*k*</sub>, is used. First, the syllable subscripts are translated according to the position of the silence, *Syl*<sub>*n*+*k*</sub> ⇒ *Syl*<sub>*n*</sub>, so that positions are counted relative to *silence*<sub>*k*</sub>. We then extract the features of the syllables within the observation window.

Since the durations of syllables are not the same even for the same syllable, the duration ratio is defined as the average duration of the syllable normalized by the average duration over all syllables.

$$nf\_{duration\_i} = \frac{\dfrac{1}{n\_i}\sum\_{j=1}^{n\_i} duration\left(syllable\_{i,j}\right)}{\dfrac{1}{\left|syllable\right|}\sum\_{i'=1}^{\left|syllable\right|} \dfrac{1}{n\_{i'}}\sum\_{j=1}^{n\_{i'}} duration\left(syllable\_{i',j}\right)} \tag{24}$$


Where *syllable*<sub>*i,j*</sub> means the *j*-th sample of syllable *i* in the corpus, |*syllable*| is the number of syllables, and *n*<sub>*i*</sub> is the number of occurrences of syllable *i* in the corpus. Similarly, for energy and pitch, frame-based statistics are used to calculate the normalized features for each syllable.
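The duration ratio can be illustrated with a short Python sketch. It assumes that the denominator of Eq. (24) is the mean of the per-syllable average durations; the function name, the syllable labels and the durations are hypothetical.

```python
from collections import defaultdict

def duration_ratios(samples):
    """samples: iterable of (syllable_label, duration_in_seconds)."""
    by_syllable = defaultdict(list)
    for label, duration in samples:
        by_syllable[label].append(duration)
    # Average duration of each syllable (numerator of the ratio).
    means = {label: sum(ds) / len(ds) for label, ds in by_syllable.items()}
    # Average of the per-syllable means over all syllables (denominator).
    overall = sum(means.values()) / len(means)
    return {label: m / overall for label, m in means.items()}

# Made-up corpus with two syllables and two samples each.
corpus = [("ma", 0.20), ("ma", 0.30), ("de", 0.10), ("de", 0.20)]
ratios = duration_ratios(corpus)
```

A ratio above 1 marks a syllable that is, on average, longer than the typical syllable in the corpus.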

Based on the result of speech recognition, the normalized features serve as the first-order features. To model the speaking rate and the variation in energy and pitch during the utterance, second-order features called delta-duration, delta-energy and delta-pitch are obtained from the forward difference of the first-order features. The following equation shows the estimation of delta-duration, which also applies to delta-energy and delta-pitch.

$$
\Delta nf\_{duration\_i} = \begin{cases}
nf\_{duration\_{i+1}} - nf\_{duration\_i} & \text{if } -w < i < w \\
0 & \text{otherwise}
\end{cases}
\tag{25}
$$

Where *w* is half of the observation window size. In total, there are three kinds of features (duration, pitch and energy), each in two orders. We combine these features to form a vector with 24*w*−6 features as the observation vector of the GMM. The acoustic features are denoted as the syllable-based observation sequence that corresponds to the potential IP, *silence*<sub>*k*</sub>, by

$$O = \left[O\_{D}, O\_{P}, O\_{E}\right] \in R^{dim} \tag{26}$$

Where *O*<sub>*s*</sub> ∈ *R*<sup>dim<sub>*s*</sub></sup>, *s* ∈ {*D*, *P*, *E*}, represents the feature vector of a single kind and *dim* means the dimension of the feature vector consisting of duration-related, pitch-related and energy-related features. The following equation shows the estimation for duration-related features.

$$O\_{D} = \begin{bmatrix} nf\_{duration\_{-w+1}}, \ldots, nf\_{duration\_{-1}}, nf\_{duration\_{0}}, nf\_{duration\_{+1}}, nf\_{duration\_{+2}}, \ldots, nf\_{duration\_{+w}}, \\ \Delta nf\_{duration\_{-w+1}}, \ldots, \Delta nf\_{duration\_{-1}}, \Delta nf\_{duration\_{0}}, \Delta nf\_{duration\_{+1}}, \Delta nf\_{duration\_{+2}}, \ldots, \Delta nf\_{duration\_{+w-1}} \end{bmatrix}^{T} \tag{27}$$
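To make the layout of the duration-related vector concrete, here is a small Python sketch that appends the forward differences of Eq. (25) to the 2*w* first-order values, as in Eq. (27). The function names and the sample values are assumptions made for illustration only.

```python
def forward_difference(values):
    """Delta features: values[i + 1] - values[i] for every position but the last."""
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]

def duration_observation(nf_window):
    """nf_window holds the 2w normalized durations at positions -w+1 .. +w."""
    if len(nf_window) % 2 != 0:
        raise ValueError("expected 2w values")
    # First-order features followed by their 2w - 1 forward differences.
    return list(nf_window) + forward_difference(nf_window)

w = 2
nf_window = [1.0, 1.2, 0.8, 0.9]   # positions -1, 0, +1, +2 (hypothetical values)
o_d = duration_observation(nf_window)
```

In this sketch the duration part has 4*w*−1 entries: 2*w* first-order values plus 2*w*−1 deltas.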

#### *3.3.3. Gaussian mixture model for interruption point detection*


The GMM is adopted for IP detection using the acoustic features.

$$P\left(Seq\_{syllable} \mid C\_j\right) \equiv P\left(O\_t \mid \lambda\_j\right) = \sum\_{i=1}^{W} \alpha\_i \mathcal{N}\left(O\_t; \mu\_i, \Sigma\_i\right) \tag{28}$$

Where *C*<sub>*j*</sub> ∈ {*E*<sub>*ip*</sub>, ¬*E*<sub>*ip*</sub>} is the hypothesis set for *silence*<sub>*k*</sub> containing and not containing the IP, *λ*<sub>*j*</sub> is the GMM for class *C*<sub>*j*</sub>, and *α*<sub>*i*</sub> is a mixture weight which must satisfy the constraint Σ<sub>*i*=1</sub><sup>*W*</sup> *α*<sub>*i*</sub> = 1, where *W* is the number of mixture components and 𝒩(·) is the Gaussian density function:

$$\mathcal{N}\left(O\_t; \mu\_i, \Sigma\_i\right) = \frac{1}{\left(2\pi\right)^{\text{dim}/2} \left|\Sigma\_i\right|^{1/2}} \exp\left(-\frac{1}{2} \left(O\_t - \mu\_i\right)^T \Sigma\_i^{-1} \left(O\_t - \mu\_i\right)\right) \tag{29}$$

where *μ*<sub>*i*</sub> and *Σ*<sub>*i*</sub> are the mean vector and covariance matrix of the *i*-th component, and *O*<sub>*t*</sub> denotes the *t*-th observation in the training corpus. The parameters *λ* = {*α*<sub>*i*</sub>, *μ*<sub>*i*</sub>, *Σ*<sub>*i*</sub>}, *i* = 1, ..., *W*, can be estimated iteratively using the EM algorithm for mixture *i*

$$
\hat{\alpha}\_i = \frac{1}{N} \sum\_{t=1}^N P\left(i \mid O\_t, \lambda\right) \tag{30}
$$

$$\hat{\mu}\_{i} = \frac{\sum\_{t=1}^{N} P\left(i \mid O\_{t}, \lambda\right) O\_{t}}{\sum\_{t=1}^{N} P\left(i \mid O\_{t}, \lambda\right)} \tag{31}$$

$$\hat{\Sigma}\_{i} = \frac{\sum\_{t=1}^{N} P\left(i \mid O\_{t}, \lambda\right) \left(O\_{t} - \hat{\mu}\_{i}\right) \left(O\_{t} - \hat{\mu}\_{i}\right)^{T}}{\sum\_{t=1}^{N} P\left(i \mid O\_{t}, \lambda\right)} \tag{32}$$

Where the posterior probability of mixture component *i* is

$$P\left(i \mid O\_t, \lambda\right) = \frac{\alpha\_i \mathcal{N}\left(O\_t; \mu\_i, \Sigma\_i\right)}{\sum\_{j=1}^{W} \alpha\_j \mathcal{N}\left(O\_t; \mu\_j, \Sigma\_j\right)}$$

and *N* denotes the total number of feature observations.
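The re-estimation formulas in Eqs. (30) to (32) can be sketched in plain Python for the scalar case, where each covariance matrix reduces to a variance. This is only an illustration under that simplifying assumption: the data are synthetic, and the initial values, the variance floor and the function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(obs, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a 1-D GMM (scalar case of the update equations)."""
    n_mix = len(weights)
    # E-step: posterior probability of each component for each observation.
    posteriors = []
    for x in obs:
        joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(n_mix)]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    # M-step: re-estimate weights, means and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for i in range(n_mix):
        occ = sum(post[i] for post in posteriors)
        mean = sum(post[i] * x for post, x in zip(posteriors, obs)) / occ
        var = sum(post[i] * (x - mean) ** 2 for post, x in zip(posteriors, obs)) / occ
        new_w.append(occ / len(obs))
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # floor avoids a collapsed component
    return new_w, new_m, new_v

# Synthetic data drawn from two well-separated clusters.
random.seed(0)
obs = [random.gauss(-2.0, 0.5) for _ in range(200)]
obs += [random.gauss(2.0, 0.5) for _ in range(200)]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    weights, means, variances = em_step(obs, weights, means, variances)
```

In the IP detection setting, one such model would be trained on IP observations and a second one on non-IP observations, giving the GMM and anti-IP GMM used in the hypothesis test.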


#### *3.3.4. Potential interruption point extraction*

Based on the assumption that an IP generally appears at the boundary of two successive words, we can remove the detected IPs that do not fall on a word boundary. After the removal of unlikely IPs, the remaining IPs are kept for further processing. Since the word graph or word lattice is obtained from the speech recognition module, every path in the word graph or word lattice forms its own potential IP set for an input utterance.
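The word-boundary constraint can be sketched as a simple filter over one lattice path. The representation used here (silence indices and per-word syllable counts) and the function name are assumptions made for illustration, not the data structures of the actual system.

```python
def keep_word_boundary_ips(candidate_ips, word_lengths):
    """Keep only candidate IPs that coincide with a word boundary.

    candidate_ips: indices k of silences, where silence k follows syllable k (1-based).
    word_lengths: number of syllables of each word along one lattice path.
    """
    boundaries = set()
    position = 0
    for length in word_lengths[:-1]:   # the end of the utterance is not a candidate
        position += length
        boundaries.add(position)
    return [k for k in candidate_ips if k in boundaries]

# Hypothetical path with words of 2, 1 and 3 syllables:
# word boundaries lie after syllables 2 and 3.
kept = keep_word_boundary_ips([1, 2, 3, 5], [2, 1, 3])
```

Because each path segments the syllables into words differently, the same candidate list can yield a different potential IP set for each path.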

## **3.4. Linguistic processing for edit disfluency correction**

In the previous section, potential IPs were detected from the acoustic features. However, correcting edit disfluency using linguistic features is, in fact, one of the keys to rich transcription. In this section, the edit disfluency is detected by maximizing the likelihood of the language model for the cleaned-up utterances and of the word correspondence between the deletable region and the correction, given the position of the IP. Consider the word sequence *W\** in the word lattice generated by the speech recognition engine. We can model the word string *W\** using a log-linear mixture model in which both the language model and the alignment model are included.

$$\begin{split} W^\* &= \operatorname\*{arg\,max}\_{W,IP} P\left(W; IP\right) \\ &= \operatorname\*{arg\,max}\_{W,IP} P\left(w\_1, w\_2, \ldots, w\_t, w\_{t+1}, \ldots, w\_n, w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N; IP\right) \\ &= \operatorname\*{arg\,max}\_{W,n,t} \left( P\left(w\_1, w\_2, \ldots, w\_t, w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N\right)^{\alpha} \right. \\ &\qquad \left. \times P\left(w\_{t+1}, \ldots, w\_n \mid w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N\right)^{(1-\alpha)} \right) \\ &= \operatorname\*{arg\,max}\_{W,n,t} \left( \alpha \log P\left(w\_1, w\_2, \ldots, w\_t, w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N\right) \right. \\ &\qquad \left. + (1-\alpha) \log P\left(w\_{t+1}, \ldots, w\_n \mid w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N\right) \right) \end{split} \tag{33}$$

where *α* and 1−*α* are the combination weights for the cleanup language model and the alignment model, respectively. *IP* means the interruption point obtained from the IP detection module and *n* is the position of the potential IP.
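The last line of Eq. (33) is a weighted sum of two log-probabilities, which can be sketched as follows. The candidate labels, the probabilities and the function names are made up for illustration; in the real system the two scores would come from the cleanup language model and the alignment model.

```python
import math

def combined_score(lm_logprob, align_logprob, alpha):
    """Weighted sum: alpha * log P_LM + (1 - alpha) * log P_align."""
    return alpha * lm_logprob + (1.0 - alpha) * align_logprob

def best_hypothesis(hypotheses, alpha=0.5):
    """hypotheses: list of (label, lm_logprob, align_logprob) tuples."""
    return max(hypotheses, key=lambda h: combined_score(h[1], h[2], alpha))

# Hypothetical candidates for (n, t) with made-up probabilities.
candidates = [
    ("n=3,t=1", math.log(0.020), math.log(0.30)),
    ("n=4,t=2", math.log(0.015), math.log(0.60)),
    ("n=5,t=2", math.log(0.030), math.log(0.05)),
]
best = best_hypothesis(candidates, alpha=0.5)
lm_only = best_hypothesis(candidates, alpha=1.0)
```

With the weight set to 1 the alignment model is ignored and the choice falls back to the language model alone, which shows how the weight balances the two knowledge sources.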

#### *3.4.1. Language model of cleanup utterance*

In the past, statistical language models have been applied to speech recognition and have achieved significant improvement in the recognition results. However, probability estimation of word sequences can be expensive and always suffers from the problem of data sparseness. In practice, the statistical language model is often approximated by the class-based *n*-gram model with modified Kneser-Ney discounting probabilities for further smoothing.


$$\begin{aligned} &P\left(w\_1, w\_2, \ldots, w\_t, w\_{n+1}, \ldots, w\_{2n-t}, w\_{2n-t+1}, \ldots, w\_N\right) \\ &= \prod\_{i=1}^{t} P\left(w\_i \mid Class\left(w\_1^{i-1}\right)\right) P\left(w\_{n+1} \mid Class\left(w\_1^{t}\right)\right) \prod\_{j=n+2}^{N} P\left(w\_j \mid Class\left(w\_1^{t} w\_{n+1}^{j-1}\right)\right) \end{aligned} \tag{34}$$

Where *Class* means the conversion function that translates a word sequence into a word class sequence. In this section, we employ two word classes: semantic class and part-of-speech (POS) class. A semantic class, such as the synsets in WordNet (http://wordnet.princeton.edu/) or the concepts in the UMLS (http://www.nlm.nih.gov/research/umls/), contains the words that share a semantic property based on semantic relations, such as hyponym and hypernym. POS classes, also called syntactic or grammatical categories, are defined by the role that a word plays in a sentence, such as noun, verb, or adjective.

The other essential issue of the *n*-gram model for correcting edit disfluency is the order of the Markov model. The IP is the point at which the speaker breaks off the deletable region, and the correction consists of the portion of the utterance that has been repaired by the speaker and can be considered fluent. Removing part of the word string produces a shorter string, and shorter word strings obtain higher probabilities, so short word strings are favored. To deal with this problem, we can increase the order to constrain the perplexity and normalize the word length by aligning the deletable region and the correction.

#### *3.4.2. Alignment model between the deletable region and the correction*

In conversational speech, the structural pattern of a deletable region is usually similar to that of the correction. Sometimes, the deletable region appears as a substring of the correction. Accordingly, we can find the structural pattern in the starting point of the correction which generally follows the IP. Then, we can take the potential IP as the center and align the word string before and after it. Since the correction is used for replacing the deletable region and ending the utterance, there exists a correspondence between the words in the deletable region and the correction. We may, therefore, model the alignment assuming the conditional probability of the correction given the possible deletable region. According to this observation, class-based alignment is proposed to clean up edit disfluency. The alignment model can be described as

$$\begin{aligned} &P\left(w\_{n+1}, \ldots w\_{2n-t}, w\_{2n-t+1}, \ldots w\_N \mid w\_{t+1}, \ldots w\_n\right) \\ &= \prod\_{k=t+1}^n \left(P\left(f\_k \mid \text{Class}\left(w\_k\right)\right) \prod\_{l=1}^{f\_k} P\left(\text{Class}\left(w\_l\right) \mid \text{Class}\left(w\_k\right)\right)\right) \prod\_{k,l,m} P\left(l \mid k, m\right) \end{aligned} \tag{35}$$

where the fertility *f*<sub>*k*</sub> means the number of words in the correction corresponding to the word *w*<sub>*k*</sub> in the deletable region. *k* and *l* are the positions of the words *w*<sub>*k*</sub> and *w*<sub>*l*</sub> in the deletable region and the correction, respectively. *m* denotes the number of words in the deletable region. The alignment model for cleanup contains three parts: fertility probability, translation or corresponding probability, and distortion probability. The fertility probability of word *w*<sub>*k*</sub> is defined as

$$P\left(f\_k \mid Class\left(w\_k\right)\right) = \frac{\sum\_{w\_i \in Class(w\_k)} \delta\left(f\_i = f\_k\right)}{\sum\_{p=0}^{N} \sum\_{w\_j \in Class(w\_k)} \delta\left(f\_j = p\right)}\tag{36}$$


where *δ* is an indicator function and *N* means the maximum value of fertility. The translation or corresponding probability is measured according to (Wu et al. 1994).

$$P\left(\text{Class}\left(w\_{l}\right) \mid \text{Class}\left(w\_{k}\right)\right) = \frac{2 \times Depth\left(LCS\left(\text{Class}\left(w\_{l}\right), \text{Class}\left(w\_{k}\right)\right)\right)}{Depth\left(\text{Class}\left(w\_{l}\right)\right) + Depth\left(\text{Class}\left(w\_{k}\right)\right)}\tag{37}$$

where *Depth* denotes the depth of a word class and *LCS* denotes the lowest common subsumer of the two word classes. The distortion probability $P(l \mid k, m)$ is the mapping probability of the word sequence between the deletable region and the correction.
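The class similarity in Eq. (37) can be illustrated with a small sketch. The hierarchy below is a hypothetical toy taxonomy, not taken from Hownet, and the convention that the root has depth 1 is an assumption; the function names are illustrative only.

```python
# Toy IS-A hierarchy (hypothetical, not taken from Hownet): child -> parent.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(cls):
    """Return [cls, parent, ..., root]."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def depth(cls):
    """Depth of a class; the root has depth 1 (an assumed convention)."""
    return len(path_to_root(cls))


def lowest_common_subsumer(a, b):
    """Deepest class that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for cls in path_to_root(b):  # walks upward from b, so the first hit is the deepest
        if cls in ancestors_of_a:
            return cls
    raise ValueError("classes share no common ancestor")


def class_similarity(a, b):
    """Eq. (37): 2 * Depth(LCS(a, b)) / (Depth(a) + Depth(b))."""
    return 2 * depth(lowest_common_subsumer(a, b)) / (depth(a) + depth(b))


print(class_similarity("dog", "cat"))
print(class_similarity("dog", "tree"))
```

In this toy hierarchy, two classes that share a deep common ancestor ("dog" and "cat") score higher than two classes that only meet at the root ("dog" and "tree").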

## **3.5. Experimental results and discussion**

To evaluate the performance of the proposed approach, a transcription system for spontaneous Mandarin speech with edit disfluencies was developed. A syllable recognizer was constructed with the Hidden Markov Model Toolkit (HTK), using 8 states per syllable (3 states for the Mandarin initial and 5 states for the final).

## *3.5.1. Experimental data*

The Mandarin Conversational Dialogue Corpus (MCDC), collected from 2000 to 2001 at the Institute of Linguistics of Academia Sinica, Taiwan, consists of 30 digitized conversational dialogues with a total length of 27 hours. Sixty subjects were randomly chosen from daily life in the Taiwan area. The corpus was annotated according to (Yeh et al. 2006), which gives concise explanations and detailed operational definitions of each tag in Mandarin. Corresponding to SimpleMDE, direct repetitions, partial repetitions, overt repairs and abandoned utterances are taken as edit disfluencies in MCDC. The dialogues numbered 01, 02, 03 and 05 are used as the test corpus. For training the parameters of the speech recognizer, the MAT Speech Database, TCC-300 and MCDC were employed.

### *3.5.2. Potential interruption point detection*

From observation of the MCDC, the probability density functions (pdfs) of silence duration are obtained for silences with and without IPs. On average, silences containing an IP are longer than those without one. Based on this result, the posterior probability given the silence duration can be estimated using a GMM for IP detection. For hypothesis testing, an anti-IP GMM is also constructed.
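As a rough illustration of testing an IP model against an anti-IP model, the sketch below computes a log-likelihood ratio over silence duration alone. All mixture parameters are hypothetical placeholders (real values would be trained on MCDC), only two components per model are used for brevity, and the use of a log-likelihood ratio with a threshold of 0 is an assumption rather than the exact test used in the experiments.

```python
import math


def gmm_pdf(x, components):
    """Density of a 1-D Gaussian mixture; components = [(weight, mean, std), ...]."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in components
    )


# Hypothetical parameters over silence duration in seconds (not from the chapter).
IP_GMM = [(0.6, 0.60, 0.20), (0.4, 1.00, 0.30)]
ANTI_IP_GMM = [(0.7, 0.20, 0.10), (0.3, 0.40, 0.15)]


def log_likelihood_ratio(duration):
    """log p(duration | IP) - log p(duration | anti-IP)."""
    return math.log(gmm_pdf(duration, IP_GMM)) - math.log(gmm_pdf(duration, ANTI_IP_GMM))


def is_potential_ip(duration, threshold=0.0):
    """Flag a silence as a potential IP when the ratio exceeds the threshold."""
    return log_likelihood_ratio(duration) > threshold


print(is_potential_ip(0.9), is_potential_ip(0.15))
```

Lowering the threshold flags more silences as potential IPs, which trades additional false alarms for fewer misses.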

Since IP detection can be regarded as a position determination problem, an observation window spanning several syllables is adopted. Within this window, the pitch and energy of the syllables just before an IP are usually higher than those after the IP. This means that pitch reset and energy reset co-occur with the IP in an edit disfluency, generally in the syllables of the first word just after the IP. The pitch reset is very obvious when the disfluency type is repair. Energy plays a similar role when an edit disfluency appears, but its effect is less obvious than that of pitch. The filler words or phrases after the IP are lengthened, which gives the speaker time to construct the correction and draws the listener's attention. This factor can significantly improve the IP detection rate.

The hypothesis test, combined with a four-component GMM over the syllable features, determines whether a silence contains an IP. The threshold parameter of the test should be tuned to achieve a better result. The overall IP error rate defined in RT'04F is simply the average number of missed IP detections and falsely detected IPs per reference IP:

$$Error\_{IP} = \frac{n\_{M-IP} + n\_{FA-IP}}{n\_{IP}} \tag{38}$$

where $n\_{M-IP}$ and $n\_{FA-IP}$ denote the numbers of missed and false alarm IPs, respectively, and $n\_{IP}$ is the number of reference IPs. The threshold can be adjusted to trade off $n\_{M-IP}$ against $n\_{FA-IP}$.

Since the goal of the IP detection module is to detect potential IPs, a false alarm is a less serious problem than a miss. That is, we want a high recall rate without much increase in the false alarm rate. Accordingly, the threshold was set to 0.25. Since an IP always appears at a word boundary, this constraint can be used to remove unlikely IPs.
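The error rate of Eq. (38) and the effect of the threshold can be sketched as follows. The detector scores and reference labels are made-up toy values, and the rule "flag a silence when its score exceeds the threshold" is an assumed decision rule used only for illustration.

```python
def ip_error_rate(n_missed, n_false_alarm, n_reference):
    """Eq. (38): (missed IPs + false alarm IPs) / reference IPs."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (n_missed + n_false_alarm) / n_reference


def count_errors(scores, labels, threshold):
    """Count misses and false alarms when scores above the threshold are flagged as IPs."""
    missed = sum(1 for s, y in zip(scores, labels) if y and s <= threshold)
    false_alarm = sum(1 for s, y in zip(scores, labels) if not y and s > threshold)
    return missed, false_alarm


# Hypothetical detector scores and reference labels (True = silence contains an IP).
scores = [0.9, 0.6, 0.3, 0.2, 0.7, 0.1]
labels = [True, True, True, False, False, False]

for threshold in (0.25, 0.5):
    missed, false_alarm = count_errors(scores, labels, threshold)
    print(threshold, missed, false_alarm, ip_error_rate(missed, false_alarm, sum(labels)))
```

With these toy numbers, lowering the threshold from 0.5 to 0.25 removes the single miss while the single false alarm remains, which mirrors the preference for recall described above.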

## *3.5.3. Clean-up disfluency using linguistic information*

To evaluate the edit disfluency correction model, two types of transcription were used: human-generated transcription (REF) and speech-to-text recognition output (STT). The reference transcriptions provide the best case for evaluating the edit disfluency correction module because they contain no word errors. For a practical assessment, the syllable lattice from speech recognition is fed to the edit disfluency correction module.

For the class-based approach, part of speech (POS) and semantic class are employed as the word class. Herein, the semantic class is obtained from Hownet (http://www.keenage.com/), which defines the "IS-A" relation as the primary feature. There are 26 POS classes and 30 semantic classes. In this way, words can be categorized according to their hypernyms or concepts, and every word maps to its own semantic class.

The edit word detection (EWD) task is to detect the regions of the input speech that contain the words in the deletable regions. One of the primary metrics for edit disfluency correction is the edit word detection error defined in RT'04F (Chen et al. 2002), which is similar to the metric for IP detection shown in Eq. (38).

Owing to the lack of structural information, the unigram does not yield any improvement. The bigram provides a more significant improvement when combined with POS class-based alignment than with semantic class-based alignment. The 3-gram with semantic class-based alignment outperforms the other combinations, because the stricter constraints of the 3-gram reduce the false alarm rate for edit word detection. We also tried a 4-gram in the hope of gaining more than with the 3-gram, but the improvement was slight and, given the extra computation, less conspicuous than expected. Besides, the 4-gram statistics are too sparse compared with the 3-gram model. The best combination in the edit disfluency correction module is therefore the 3-gram with semantic class.

According to the results shown in Table 2, the probabilities of the *n*-gram model are much smaller than those of the alignment model. Since the alignment can be regarded as a penalty for edit words, the effects of the 3-gram and the semantic class-based alignment should be balanced using a log-linear combination weight. To optimize performance, this weight is estimated empirically by minimizing the edit word errors.
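A log-linear combination of this kind can be sketched as below. Which model receives the weight and which receives its complement is an assumption, as are the function names and the grid of candidate weights; `error_fn` stands for any routine that returns the edit word error on held-out data for a given weight.

```python
import math


def combined_score(p_ngram, p_align, weight=0.5):
    """Log-linear interpolation: weight * log(P_ngram) + (1 - weight) * log(P_align)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * math.log(p_ngram) + (1.0 - weight) * math.log(p_align)


def pick_weight(candidates, error_fn):
    """Grid search: return the candidate weight with the lowest edit word error."""
    return min(candidates, key=error_fn)


# Equal weights, as in the setting reported for Table 2; the probabilities are toy values.
print(combined_score(0.01, 0.1, weight=0.5))
```

Working in the log domain keeps the much smaller *n*-gram probabilities from being swamped by the alignment term, since the weight rescales their relative contributions.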


| Model | REF $n\_{M-EWD}/n\_{EWD}$ | REF $n\_{FA-EWD}/n\_{EWD}$ | REF $Error\_{EWD}$ | STT $n\_{M-EWD}/n\_{EWD}$ | STT $n\_{FA-EWD}/n\_{EWD}$ | STT $Error\_{EWD}$ |
|---|---|---|---|---|---|---|
| **1-gram+alignment1** | 0.15 | 0.17 | 0.32 | 0.58 | 0.65 | 1.23 |
| **1-gram+alignment2** | 0.23 | 0.12 | 0.35 | 0.62 | 0.42 | 1.04 |
| **2-gram+alignment1** | 0.09 | 0.15 | 0.24 | 0.46 | 0.43 | 0.87 |
| **2-gram+alignment2** | 0.10 | 0.11 | 0.21 | 0.38 | 0.36 | 0.74 |
| **3-gram+alignment1** | 0.12 | 0.04 | 0.16 | 0.39 | 0.23 | 0.62 |
| **3-gram+alignment2** | **0.11** | **0.04** | **0.15** | 0.36 | 0.24 | 0.60 |

REF: human generated transcription. STT: speech-to-text recognition output. alignment1: word class based on the part of speech (POS). alignment2: word class based on the semantic class.

**Table 2.** Results (%) of the linguistic module with equal weights (0.5 each) for edit word detection under the REF and STT conditions

## **3.6. Conclusion and future work**

This investigation has proposed an approach to edit disfluency detection and correction for rich transcription. The proposed approach, based on a two-stage process, aims to model the behavior of edit disfluency and to clean up the disfluency. An IP detection module using hypothesis testing on the acoustic features is employed to detect the potential IPs. A word-based linguistic module, consisting of a cleanup language model and an alignment model, is used to verify the position of the IP and thereby correct the edit disfluency. Experimental results indicate that the IP detection mechanism is able to recall IPs by adjusting the threshold in hypothesis testing. In an investigation of the linguistic properties of edit disfluency, the linguistic module was explored for correcting disfluency based on the potential IPs. The experimental results indicate that a significant improvement in performance was achieved. In the future, this framework will be extended to deal with the problems resulting from subword units, in order to improve the performance of the rich transcription system.

## **4. Speech recognition in multilingual environment**

This section presents an approach to generating phonetic units for mixed-language or multilingual speech recognition. Acoustic and contextual analysis is performed to characterize multilingual phonetic units for phone set creation. Acoustic likelihood is utilized for similarity estimation of phone models. The hyperspace analog to language (HAL) model is adopted for contextual modeling and contextual similarity estimation. A confusion matrix combining acoustic and contextual similarities between every two phonetic units is built for phonetic unit clustering. Multidimensional scaling (MDS) method is applied to the confusion matrix for reducing dimensionality.

## **4.1. Introduction**

In multilingual speech recognition, it is very important to determine a global phone inventory for different languages. When an authentic multilingual phone set is defined, the acoustic models and pronunciation lexicon can be constructed (Chen et al. 2002). The simplest approach to phone set definition is to combine the phone inventories of different languages together without sharing the units across the languages. The second one is to map language-dependent phones to the global inventory of the multilingual phonetic association based on phonetic knowledge to construct the multilingual phone inventory. Several global phone-based phonetic representations such as International Phonetic Alphabet (IPA) (Mathews 1979), Speech Assessment Methods Phonetic Alphabet (Wells 1989) and Worldbet (Hieronymus 1993) are generally used. The third one is to merge the language-dependent phone models using a hierarchical phone clustering algorithm to obtain a compact multilingual inventory. In this approach, the distance measure between acoustic models, such as Bhattacharyya distance (Mak et al. 1996) and Kullback-Leibler (KL) divergence (Goldberger et al. 2005), is employed to perform the bottom-up clustering. Finally, the multilingual phone models are generated with the use of a phonetic top-down clustering procedure (Young et al. 1994).

## **4.2. Multilingual phone set definition**

From the viewpoint of multilingual speech recognition, a phonetic representation is functionally defined by the mapping of the fundamental phonetic units of languages to describe the corresponding pronunciation. In this section, an IPA-based multilingual phone definition is adopted because it is suitable and consistent for phonetic representation. Using the phonetic representation of the IPA, the number of recognition units can be effectively reduced for multilingual speech recognition. To account for co-articulated pronunciation, context-dependent triphones are adopted in the expansion of IPA-based phonetic units.

In multilingual speech recognition, misrecognition generally results from incorrect pronunciation or a confusable phonetic set. For example, in Mandarin speech, "ei\_M" and "zh\_M" are usually pronounced as "en\_M" and "z\_M", respectively. In this section, statistical methods are proposed to deal with the misrecognition caused by confusing characteristics between phonetic units in multilingual speech recognition. Based on the analysis of these characteristics, confusing phones that are due in part to the confusable phonetic representation are redefined to alleviate the misrecognition problem.

## *4.2.1. Acoustic likelihood*

To estimate the confusion between two phone models, the posterior probabilities obtained from the phone-based hidden Markov models (HMMs) are employed. Given two phone models, $\omega\_k$ and $\omega\_l$, trained with the corresponding training data $x\_i^k$, $1 \le i \le I$, and $x\_j^l$, $1 \le j \le J$, the symmetric acoustic likelihood (ACL) between the two phone models is estimated as follows.

$$a\_{k,l} = \frac{\sum\_{i=1}^{I} P(x\_i^k \mid \omega\_l) + \sum\_{j=1}^{J} P(x\_j^l \mid \omega\_k)}{I + J} \tag{39}$$

where *I* and *J* represent the numbers of training samples for phone models $\omega\_k$ and $\omega\_l$, respectively. The acoustic confusion matrix $\mathbf{A} = (a\_{k,l})\_{N \times N}$ is obtained from the pairwise similarities between every two phone models, and *N* denotes the number of phone models.
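Eq. (39) can be read as scoring each phone's training samples with the other phone's model and averaging over all samples. The sketch below illustrates this with one-dimensional Gaussians standing in for the HMM phone models; the stand-in models, the sample values and the function names are all assumptions made for illustration.

```python
import math


def gaussian_likelihood(mean, std):
    """Stand-in for a phone model: returns a function x -> P(x | model)."""
    def likelihood(x):
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
    return likelihood


def acoustic_likelihood(data_k, data_l, model_k, model_l):
    """Symmetric ACL: each phone's data is scored by the other phone's model,
    and the sum is divided by the total number of samples (I + J)."""
    cross = sum(model_l(x) for x in data_k) + sum(model_k(x) for x in data_l)
    return cross / (len(data_k) + len(data_l))


# Hypothetical phones: k and l are acoustically close, "far" is not.
model_k, data_k = gaussian_likelihood(0.0, 1.0), [-0.5, 0.0, 0.5]
model_l, data_l = gaussian_likelihood(0.5, 1.0), [0.0, 0.5, 1.0]
model_far, data_far = gaussian_likelihood(5.0, 1.0), [4.5, 5.0, 5.5]

print(acoustic_likelihood(data_k, data_l, model_k, model_l))
print(acoustic_likelihood(data_k, data_far, model_k, model_far))
```

Because both cross terms are summed, the measure is unchanged when the two phones are swapped, so the resulting matrix is symmetric.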

#### *4.2.2. Contextual analysis*

A co-articulation pattern can be considered a semantically plausible combination of phones. This section presents a text mining framework that automatically induces co-articulation patterns from a mixed-language or multilingual corpus. A crucial step in inducing the co-articulation patterns is to represent speech intonation as well as combinations of phones. To this end, the hyperspace analog to language (HAL) model constructs a high-dimensional contextual space for the mixed-language or multilingual corpus. Each context-dependent triphone in the HAL space is represented as a vector of its context phones, which means that the sense of a phone can be co-articulated through its context phones. This notion is derived from the observation of articulation behavior: if two phones share more common contexts, they are articulated more similarly.

The HAL model represents the multilingual triphones as vectors. Each dimension of a vector is a weight representing the strength of association between the target phone and one of its context phones. The weights are computed by sliding an observation window of fixed length over the corpus. All phones within the window are considered to be co-articulated with each other. For any two phones at distance *d* within the window, the weight between them is defined as the window length minus *d* plus 1. After moving the window in one-phone increments over the sentence, the HAL space $\mathbf{G} = (g\_{k,l})\_{N \times N}$ is constructed. The resultant HAL space is an $N \times N$ matrix, where *N* is the number of triphones.

Table 3 presents the HAL space for the example English and Mandarin mixed sentence "查一下<look up> ( CH A @ I X I A ) Baghdad ( B AE G D AE D )". For each phone in Table 3, the corresponding row vector represents its left contextual information, i.e. the weights of the phones preceding it, and the corresponding column vector represents its right contextual information. $w\_{k,l}$ indicates the *k*-th weight of the *l*-th triphone. The weights in the vector are then re-estimated as follows.

$$
\overline{w}\_{k,l} = w\_{k,l} \times \log \frac{N}{N\_l} \tag{40}
$$

where *N* denotes the total number of phone vectors and $N\_l$ represents the number of vectors of phone $l$ with nonzero dimension. After each dimension is re-weighted, the HAL space is transformed into a probabilistic framework, and thus each weight can be redefined as

$$
\hat{w}\_{k,l} = \frac{\overline{w}\_{k,l}}{\sum\_{k'=1}^{N} \overline{w}\_{k',l}} \tag{41}
$$

To generate a symmetric matrix, the weights are averaged as

$$g\_{k,l} = \frac{\hat{w}\_{k,l} + \hat{w}\_{l,k}}{2}, \quad 1 \le k, l \le N$$
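The construction of the HAL space can be sketched on the example sentence. This is a simplified illustration under several assumptions: the window length of 3 is chosen arbitrarily, plain phones are used instead of triphones, the re-weighting of Eq. (40) is skipped, and each phone's vector is normalised directly before the symmetric averaging.

```python
def build_hal(phones, window):
    """Return (vocab, matrix); matrix[row][col] accumulates window - d + 1 for every
    occurrence of vocab[col] found d positions before vocab[row] (left context)."""
    vocab = list(dict.fromkeys(phones))  # unique phones in first-seen order
    index = {p: i for i, p in enumerate(vocab)}
    n = len(vocab)
    hal = [[0.0] * n for _ in range(n)]
    for pos, target in enumerate(phones):
        for d in range(1, window + 1):
            if pos - d < 0:
                break
            hal[index[target]][index[phones[pos - d]]] += window - d + 1
    return vocab, hal


def normalize_rows(matrix):
    """Scale each phone's vector to sum to one; all-zero vectors are left as zeros."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def symmetrize(matrix):
    """g[k][l] = (w[k][l] + w[l][k]) / 2."""
    n = len(matrix)
    return [[(matrix[k][l] + matrix[l][k]) / 2 for l in range(n)] for k in range(n)]


phones = "CH A @ I X I A B AE G D AE D".split()
vocab, hal = build_hal(phones, window=3)
g = symmetrize(normalize_rows(hal))
for phone, row in zip(vocab, hal):
    print(phone, row)
```

Each printed row lists the accumulated weights of the phones that precede the given phone, in the order CH, A, @, I, X, B, AE, G, D.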

#### *4.2.3. Fusion of confusing matrices and dimensional reduction*

The multidimensional scaling (MDS) method is used to project multilingual triphones onto orthogonal axes where the ranking distance relation between them can be estimated using Euclidean distance. MDS is generally a procedure that characterizes the data in terms of a matrix of pairwise distances using Euclidean distance estimation. One purpose of MDS is to reduce the data dimensionality to a low-dimensional space. The IPA-based phone alphabet for English and Mandarin contains 55 phones, which yields about 166,375 ($55 \times 55 \times 55$) triphones. When the number of target languages increases, the dimension of the confusing matrix becomes huge. Another purpose of MDS is to project the elements of the matrix onto orthogonal axes where the ranking distance relation between elements in the confusion matrix can be estimated. In contrast to the hierarchical clustering method (Mak et al. 1996), this section applies MDS to the global similarity measure of multilingual triphones.

**Table 3.** Example of the multilingual sentence "查一下<look up> ( CH A @ I X I A ) Baghdad ( B AE G D AE D )" in HAL space

In this section, multidimensional scaling (MDS), which is suitable for representing high-dimensional relations, is adopted to project the confusion characteristics of multilingual triphones onto a lower-dimensional space for similarity estimation. MDS is similar to principal component analysis (PCA). The difference is that MDS focuses on the distance relation between any two variables, whereas PCA focuses on the discriminative principal components of the variables. MDS is applied to estimate the pairwise similarity of triphones. The similarity matrix **V** = (*v<sub>k,l</sub>*), of size *N* × *N*, contains the pairwise similarities between every two multilingual triphones. The element in row *k* and column *l* of the similarity matrix is computed as

$$v\_{k,l} = -(\alpha \times \log(a\_{k,l}) + (1 - \alpha) \times \log(g\_{k,l})) \qquad 1 \le k, l \le N \tag{42}$$

where *α* denotes the combination weight. The sum rule of data fusion is used to combine the acoustic likelihood (ACL) and contextual analysis (HAL) confusion matrices, as shown in Figure 5.
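The fusion in Eq. (42) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `fuse_similarity`, the default weight of 0.5, and the small floor `eps` that guards against taking the logarithm of zero are assumptions made here, not part of the chapter.

```python
import numpy as np


def fuse_similarity(acl, hal, alpha=0.5, eps=1e-12):
    """Combine two N x N confusion matrices as in Eq. (42).

    acl, hal : square arrays with positive entries (ACL and HAL matrices).
    alpha    : combination weight in [0, 1].
    eps      : floor applied before the logarithm (an assumption of this sketch).
    """
    acl = np.asarray(acl, dtype=float)
    hal = np.asarray(hal, dtype=float)
    if acl.ndim != 2 or acl.shape[0] != acl.shape[1] or acl.shape != hal.shape:
        raise ValueError("expected two square matrices of the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Clip so that log() never sees zero; entries >= eps are left unchanged.
    log_a = np.log(np.clip(acl, eps, None))
    log_g = np.log(np.clip(hal, eps, None))
    return -(alpha * log_a + (1.0 - alpha) * log_g)
```

With `alpha=1.0` the result depends only on the acoustic matrix, and with `alpha=0.0` only on the contextual one, so the weight interpolates between the two sources of evidence.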

MDS is then adopted to project the triphones onto orthogonal axes, where the ranking distance relation between triphones can be estimated from the similarity matrix of the triphones. The first step of MDS is to obtain the matrix

$$\mathbf{B} = \mathbf{H} \mathbf{S} \mathbf{H} \tag{43}$$

where $\mathbf{H} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}'$ is the centering matrix, **I** is the identity matrix, and **1** is the vector of ones. The elements of matrix **B** are computed as

$$b\_{kl} = s\_{kl} - \overline{s}\_{k\bullet} - \overline{s}\_{\bullet l} + \overline{s}\_{\bullet \bullet} \tag{44}$$

where

$$\overline{s}\_{k\bullet} = \sum\_{l=1}^{N} \frac{s\_{kl}}{N} \tag{45}$$

is the average similarity over the *k*-th row,

$$\overline{s}\_{\bullet l} = \sum\_{k=1}^{N} \frac{s\_{kl}}{N} \tag{46}$$

is the average similarity over the *l*-th column, and

$$\overline{s}\_{\bullet \bullet} = \sum\_{k=1}^{N} \sum\_{l=1}^{N} \frac{s\_{kl}}{N^2} \tag{47}$$

is the average similarity over all rows and columns of the similarity matrix. Eigenvector analysis is applied to matrix **B** to obtain the axes of each triphone in a low-dimensional space. Singular value decomposition (SVD) is applied to solve the eigenvalue and eigenvector problem. Afterwards, the first *z* nonzero eigenvalues in descending order, i.e. *λ*<sub>1</sub> ≥ *λ*<sub>2</sub> ≥ … ≥ *λ<sub>z</sub>* > 0, are obtained. The corresponding ordered eigenvectors are denoted as **u**. Then, each triphone is represented by a projected vector as

$$\mathbf{Y} = [\sqrt{\lambda\_1}\mathbf{u}\_1, \sqrt{\lambda\_2}\mathbf{u}\_2, \ldots, \sqrt{\lambda\_z}\mathbf{u}\_z] \tag{48}$$
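The projection in Eqs. (43) to (48) can be illustrated with a short NumPy sketch. Several choices below are assumptions of this sketch rather than the chapter's implementation: the function names `double_center` and `mds_project`, the tolerance `tol` used to decide which eigenvalues count as nonzero, the symmetrization of **B**, and the use of the symmetric eigensolver `numpy.linalg.eigh` in place of SVD (for a symmetric positive semi-definite matrix the two give the same eigenpairs).

```python
import numpy as np


def double_center(S):
    """Return B = H S H (Eq. 43) with H = I - (1/n) 1 1'."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ S @ H


def mds_project(S, z, tol=1e-10):
    """Project n items onto at most z axes (Eq. 48).

    Row k of the returned array is the low-dimensional vector of item k.
    """
    B = double_center(S)
    # Assumption: enforce symmetry so that a symmetric eigensolver applies.
    B = 0.5 * (B + B.T)
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort in descending order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    keep = np.flatnonzero(eigvals > tol)[:z]  # first z positive eigenvalues
    # Scale each eigenvector by the square root of its eigenvalue.
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])
```

When **S** is a matrix of inner products between points, the rows of the returned array reproduce the centered inner products, which is a convenient sanity check for the double-centering step.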

#### *4.2.4. Phone clustering*

This section presents how triphones with similar acoustic and contextual properties are clustered into multilingual triphone clusters. The cosine measure between the triphone vectors **Y**<sub>*k*</sub> and **Y**<sub>*l*</sub> is adopted as follows.

$$\mathbf{C}(\mathbf{Y}\_{k},\mathbf{Y}\_{l}) = \frac{\overline{y\_{k}} \bullet \overline{y\_{l}}}{\|\overline{y\_{k}}\| \cdot \left\|\overline{y\_{l}}\right\|} = \frac{\sum\_{i=1}^{z} y\_{k,i} \times y\_{l,i}}{\sqrt{\sum\_{i=1}^{z} y\_{k,i}^{2}} \times \sqrt{\sum\_{i=1}^{z} y\_{l,i}^{2}}} \tag{49}$$

where *y<sub>k,i</sub>* and *y<sub>l,i</sub>* are the *i*-th elements of the triphone vectors **Y**<sub>*k*</sub> and **Y**<sub>*l*</sub>, respectively. The modified *k*-means (MKM) algorithm is applied to cluster all the triphones into a compact phonetic set. Convergence of the closeness measure is determined by a preset threshold.
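A minimal sketch of the cosine measure of Eq. (49), together with one assignment step of a centroid-based clustering, is given below. The chapter does not detail the MKM algorithm, so `assign_to_clusters` is a hypothetical helper that shows only how the cosine measure could drive cluster assignment; it is not the MKM procedure itself.

```python
import numpy as np


def cosine_measure(y_k, y_l):
    """Cosine measure between two triphone vectors (Eq. 49)."""
    y_k = np.asarray(y_k, dtype=float)
    y_l = np.asarray(y_l, dtype=float)
    denom = np.linalg.norm(y_k) * np.linalg.norm(y_l)
    if denom == 0.0:
        raise ValueError("cosine measure is undefined for a zero vector")
    return float(np.dot(y_k, y_l) / denom)


def assign_to_clusters(Y, centroids):
    """Assign each row of Y to the centroid with the highest cosine measure.

    Hypothetical helper: one assignment step of a k-means style loop.
    Rows of Y and of centroids are assumed to be nonzero.
    """
    Y = np.asarray(Y, dtype=float)
    C = np.asarray(centroids, dtype=float)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    # Entry (k, c) of Yn @ Cn.T is the cosine measure of item k and centroid c.
    return np.argmax(Yn @ Cn.T, axis=1)
```

Because the measure depends only on the angle between vectors, two triphones whose projected vectors differ only in length are treated as identical.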

**Figure 5.** An illustration of fusion of acoustic likelihood (ACL) and contextual analysis (HAL) confusing matrices for the MDS process

#### **4.3. Experimental evaluations**

For evaluation, an in-house multilingual speech recognizer was implemented and experiments were conducted to evaluate the performance of the proposed approach on an English-Mandarin multilingual corpus.

#### *4.3.1. Multilingual database*

In Taiwan, English and Mandarin are both widely used in conversation, culture, media, and everyday life. For bilingual corpus collection, the English Across Taiwan (EAT) project (EAT [online] http://www.aclclp.org.tw/), sponsored by the National Science Council, Taiwan, prepared 600 recording sheets. Each sheet contains 80 sentences to be read, including long English sentences, short English sentences, English words, and mixed English-Mandarin sentences. Each sheet was recorded separately by English-major students and non-English-major students. The microphone corpus was recorded as sound files with a 16 kHz sampling rate and 16-bit sample resolution. The recording information of the EAT corpus is summarized in Table 4. In this section, the mixed English-Mandarin sentences from the microphone corpus were used. The average sentence length is around 12.62 characters.


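| | English-major, male | English-major, female | Non-English-major, male | Non-English-major, female |
|---|---|---|---|---|
| No. of sentences | 11,977 | 30,094 | 25,432 | 15,540 |
| No. of speakers | 166 | 406 | 368 | 224 |

**Table 4.** EAT-MIC Multilingual Corpus Information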

#### *4.3.2. Evaluation of the phone set generation based on acoustic and contextual analysis*

In this section, the phone recognition rate was adopted to evaluate acoustic modeling accuracy. Three classes of speech recognition errors were considered: insertion errors (*Ins*), deletion errors (*Del*) and substitution errors (*Sub*). The fusion of acoustic and contextual analysis was applied to generate the multilingual triphone set. Since the optimal number of acoustic model clusters was unknown, several sets of HMMs were produced by varying the MKM convergence threshold during multilingual triphone clustering. Three approaches were compared: acoustic likelihood (ACL), contextual analysis (HAL), and the fusion of acoustic and contextual analysis (FUN). The proposed fusion method achieves better results than the individual ACL or HAL methods. Comparing acoustic and contextual analysis, HAL achieves a higher recognition rate than ACL, which indicates that contextual analysis is more significant than acoustic analysis for clustering confusable multilingual phones. The curves show that phone accuracy first increases with the number of states and finally decreases, owing to the confusing triphone definition and the need for a large multilingual training corpus. The proposed multilingual phone generation approach performs better than the ordinary multilingual triphone sets. In this section, the English and Mandarin triphone sets are defined based on an expansion of the IPA definition. The multilingual speech recognition system for English and Mandarin contains 924 context-dependent triphone models. The best phone recognition accuracy was 67.01%, obtained with a HAL window size of 3. Therefore, this setting was applied in the following experiments.
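The chapter does not spell out how phone accuracy is derived from the three error classes. The sketch below assumes the conventional definition, accuracy = (*N* − *Sub* − *Del* − *Ins*) / *N*, where *N* is the number of reference phones, and counts the errors with a standard minimum edit distance alignment. The function names `count_errors` and `phone_accuracy` are hypothetical.

```python
def count_errors(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment.

    ref, hyp : sequences of phone labels (reference and recognizer output).
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (cost, sub, del, ins) for ref[:i] aligned with hyp[:j].
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (c, s, d, n)              # match, no cost
            else:
                best = (c + 1, s + 1, d, n)      # substitution
            c, s, d, n = dp[i - 1][j]
            if c + 1 < best[0]:
                best = (c + 1, s, d + 1, n)      # deletion
            c, s, d, n = dp[i][j - 1]
            if c + 1 < best[0]:
                best = (c + 1, s, d, n + 1)      # insertion
            dp[i][j] = best
    return dp[-1][-1][1:]


def phone_accuracy(n_ref, n_sub, n_del, n_ins):
    """Phone accuracy in percent under the assumed conventional definition."""
    if n_ref <= 0:
        raise ValueError("the reference must contain at least one phone")
    return 100.0 * (n_ref - n_sub - n_del - n_ins) / n_ref
```

Note that insertions are penalized under this definition, so the accuracy can fall below the plain percentage of correctly recognized phones.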

#### *4.3.3. Comparison of acoustic and language models for multilingual speech recognition*

Table 5 compares different acoustic and language models for multilingual speech recognition. For the comparison of monophone-based and triphone-based recognition, four phone inventory definitions were considered: the direct combination of language-dependent phones (MIX), the language-dependent IPA phone definition (IPA), the tree-based clustering procedure (TRE) (Mak et al. 1996), and the proposed method (FUN). Mandarin can be represented by 37 fundamental phones and English by 39 fundamental phones. The phone set for the direct combination of English and Mandarin therefore contains 78 phones, including two silence models. The phone set for the IPA definition of English and Mandarin contains 55 phones.


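| | Monophone: MIX | Monophone: IPA | Triphone: TRE | Triphone: FUN |
|---|---|---|---|---|
| Phone models | 78 | 55 | 1172 | 924 |
| With language model | 45.81% | 66.05% | 76.46% | 78.18% |
| Without language model | 32.58% | 51.98% | 65.32% | 67.01% |

**Table 5.** Comparison of acoustic and language models for multilingual speech recognition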

In the acoustic comparison, multilingual context-independent (MIX and IPA) and context-dependent (TRE and FUN) phone sets were investigated. With the English and Mandarin language model, the MIX approach achieved 45.81% phone accuracy and the IPA method achieved 66.05%, so IPA performs clearly better than MIX. The TRE method achieved 76.46% phone accuracy and the proposed approach achieved 78.18%, which shows that triphone models outperform monophone models. The relative improvement is around 2.25%, from 76.46% accuracy for the TRE-based baseline system to 78.18% accuracy for the approach using acoustic and contextual analysis.
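The relative improvement quoted above follows from dividing the absolute gain by the baseline accuracy. The helper below is a small sketch of that arithmetic; the name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, proposed):
    """Relative gain of `proposed` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline accuracy must be nonzero")
    return 100.0 * (proposed - baseline) / baseline


# TRE baseline versus the proposed FUN approach, with the language model.
gain = relative_improvement(76.46, 78.18)
print(f"{gain:.2f}%")  # prints 2.25%
```

The absolute gain in this comparison is 1.72 percentage points, which is a different quantity from the relative figure.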

To evaluate the acoustic modeling performance on its own, experiments were also conducted without a language model. Without the language model, the MIX approach achieved 32.58%, the IPA method 51.98%, the TRE method 65.32%, and the proposed approach 67.01% phone accuracy. In conclusion, multilingual speech recognition obtains the best performance using the FUN approach for the context-dependent phone definition together with the language model.

#### *4.3.4. Comparison of monolingual and multilingual speech recognition*

In this experiment, the English word and English sentence utterances in the EAT corpus were used to evaluate monolingual speech recognition. A comparison of monolingual and multilingual speech recognition on the EAT corpus is shown in Table 6. In total, 2496 English words, 3072 English sentences and 5884 mixed English-Mandarin utterances were used separately for training, and another 200 utterances were used for evaluation. In the context-dependent condition without a language model, monolingual English word recognition achieved 76.25%, which is higher than the 67.42% achieved for monolingual English sentences. The phone recognition accuracy for monolingual English sentences, 67.42%, is slightly better than the 67.01% obtained for mixed English-Mandarin sentences.


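| | Monolingual: English word | Monolingual: English sent. | Multilingual: English and Mandarin mixed sent. |
|---|---|---|---|
| Training corpus | 2496 | 3072 | 5884 |
| Phone recognition accuracy | 76.25% | 67.42% | 67.01% |

**Table 6.** Comparison of monolingual and multilingual speech recognition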

#### **4.4. Conclusions**

In this section, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The context-dependent triphones are defined based on the IPA representation. The confusing characteristics of multilingual phone sets are then analyzed using acoustic and contextual information. For the acoustic analysis, the acoustic likelihood confusion matrix is constructed from the posterior probabilities of the triphones. For the contextual analysis, the hyperspace analog to language (HAL) approach is employed. Using multidimensional scaling and data fusion, the combined matrix is built and each phone is represented as a vector. Finally, the modified *k*-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach gives encouraging results.

## **5. Conclusions**

In this chapter, speech recognition techniques for adverse environments are presented. For speech recognition in noisy environments, two approaches to cepstral feature enhancement using noise-normalized stochastic vector mapping are described. Experimental results show that the proposed approach outperformed the SPLICE-based approach without stereo data on the AURORA2 database. For speech recognition in disfluent environments, an approach to edit disfluency detection and correction for rich transcription is presented. The proposed theoretical approach, based on a two-stage process, aims to model the behavior of edit disfluency and to clean up the disfluency. Experimental results indicate that the IP detection mechanism is able to recall IPs by adjusting the threshold in hypothesis testing. For speech recognition in multilingual environments, the fusion of acoustic and contextual analysis is proposed to generate phonetic units for mixed-language or multilingual speech recognition. The confusing characteristics of multilingual phone sets are analyzed using acoustic and contextual information. The modified *k*-means algorithm is used to cluster the multilingual triphones into a compact and robust phone set. Experimental results show that the proposed approach improves recognition accuracy in multilingual environments.

## **Author details**

Chung-Hsien Wu\* and Chao-Hong Liu

*Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.*

<sup>\*</sup> Corresponding Author

## **Acknowledgement**

This work was partially supported by NCKU Project of Promoting Academic Excellence & Developing World Class Research Centers.

## **6. References**

Bear, J., J. Dowding and E. Shriberg (1992). *Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog*. *Proc. of ACL*. Newark, Delaware, USA, Association for Computational Linguistics: 56-63.

Benveniste, A., M. Métivier and P. Priouret (1990). *Adaptive Algorithms and Stochastic Approximations*. *Applications of Mathematics*. New York, Springer. 22.

Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. *IEEE Transactions on Acoustics, Speech and Signal Processing*, Vol. 27. No. 2. pp. 113-120.

Charniak, E. and M. Johnson (2001). *Edit detection and parsing for transcribed speech*. *Proc. of NAACL*, Association for Computational Linguistics: 118-126.

Chen, Y. J., C. H. Wu, Y. H. Chiu and H. C. Liao (2002). Generation of robust phonetic set and decision tree for Mandarin using chi-square testing. *Speech Communication*, Vol. 38. No. 3-4. pp. 349-364.

Deng, L., A. Acero, M. Plumpe and X. Huang (2000). Large-vocabulary speech recognition under adverse acoustic environments. *Proc. ICSLP-2000*, Beijing, China.

Deng, L., J. Droppo and A. Acero (2003). Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. *IEEE Transactions on Speech and Audio Processing*, Vol. 11. No. 6. pp. 568-580.

Furui, S., M. Nakamura, T. Ichiba and K. Iwano (2005). Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese. *Speech Communication*, Vol. 47. No. 1-2. pp. 208-219.

Gales, M. J. F. and S. J. Young (1996). Robust continuous speech recognition using parallel model combination. *IEEE Transactions on Speech and Audio Processing*, Vol. 4. No. 5. pp. 352-359.

Goldberger, J. and H. Aronowitz (2005). *A distance measure between gmms based on the unscented transform and its application to speaker recognition*. *Proc. of EUROSPEECH*. Lisbon, Portugal: 1985-1988.

Hain, T., P. C. Woodland, G. Evermann, M. J. F. Gales, X. Liu, G. L. Moore, D. Povey and L. Wang (2005). Automatic transcription of conversational telephone speech. *IEEE Transactions on Speech and Audio Processing*, Vol. 13. No. 6. pp. 1173-1185.

Heeman, P. A. and J. F. Allen (1999). Speech repairs, intonational phrases, and discourse markers: modeling speakers' utterances in spoken dialogue. *Computational Linguistics*, Vol. 25. No. 4. pp. 527-571.

Hermansky, H. and N. Morgan (1994). RASTA processing of speech. *IEEE Transactions on Speech and Audio Processing*, Vol. 2. No. 4. pp. 578-589.

Hieronymus, J. L. (1993). ASCII phonetic symbols for the world's languages: Worldbet. *Journal of the International Phonetic Association*, Vol. 23.

Hsieh, C. H. and C. H. Wu (2008). Stochastic vector mapping-based feature enhancement using prior-models and model adaptation for noisy speech recognition. *Speech Communication*, Vol. 50. No. 6. pp. 467-475.

Huang, C. L. and C. H. Wu (2007). Generation of phonetic units for mixed-language speech recognition based on acoustic and contextual analysis. *IEEE Transactions on Computers*, Vol. 56. No. 9. pp. 1225-1233.

Johnson, M. and E. Charniak (2004). *A TAG-based noisy channel model of speech repairs*. *Proc. of ACL*, Association for Computational Linguistics: 33-39.

Kohler, J. (2001). Multilingual phone models for vocabulary-independent speech recognition tasks. *Speech Communication*, Vol. 35. No. 1-2. pp. 21-30.

Liu, Y., E. Shriberg, A. Stolcke and M. Harper (2005). *Comparing HMM, maximum entropy, and conditional random fields for disfluency detection*. *Proc. of Eurospeech*: 3313-3316.

Macho, D., L. Mauuary, B. Noé, Y. M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D. Pearce and F. Saadoun (2002). Evaluation of a noise-robust DSR front-end on Aurora databases. *Proc. ICSLP-2002*, Denver, Colorado, USA.

Mak, B. and E. Barnard (1996). *Phone clustering using the Bhattacharyya distance*. *Proc. ICSLP*, IEEE. 4: 2005-2008.

Mathews, R. H. (1979). *Mathews' Chinese-English Dictionary*, Harvard University Press.

Savova, G. and J. Bachenko (2003). *Prosodic features of four types of disfluencies*. *Proc. of DiSS*: 91-94.

Shriberg, E., L. Ferrer, S. Kajarekar, A. Venkataraman and A. Stolcke (2005). Modeling prosodic feature sequences for speaker recognition. *Speech Communication*, Vol. 46. No. 3-4. pp. 455-472.

Shriberg, E., A. Stolcke, D. Hakkani-Tur and G. Tur (2000). Prosody-based automatic segmentation of speech into sentences and topics. *Speech Communication*, Vol. 32. No. 1-2. pp. 127-154.

Snover, M., B. Dorr and R. Schwartz (2004). *A lexically-driven algorithm for disfluency detection*. *Proc. of HLT/NAACL*, Association for Computational Linguistics: 157-160.

Soltau, H., B. Kingsbury, L. Mangu, D. Povey, G. Saon and G. Zweig (2005). The IBM 2004 conversational telephony system for rich transcription. *Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05)*, Philadelphia, USA.

Waibel, A., H. Soltau, T. Schultz, T. Schaaf and F. Metze (2000). Multilingual Speech Recognition. *Verbmobil: foundations of speech-to-speech translation*, Springer-Verlag.

Wells, J. C. (1989). Computer-coded phonemic notation of individual languages of the European Community. *Journal of the International Phonetic Association*, Vol. 19. No. 1. pp. 31-54.

Wu, C. H., Y. H. Chiu, C. J. Shia and C. Y. Lin (2006). Automatic segmentation and identification of mixed-language speech using delta-BIC and LSA-based GMMs. *IEEE Transactions on Audio, Speech, and Language Processing*, Vol. 14. No. 1. pp. 266-276.

Wu, C. H. and G. L. Yan (2004). Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition. *Journal of VLSI Signal Processing Systems*, Vol. 36. No. 2. pp. 91-104.

Wu, J. and Q. Huo (2002). An environment compensated minimum classification error training approach and its evaluation on Aurora2 database. *Proc. ICSLP-2002*, Denver, Colorado, USA.

Wu, Z. and M. Palmer (1994). *Verbs semantics and lexical selection*. *Proc. 32nd ACL*, Association for Computational Linguistics: 133-138.

Yeh, J. F. and C. H. Wu (2006). Edit disfluency detection and correction using a cleanup language model and an alignment model. *IEEE Transactions on Audio, Speech, and Language Processing*, Vol. 14. No. 5. pp. 1574-1583.

Young, S. J., J. Odell and P. Woodland (1994). *Tree-based state tying for high accuracy acoustic modelling*. *Proc. ARPA Human Language Technology Conference*. Plainsboro, USA, Association for Computational Linguistics: 307-312.

**Chapter 2**

**Speech Recognition for Agglutinative Languages**

R. Thangarajan

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50140

© 2012 Thangarajan, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

Speech technology is a broad area comprising many applications such as speech recognition, Text to Speech (TTS) synthesis, speaker identification and verification, and language identification. Different applications of speech technology impose different constraints on the problem, and these are tackled by different algorithms. In this chapter, the focus is on automatically transcribing speech utterances into text of a given language, a process called Automatic Speech Recognition (ASR). Even after years of extensive research and development, ASR remains a challenging field of research. In recent years, however, ASR technology has matured to a level where the success rate is high in certain domains. A well-known example is human-computer interaction, where speech is used as an interface with or without other pointing devices. ASR is fundamentally a statistical problem. Its objective is to find the most likely sequence of words, called the hypothesis, for a given sequence of observations. The sequence of observations consists of acoustic feature vectors representing the speech utterance. The performance of an ASR system can be measured by aligning the hypothesis with the reference text and counting errors such as deletions, insertions and substitutions of words in the hypothesis.

ASR is a subject involving signal processing and feature extraction, acoustics, information theory, linguistics and computer science. Speech signal processing helps in robustly extracting relevant and discriminative information, called features, from the speech signal. Robustness involves spectral analysis, used to characterize the time-varying properties of the speech signal, and speech enhancement techniques that make features resilient to noise. Acoustics provides the necessary understanding of the relationship between speech utterances and the physiological processes of speech production and speech perception. Information theory provides the necessary procedures for estimating the parameters of statistical models during the training phase. Computer science plays a major role in ASR through the implementation of efficient algorithms, in software or hardware, for decoding speech in real time.
