**A Particle Filter Compensation Approach to Robust Speech Recognition**

Aleem Mushtaq

Modern Speech Recognition Approaches with Case Studies


http://dx.doi.org/10.5772/51532

## **1. Introduction**

The speech production mechanism goes through several stages. First, a thought is generated in the speaker's mind. The thought is put into a sequence of words, and these words are converted into a speech signal using various muscles, including those of the face, chest and tongue. This signal is distorted by environmental factors such as background noise, reverberation and channel distortions when sent through a microphone, a telephone channel, etc. The aim of Automatic Speech Recognition (ASR) systems is to reconstruct the spoken words from the speech signal. From an information theoretic [1] perspective, we can treat everything between the speaker and the machine as a distortion channel, as shown in figure 1.

**Figure 1.** Information theoretic view of Speech Recognition

Here, $W$ represents the spoken words and $X$ is the speech signal. The problem of extracting $W$ from $X$ can be viewed as finding the word sequence that most likely resulted in the observed signal $X$, as given in equation (1)

$$
\hat{W} = \underset{W}{\text{arg}\,\text{max}}\, p(X \mid W) \tag{1}
$$

Like any other machine learning/pattern recognition problem, the likelihood $p(X \mid W)$ plays a fundamental role in the decoding process. This distribution is parametric, and its parameters are estimated from the available training data. Modern ASR systems do well when the environment of the speech signal being tested matches that of the training data, because the parameter values then correspond well to the speech signal being decoded. However, if the environments of the training and testing data do not match well, the performance of ASR systems degrades. Many schemes have been proposed to overcome this problem, but humans still outperform these systems, especially in adverse conditions.
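To make the decoding rule of equation (1) concrete, here is a toy sketch: two hypothetical single-Gaussian "word models" (invented stand-ins for real HMM acoustic models, with made-up means and variances) score an observed feature value, and the word whose model assigns the highest likelihood is chosen.

```python
import math

# Toy illustration of equation (1): two hypothetical single-Gaussian "word
# models" (mean, std) stand in for real HMM acoustic models.
word_models = {"yes": (1.0, 0.5), "no": (-1.0, 0.5)}

def log_likelihood(x, mean, std):
    # log of a univariate Gaussian density N(x; mean, std^2)
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def decode(x):
    # W_hat = argmax_W p(X | W), as in equation (1)
    return max(word_models, key=lambda w: log_likelihood(x, *word_models[w]))
```

For a feature value near 1.0, `decode` returns "yes"; near -1.0 it returns "no". A real system replaces the scalar feature with an MFCC vector sequence and the Gaussians with HMMs.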

The approaches to overcome this problem fall into two categories. One is to adapt the parameters of $p(X \mid W)$ so that they match the testing environment better; the other is to choose features $X$ that are more robust to environment variations. The features can also be transformed to make them better suited to the parameters of $p(X \mid W)$ obtained from the training data.

## **1.1. Typical ASR system**

Typical ASR systems for small vocabularies comprise three main components, as shown in figure 2. Speech data is available as a waveform, which is first converted into feature vectors. Mel Frequency Cepstrum Coefficient (MFCC) [2] features have been widely used in the speech community for the task of speech recognition due to their superior discriminative capability.

**Figure 2.** Typical ASR System

The features from an available training speech corpus are used to estimate the parameters of the acoustic models. An acoustic model for a particular speech unit, say a phoneme or a word, is the likelihood of observing that unit given the features, as in equation (1). The most commonly used structure for acoustic models in ASR systems is the Hidden Markov Model (HMM); these models capture the dynamics and variations of the speech signal well. The test speech signal is then decoded using a Viterbi decoder.

## **1.2. Distortions in speech**

The distortions in a speech signal can be viewed in the signal space, the feature space and the model space [3], as shown in figure 3. Resilience to environmental distortions can be added in the feature extraction process, by modifying the distorted features, or by adapting the acoustic models to better match the environment from which the test signal emanated. $S$ and $F$ represent the speech signal and the speech features respectively, and $M$ represents the acoustic models.

**Figure 3.** Stages where noise robustness can be added

In stage 1, the feature extraction process is improved so that the features are robust to distortions. In stage 2, the features are modified to better match the training environment. The mismatch in this stage is usually modeled by nuisance parameters, which are estimated from the environment and the test data; their effect is minimized based on some optimality criterion. In stage 3, the acoustic models are improved to better match the testing environment. One way to achieve this is multi-condition training, i.e., using data from diverse environments to train the models. Another is to transform the models, where the transformation matrix is obtained from the test environment.

## **1.3. Speech and noise tracking for noise compensation**

A sequential Monte Carlo feature compensation algorithm was initially proposed [4-5] in which the noise was treated as the state variable while the speech was considered a signal corrupting the observed noise, and a vector Taylor series (VTS) approximation was used to recover the clean speech by applying a minimum mean square error (MMSE) procedure. In [5], extended Kalman filters were used to model a dynamical system representing the noise; this was further improved by using Polyak averaging and feedback with a switching dynamical system [6]. These were initial attempts to incorporate particle filters into speech recognition in an indirect fashion, since the filter tracked the noise instead of the speech signal itself. Because the speech signal is treated as a signal corrupting the noise, little or no information readily available from the HMMs or the recognition process can be utilized efficiently in the compensation process.


Particle filters are powerful numerical mechanisms for sequential signal modeling and are not constrained by the conventional linearity and Gaussianity requirements [7]. The particle filter generalizes the Kalman filter [8] and is more flexible than the extended Kalman filter [9], because the stage-by-stage linearization of the state space model is no longer required [7]. One difficulty in using particle filters lies in obtaining a state space model for speech, as consecutive speech features are usually highly correlated. Just as in the Kalman filter and HMM frameworks, state transition is an integral part of particle filter algorithms.

In contrast to the previous particle filter attempts [4-6], we describe a method in this chapter where we treat the speech signal as the state variable and the noise as the corrupting signal, and attempt to estimate clean speech from noisy speech. We incorporate statistical information available in the acoustic models of clean speech, e.g., the HMMs trained with clean speech, as an alternative state transition model [10-11]. The similarity between HMMs and particle filters can be seen from the fact that an observation probability density function corresponding to each state of an HMM describes, in statistical terms, the characteristics of the source generating a signal of interest when the source is in that particular state, whereas in particle filters we try to estimate the probability distribution of the state the system is in when it generates the observed signal of interest. Particle filters are suited for feature compensation because the probability density of the state can be updated dynamically on a sample-by-sample basis. The state densities of the HMMs, on the other hand, are assumed independent of each other. Although they are good for speech inference problems, HMMs do not adapt well in fast changing environments.

By establishing a close interaction of the particle filters and HMMs, the potentials of both models can be harnessed in a joint framework to perform feature compensation for robust speech recognition. We improve the recognition accuracy through compensation of noisy speech, and we enhance the compensation process by utilizing information in the HMM state transition and mixture component sequences obtained in the recognition process. When state sequence information is available we found we can attain a 67% digit error reduction from multi-condition training in the Aurora-2 connected digit recognition task. If the missing parameters are estimated in the operational situations we only observe a 13% error reduction in the current study. Moreover, by tracking the speech features, compensation can be done using only partial information about noise and consequently good recognition performance can be obtained despite potential distortion caused by nonstationary noise within an utterance.

The remainder of the chapter is organized as follows. In section 2, tracking schemes in general are described, followed by an explanation of the well-known Kalman filter tracking algorithm. Particle filters, which form the backbone of the particle filter compensation (PFC) approach, are also described in this section. In section 3, the steps involved in tracking and then extracting the clean speech signal from the noisy speech signal are laid out. We also discuss various methods to obtain the information required to couple the particle filters and the HMMs in a joint framework. Finally, the experimental results and performance comparison for PFC are given before drawing the conclusions in section 4.

## **2. Tracking algorithms**


Tracking is the problem of estimating the trajectory of an object in a space as it moves through that space. The space could be an image plane captured directly from a camera, or it could be synthetically generated from a radar sweep. Generally, tracking schemes can be applied to any system that can be represented by a time dynamical system, which consists of a state space model and an observation model:

$$\begin{aligned} \mathbf{x}\_t &= f(\mathbf{x}\_{t-1}, \mathbf{w}\_t) \\ \mathbf{y}\_t &= h(\mathbf{x}\_t, \mathbf{n}\_t) \end{aligned} \tag{2}$$

where $\mathbf{n}_t$ is the observation noise and $\mathbf{w}_t$ is the process noise, which represents the model uncertainties in the state transition function $f(\cdot)$. What is available is an observation $\mathbf{y}_t$, which is a function of $\mathbf{x}_t$. We are interested in finding a good estimate of the current state given the observations up to the current time $t$, i.e., $p(\mathbf{x}_t \mid y_t, y_{t-1}, \dots, y_0)$. The state space model $f(\cdot)$ represents the relation between states adjacent in time. The model in equation (2) assumes that the state sequence is a one-step Markov process

$$f(\mathbf{x}\_{t+1} \mid \mathbf{x}\_t, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_0) = f(\mathbf{x}\_{t+1} \mid \mathbf{x}\_t) \tag{3}$$

It is further assumed that observations are independent of one another

$$f(y\_{t+1} \mid \mathbf{x}\_{t+1}, y\_t, \dots, y\_0) = f(y\_{t+1} \mid \mathbf{x}\_{t+1}) \tag{4}$$
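The dynamical system of equation (2), with the Markov and independence assumptions of equations (3) and (4), can be simulated directly. The sketch below uses hypothetical linear choices for $f$ and $h$ and Gaussian noise; only the structure (the state evolves from the previous state, each observation depends only on the current state) mirrors the text.

```python
import random

# Simulating the time dynamical system of equation (2) with hypothetical
# choices for f and h: the state is a one-step Markov process (equation (3))
# and each observation depends only on the current state (equation (4)).
random.seed(0)

def f(x_prev, w):
    return 0.95 * x_prev + w          # state transition with process noise w_t

def h(x, n):
    return x + n                      # observation with observation noise n_t

x = 5.0                               # initial state
xs, ys = [], []
for t in range(50):
    x = f(x, random.gauss(0.0, 0.1))             # x_t = f(x_{t-1}, w_t)
    xs.append(x)
    ys.append(h(x, random.gauss(0.0, 0.3)))      # y_t = h(x_t, n_t)
```

The tracking problem is the inverse of this simulation: given only `ys`, recover a good estimate of `xs`.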

Tracking is a two-step process. The first step is to obtain the density of $\mathbf{x}_t$ at time $t-1$; this is called the prior density of $\mathbf{x}_t$. Once it is available, we can construct a posterior density upon availability of the observation $y_t$. The propagation step is given in equation (5), and the update step is obtained using Bayes' theorem (equation (6)).

$$f(\mathbf{x}\_t \mid y\_{t-1}, \dots, y\_0) = \int f(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}) f(\mathbf{x}\_{t-1} \mid y\_{t-1}, \dots, y\_0) d\mathbf{x}\_{t-1} \tag{5}$$

$$f(\mathbf{x}\_t \mid y\_t, y\_{t-1}, \dots, y\_0) = \frac{f(y\_t \mid \mathbf{x}\_t, y\_{t-1}, \dots, y\_0) f(\mathbf{x}\_t \mid y\_{t-1}, \dots, y\_0)}{f(y\_t \mid y\_{t-1}, \dots, y\_0)}\tag{6}$$

#### **2.1. Kalman filter**

When the state transition and observation equations are linear and the noises are Gaussian, the recursions in equations (5) and (6) have a closed-form solution: the Kalman filter [8]. The system of equation (2) becomes

$$\begin{cases} \mathbf{x}\_{t+1} = \mathbf{A}\_t \mathbf{x}\_t + \mathbf{w}\_t \\ \mathbf{y}\_t = \mathbf{C}\_t \mathbf{x}\_t + \mathbf{n}\_t \end{cases} \tag{7}$$

where $\mathbf{w}_t \sim N(\mathbf{0}, \mathbf{Q}_t)$ and $\mathbf{n}_t \sim N(\mathbf{0}, R_t)$. Conditioned on $\mathbf{x}_t$, the next state has

$$\begin{aligned} mean(\mathbf{x}\_{t+1} \mid \mathbf{x}\_t) &= E(A\_t \mathbf{x}\_t + \mathbf{w}\_t) = A\_t \mathbf{x}\_t\\ covariance(\mathbf{x}\_{t+1} \mid \mathbf{x}\_t) &= E(\mathbf{w}\_t \mathbf{w}\_t^T) = \mathbf{Q}\_t \end{aligned} \tag{8}$$

so the state transition density is Gaussian:

$$p(\mathbf{x}\_{t+1} \mid \mathbf{x}\_t) \sim N(A\_t \mathbf{x}\_t, Q\_t) \tag{9}$$

Since all the densities involved remain Gaussian, the filtered posterior at time $t$ is fully described by its mean $\hat{\mathbf{x}}_{t|t}$ and covariance $P_{t|t}$:

$$p(\mathbf{x}\_t \mid y\_t, y\_{t-1}, \dots, y\_0) \sim N(\hat{\mathbf{x}}\_{t|t}, P\_{t|t}) \tag{10}$$

The propagation step of equation (5) then yields the predicted density

$$p(\mathbf{x}\_{t+1} \mid y\_t, y\_{t-1}, \dots, y\_0) \sim N(A\_t \hat{\mathbf{x}}\_{t|t}, A\_t P\_{t|t} A\_t^T + \mathbf{Q}\_t) \tag{11}$$

which we write as

$$p(\mathbf{x}\_{t+1} \mid y\_t, y\_{t-1}, \dots, y\_0) \sim N(\hat{\mathbf{x}}\_{t+1 \mid t}, P\_{t+1 \mid t}) \tag{12}$$

When the observation $y_{t+1}$ arrives, the update step conditions the jointly Gaussian pair $(\mathbf{x}_{t+1}, y_{t+1})$ on $y_{t+1}$:

$$\hat{\mathbf{x}}\_{t+1|t+1} = \mathbb{E}[\mathbf{x}\_{t+1} \mid y\_{t+1}, y\_{t}, \dots, y\_{0}] = \hat{\mathbf{x}}\_{t+1|t} + \mathbb{R}\_{xy} \mathbb{R}\_{yy}^{-1} (y\_{t+1} - \mathbb{E}[y\_{t+1} \mid y\_{t}, \dots, y\_{0}]) \tag{13}$$

The cross-covariance between the state and the observation is

$$\begin{split} R\_{xy} &= \mathbb{E}[(\mathbf{x}\_{t+1} - \mathbf{E}[\mathbf{x}\_{t+1}])(y\_{t+1} - \mathbf{E}[y\_{t+1}])^T \mid y\_t, y\_{t-1}, \dots, y\_0] \\ &= \mathbb{E}[(\mathbf{x}\_{t+1} - \hat{\mathbf{x}}\_{t+1|t})(\mathbf{C}\_{t+1}(\mathbf{x}\_{t+1} - \hat{\mathbf{x}}\_{t+1|t}) + \mathbf{n}\_{t+1})^T \mid y\_t, y\_{t-1}, \dots, y\_0] \\ &= P\_{t+1|t} \mathbb{C}\_{t+1}^T \end{split} \tag{14}$$

and the innovation covariance is

$$R\_{yy} = \mathbf{C}\_{t+1} \mathbf{P}\_{t+1|t} \mathbf{C}\_{t+1}^T + R\_{t+1} \tag{15}$$

Substituting equations (14) and (15) into equation (13) gives the familiar measurement update

$$\hat{\mathbf{x}}\_{t+1|t+1} = \hat{\mathbf{x}}\_{t+1|t} + \mathbf{K}\_{t+1} (\mathbf{y}\_{t+1} - \mathbf{C}\_{t+1} \hat{\mathbf{x}}\_{t+1|t}) \tag{16}$$

where $\mathbf{K}_{t+1}$ is the Kalman gain

$$K\_{t+1} = P\_{t+1|t} \mathbf{C}\_{t+1}^T \{\mathbf{C}\_{t+1} P\_{t+1|t} \mathbf{C}\_{t+1}^T + R\_{t+1}\}^{-1} \tag{17}$$

Finally, using the Gaussian conditional covariance identity

$$\text{cov}(X \mid Y) = R\_{xx} - R\_{xy} R\_{yy}^{-1} R\_{yx} \tag{18}$$

the posterior covariance is updated as

$$\begin{split} P\_{t+1|t+1} &= P\_{t+1|t} - P\_{t+1|t} \mathbf{C}\_{t+1}^T \left( \mathbf{C}\_{t+1} P\_{t+1|t} \mathbf{C}\_{t+1}^T + \mathbf{R}\_{t+1} \right)^{-1} \mathbf{C}\_{t+1} P\_{t+1|t} \\ &= (\mathbf{I} - \mathbf{K}\_{t+1} \mathbf{C}\_{t+1}) P\_{t+1|t} \end{split} \tag{19}$$
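A minimal scalar (one-dimensional) implementation of the predict-update recursions in equations (11)-(12) and (16)-(19); the constants A, C, Q and R are hypothetical, chosen only for illustration.

```python
import random

# Scalar Kalman filter: one predict-update cycle. The constants A, C, Q, R
# are hypothetical, chosen only for illustration.
def kalman_step(x_hat, P, y, A=1.0, C=1.0, Q=0.01, R=0.1):
    # Prediction, equations (11)-(12)
    x_pred = A * x_hat
    P_pred = A * P * A + Q
    # Kalman gain, equation (17)
    K = P_pred * C / (C * P_pred * C + R)
    # Measurement update of the mean and covariance, equations (16) and (19)
    return x_pred + K * (y - C * x_pred), (1.0 - K * C) * P_pred

# Track a constant true state of 2.0 from noisy observations
random.seed(0)
x_hat, P = 0.0, 1.0
for _ in range(200):
    y = 2.0 + random.gauss(0.0, 0.1)   # y_t = C x_t + n_t
    x_hat, P = kalman_step(x_hat, P, y)
```

After a couple of hundred observations, the estimate settles near the true state and the posterior variance shrinks well below its initial value.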

#### **2.2. Grid based methods**

It is hard to obtain analytical solutions to most recursive estimation problems. If the state space of a problem is discrete, however, we can use grid based methods and still obtain the optimal solution. Assuming the state $\mathbf{x}_k$ takes $N_s$ possible values, we can represent the discrete density $p(\mathbf{x}_k \mid y_k, \dots, y_0)$ using $N_s$ samples [7]:

$$p(\mathbf{x}\_k \mid y\_k, y\_{k-1}, \dots, y\_0) = \sum\_{i=1}^{N\_s} w\_{k|k}^i \, \delta(\mathbf{x}\_k - \mathbf{x}\_k^i) \tag{20}$$

where the weights are computed as follows
$$\begin{aligned} w\_{k|k}^i &= \frac{1}{C} w\_{k|k-1}^i \, p(y\_k \mid \mathbf{x}\_k^i) \\ w\_{k|k-1}^i &= \sum\_{j=1}^{N\_s} w\_{k-1|k-1}^j \, p(\mathbf{x}\_k^i \mid \mathbf{x}\_{k-1}^j) \end{aligned} \tag{21}$$

Here $C$ is the normalizing constant that makes the total probability equal to one. The assumption that the state can be represented by a finite number of points gives us the ability to sample the whole state space. The weight $w_{k|k}^i$ represents the probability of being in state $\mathbf{x}_k^i$ when the observation at time $k$ is $y_k$. In the grid based method we construct the discrete density at every time instant in two steps. First we estimate the weights at $k$ without the current observation, $w_{k|k-1}^i$, and then update them when the observation becomes available to obtain $w_{k|k}^i$. In the propagation step we take into account the probabilities (weights) of all possible state values at $k-1$ to estimate the weights at time $k$, as shown in figure 5.

**Figure 5.** Grid based method

If the prior $p(\mathbf{x}_k^i \mid \mathbf{x}_{k-1}^j)$ and the observation probability $p(y_k \mid \mathbf{x}_k^i)$ are available, the grid based method gives us the optimal solution for tracking the state of the system. If the state of the system is not discrete, we can still obtain an approximate solution using this method. We divide the continuous space into, say, $N$ cells, and for each cell we compute the prior and posterior in a way that takes into account the range of the whole cell:

$$\begin{aligned} p(\mathbf{x}\_k^i \mid \mathbf{x}\_{k-1}^j) &= \int\_{\mathbf{x} \in \mathbf{x}\_k^i} p(\mathbf{x} \mid \overline{\mathbf{x}}\_{k-1}^j)\, d\mathbf{x} \\ p(\mathbf{y}\_k \mid \mathbf{x}\_k^i) &= \int\_{\mathbf{x} \in \mathbf{x}\_k^i} p(\mathbf{y}\_k \mid \mathbf{x})\, d\mathbf{x} \end{aligned} \tag{22}$$

where $\overline{\mathbf{x}}_{k-1}^j$ is the center of the $j$th cell at time $k-1$. The weight update in equation (21) subsequently remains unchanged.
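The two-step weight recursion of equation (21) can be sketched on a small discrete state space. The transition and observation models below are hypothetical, for illustration only; the point is the propagation sum, the likelihood update, and the normalizing constant $C$.

```python
import math

# Grid based filter: the two-step weight recursion of equation (21) on a
# small discrete state space. The transition and observation models are
# hypothetical, for illustration only.
states = [0.0, 1.0, 2.0, 3.0]                 # the N_s grid points
N = len(states)

def trans(i, j):
    # p(x_k^i | x_{k-1}^j): stay in the same cell with probability 0.8,
    # otherwise move uniformly to any other cell
    return 0.8 if i == j else 0.2 / (N - 1)

def likelihood(y, x):
    # p(y_k | x_k): unnormalized Gaussian around the grid value
    return math.exp(-0.5 * (y - x) ** 2)

def grid_step(w, y):
    # Propagation: w_{k|k-1}^i = sum_j w_{k-1|k-1}^j p(x_k^i | x_{k-1}^j)
    w_pred = [sum(trans(i, j) * w[j] for j in range(N)) for i in range(N)]
    # Update: w_{k|k}^i = (1/C) w_{k|k-1}^i p(y_k | x_k^i)
    w_post = [w_pred[i] * likelihood(y, states[i]) for i in range(N)]
    C = sum(w_post)                           # normalizing constant
    return [wi / C for wi in w_post]

w = [1.0 / N] * N                             # uniform initial weights
for y in [2.1, 1.9, 2.0]:                     # observations near grid point 2.0
    w = grid_step(w, y)
```

After these observations, most of the weight sits on the cell at 2.0, and the weights always sum to one by construction.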

#### **2.3. Particle filter method**


Particle filtering is a way to model signals emanating from a dynamical system. If the underlying state transition is known and the relationship between the system state and the observed output is available, then the system state can be found using Monte Carlo simulations [13]. Consider the discrete time Markov process such that

$$\begin{aligned} \mathbf{X}_{1} & \sim \mu(\mathbf{x}_{1}) \\ \mathbf{X}_{t} \mid \mathbf{X}_{t-1} = \mathbf{x}_{t-1} & \sim p(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}) \\ \mathbf{Y}_{t} \mid \mathbf{X}_{t} = \mathbf{x}_{t} & \sim p(\mathbf{y}_{t} \mid \mathbf{x}_{t}) \end{aligned} \tag{23}$$

We are interested in obtaining $p(\mathbf{x}_t \mid \mathbf{y}_1, \dots, \mathbf{y}_t)$ so that we have a filtered estimate of $\mathbf{x}_t$ from the measurements available so far, $\mathbf{y}_1, \dots, \mathbf{y}_t$. If the state space model for the process is available, and both the state and observation equations are linear, then the Kalman filter described above can be used to determine the optimal estimate of $\mathbf{x}_t$ given the observations, provided the process and observation noises are zero-mean white Gaussian noise and mutually independent. If the state or observation equations are nonlinear, the Extended Kalman Filter (EKF) [9], a modified form of the Kalman filter, can be used. The particle filter algorithm instead estimates the posterior density $p(\mathbf{x}_t \mid \mathbf{y}_1, \dots, \mathbf{y}_t)$ directly, represented by a finite set of support points [7]:

$$p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{y}_{t-1}, \dots, \mathbf{y}_1) = \sum_{i=1}^{N_s} w_t^i\, \delta(\mathbf{x}_t - \mathbf{x}_t^i) \tag{24}$$

where $\mathbf{x}_t^i$ for $i = 1, \dots, N_s$ are the support points and $w_t^i$ are the associated weights. We thus have a discretized and weighted approximation of the posterior density without the need for an analytical solution. Note the similarity with the grid based method; there, the support points of the discrete distribution were predefined and covered the whole space. In the particle filter algorithm, the support points are determined through importance sampling: instead of drawing from $\pi(\cdot)$, we draw points from another distribution $q(\cdot)$ and compute the weights as:

$$w^i = \frac{\pi(\mathbf{x}^i)}{q(\mathbf{x}^i)}\tag{25}$$


where $\pi(\cdot)$ is the target distribution and $q(\cdot)$ is an importance density from which we can draw samples. For the sequential case, the weights can be updated recursively:

$$w_t^i = w_{t-1}^i\, \frac{p(\mathbf{y}_t \mid \mathbf{x}_t^i)\, p(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i)}{q(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i, \mathbf{y}_t)} \tag{26}$$

The density $q(\cdot)$ propagates the samples to new positions at time $t$ given the samples at time $t-1$ and is derived from the state transition model of the system.
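A small numerical illustration of equation (25): samples drawn from $q$ and reweighted by $\pi/q$ reproduce expectations under $\pi$. The target and proposal densities below are arbitrary toy choices, not the chapter's speech models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target pi = N(2, 1) and proposal q = N(0, 2^2), chosen for illustration.
def log_pi(x):
    return -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_q(x):
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)

x = rng.normal(0.0, 2.0, size=200_000)   # draw from q, not from pi
w = np.exp(log_pi(x) - log_q(x))         # w^i = pi(x^i) / q(x^i), eq. (25)
w /= w.sum()
mean_est = float(np.sum(w * x))          # approximates E_pi[x] = 2
```

The weighted mean converges to the target mean even though no sample was ever drawn from $\pi$ itself.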

#### **3. Tracking algorithms for noise compensation**

State transition information is an integral part of the particle filter algorithm and is used to propagate the particle samples through time transitions of the signal being processed. Specifically, the state transition is important for positioning the samples at the right locations. To this end, statistics from HMMs can be used. Although HMMs have only discrete states, each state is characterized by a continuous-density Gaussian mixture model (GMM), which lets us capture part of the variation in speech features and generate particle samples for feature compensation. Using particle filter algorithms with side information about the statistics of clean speech, available in the clean HMMs, we can perform feature compensation. If the clean speech is corrupted by an additive noise, *n*, and a distortion channel, *h*, then we can represent the noise-corrupted speech with an additive noise model [14], assuming known statistics of the noise parameters,

$$y = \mathbf{x} + h + \log(1 + \exp(n - \mathbf{x} - h))\tag{27}$$

where $y = \log S_y(m_p)$, $x = \log S_x(m_p)$, $h = \log |H(m_p)|^2$, $n = \log S_N(m_p)$, and $S(m_p)$ denotes the $p$-th mel spectral component. Equation (27) follows from the power spectral relation

$$\mathcal{S}\_y(m\_p) = \mathcal{S}\_x(m\_p) \|\, H(m\_p)\|^2 + \mathcal{S}\_N(m\_p) \tag{28}$$

The additional side information needed for feature compensation is a set of nuisance parameters Φ. Similar to *stochastic matching* [3], we can iteratively estimate Φ and then decode, as shown in Figure 6:

$$\Phi' = \underset{\Phi}{\text{arg}\,\text{max}}\,\mathbf{P}(Y' \mid \Phi, \Lambda) \tag{29}$$

where *Y'* is the noisy or compensated utterance.

**Figure 6.** General feature compensation scheme


The clean HMMs and the background noise information enable us to generate appropriate samples from $q(\cdot)$ in equation (26). In our particle filter compensation (PFC) implementation, the parameters Φ correspond to the correct HMM state sequence and mixture component sequence. These sequences provide critical information for density approximation in PFC. As shown in Figure 6, this can be done in two stages. We first perform a front-end compensation of the noisy speech; recognition is then done in the second stage to generate the side information Φ so as to improve compensation. This process can be iterated, similar to what is done in maximum likelihood stochastic matching [3]. During compensation, the observed speech $\mathbf{y}$ is mapped to clean speech features $\mathbf{x}$. Clean speech cannot be represented by a finite set of points, so HMMs by themselves cannot be used directly for tracking $\mathbf{x}$. However, if an HMM $\lambda'$ is available that adequately represents the speech segment under consideration, along with an estimated state sequence $s_1, s_2, \dots, s_T$ corresponding to the $T$ feature vectors in the segment, then we can generate the samples at frame $t$ according to

$$p(\boldsymbol{x}\_t \mid \boldsymbol{x}\_{t-1}^i) \sim \sum\_{k=1}^K c\_{k,s\_t} N(\boldsymbol{\mu}\_{k,s\_t}, \boldsymbol{\Sigma}\_{k,s\_t}) \tag{30}$$

where $N(\boldsymbol{\mu}_{k,s_t}, \boldsymbol{\Sigma}_{k,s_t})$ is the $k$-th Gaussian component for state $s_t$ in $\lambda'$ and $c_{k,s_t}$ is its mixture weight. The total number of particles is fixed, and the contribution from each mixture, computed at run time, depends on its weight. We have chosen the importance sampling density $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}^i, \mathbf{y}_t)$ in equation (26) to be the prior $p(\mathbf{x}_t \mid \mathbf{x}_{t-1}^i)$ of equation (30). This is known as the sampling importance resampling (SIR) filter [7]. It is one of the simplest implementations of particle filters, and it enables the generation of samples independently of the observation. For the SIR filter, we only need to know the state and observation equations and must be able to sample from the prior as in equation (30). Also, the resampling step is applied at every stage, and the weight assigned to the $i$-th support point of the distribution of the speech signal at time $t$ is updated as:

$$w_t^i \propto p(\mathbf{y}_t \mid \mathbf{x}_t^i) \tag{31}$$
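A minimal SIR iteration under these choices might look as follows. The scalar GMM prior and the Gaussian observation likelihood are toy stand-ins for the HMM state mixtures of equation (30) and the chapter's log-spectral likelihood, so the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sir_step(n, mix_w, mix_mu, mix_sd, y, obs_sd):
    """One SIR iteration: sample particles from a GMM prior, weight them by
    the observation likelihood (eq. 31), form the point estimate, and resample."""
    comp = rng.choice(len(mix_w), size=n, p=mix_w)   # pick mixture components
    x = rng.normal(mix_mu[comp], mix_sd[comp])       # particles from the prior
    w = np.exp(-0.5 * ((y - x) / obs_sd) ** 2)       # toy Gaussian p(y | x)
    w /= w.sum()
    x_hat = float(np.sum(w * x))                     # weighted-mean estimate
    x_res = rng.choice(x, size=n, p=w)               # resampling step
    return x_hat, x_res

mix_w = np.array([0.5, 0.5])
mix_mu = np.array([-1.0, 3.0])
mix_sd = np.array([0.5, 0.5])
x_hat, particles = sir_step(50_000, mix_w, mix_mu, mix_sd, y=3.1, obs_sd=0.5)
```

With an observation near 3, essentially all posterior mass falls on the second prior component, and the estimate lands between the component mean and the observation.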


The procedure for obtaining the HMMs and the state sequence will be described in detail later. To obtain $p(\mathbf{y}_t \mid \mathbf{x}_t^i)$, the distribution of the log spectrum of the noise in each channel is assumed Gaussian with mean $\mu_N$ and variance $\sigma_N^2$. Assuming there is additive noise only, with no channel effects,

$$y = \text{x} + \log(\text{1} + e^{\text{n} - \text{x}}) \tag{32}$$

We are interested in evaluating $p(y \mid x)$, where $x$ represents the clean speech and $n$ is the noise with density $p_N(n)$. Then

$$p[Y < y \mid x] = p[x + \log(1 + e^{N - x}) < y \mid x] = p[N < x + \log(e^{y-x} - 1)]$$

$$p(y \mid x) = F'(y) = p_N(u)\, \frac{e^{y-x}}{e^{y-x} - 1} \tag{33}$$

where $F(\cdot)$ is the Gaussian cumulative distribution function with mean $\mu_N$ and variance $\sigma_N^2$, $p_N(\cdot)$ is the corresponding density, and $u = \log(e^{y-x} - 1) + x$. In the case of MFCC features, the nonlinear transformation is [14]

$$y = \mathbf{x} + D \log(1 + e^{D^{-1}(n-x)}) \tag{34}$$

Consequently,

$$p(y \mid \mathbf{x}) = p_N(g^{-1}(y))\, |J_{g^{-1}}(y)| \tag{35}$$

where $p_N(\cdot)$ is a Gaussian pdf, $J_{g^{-1}}(y)$ is the corresponding Jacobian, and $D$ is a discrete cosine transform matrix, which is not square and thus not invertible. To overcome this problem, we zero-pad the $\mathbf{x}$ and $\mathbf{n}$ vectors and extend $D$ to a square matrix. The variance of the noise density is obtained from the available noise samples. Once the point density of the clean speech features is available, we estimate the compensated features using a discrete approximation of the expectation:

$$\mathbf{x}\_t = \sum\_{i=1}^{N\_s} w\_t^i \mathbf{x}\_t^i \tag{36}$$

where $N_s$ is the total number of particle samples at time $t$.
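The log-spectral likelihood of equation (33) can be sanity-checked numerically: implemented as below (with assumed toy values for $x$, $\mu_N$, and $\sigma_N$), the density should integrate to one over $y \in (x, \infty)$.

```python
import numpy as np

def p_y_given_x(y, x, mu_n, sigma_n):
    """Likelihood of eq. (33) for y = x + log(1 + e^{n-x}), n ~ N(mu_n, sigma_n^2).
    Defined for y > x, since noise can only add energy in this model."""
    u = x + np.log(np.expm1(y - x))                        # inverse mapping g^{-1}(y)
    gauss = np.exp(-0.5 * ((u - mu_n) / sigma_n) ** 2) \
            / (sigma_n * np.sqrt(2.0 * np.pi))             # p_N(u)
    jac = np.exp(y - x) / np.expm1(y - x)                  # e^{y-x} / (e^{y-x} - 1)
    return gauss * jac

x, mu_n, sigma_n = 1.0, 0.5, 0.3                           # toy parameters
ys = np.linspace(x + 1e-6, x + 10.0, 200_001)
f = p_y_given_x(ys, x, mu_n, sigma_n)
mass = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(ys))) # trapezoid rule
```

Since the change of variables is exact, the numerically integrated mass should be very close to one; a large deviation would indicate a wrong Jacobian term.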

#### **3.1. Estimation of HMM side information**

As described above, it is important to obtain $\Phi = (\lambda', S)$, where $\lambda'$ is an HMM that faithfully represents the speech segment being compensated and $S = s_1, s_2, \dots, s_T$ is the state sequence corresponding to an utterance of length $T$. To obtain $\lambda'_m$ for the $m$-th word $W_m$ in the utterance, we choose the $L$-best models $\lambda_{m_1}, \lambda_{m_2}, \dots, \lambda_{m_L}$ from HMMs trained using clean speech data. The $L$ models are combined to obtain a single model $\lambda'_m$ as follows.

#### *3.1.1. Gaussian Mixtures Estimation*


To obtain the observation model for each state $j$ of model $\lambda'_m$, we concatenate mixtures from the corresponding states of all component models,

$$\hat{b}\_{j}^{\{m\}}(o) = \sum\_{l=1}^{L} \sum\_{k=1}^{K} c\_{k,j}^{\{m\_l\}} \mathcal{N}(\mu\_{k,j}^{\{m\_l\}}, \Sigma\_{k,j}^{\{m\_l\}}) \tag{37}$$

where $K$ is the number of Gaussian components in each original HMM and $L$ is the number of different words $m_l$, $l = 1, \dots, L$, in the $N$-best hypothesis. $\mu_{k,j}^{(m_l)}$ and $\Sigma_{k,j}^{(m_l)}$ are the mean and covariance of the $k$-th mixture in the $j$-th state of model $\lambda_{m_l}$. The mixture weights are normalized by scaling them according to the likelihood of occurrence of the model from which they come,

$$\hat{c}_{k,j}^{(m_l)} = c_{k,j}^{(m_l)} \times p(W_m = \lambda_{m_l}) \tag{38}$$

The mixture weight is an important parameter because it determines the number of samples that will be generated from the corresponding mixture. The state transition coefficients for $\lambda'_m$ are computed as follows:

$$\hat{a}_{ij}^{(m)} = \sum_{l=1}^{L} p[s_t^{(m_l)} = i, s_{t-1}^{(m_l)} = j \mid W_m = \lambda_{m_l}]\, p[W_m = \lambda_{m_l}] = \sum_{l=1}^{L} a_{ij}^{(m_l)}\, p[W_m = \lambda_{m_l}] \tag{39}$$
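A sketch of the mixture-level part of this combination (equations 37 and 38); the dictionary layout and the renormalization of the merged weights into a proper mixture are my assumptions:

```python
import numpy as np

def combine_state_gmms(models, model_probs):
    """Merge the same state's GMMs from L word models into one mixture.

    models:      list of dicts per model, each with 'c' (K,), 'mu' (K, D),
                 'var' (K, D) for one HMM state (layout assumed for this sketch).
    model_probs: p(W_m = lambda_{m_l}), used to rescale the weights (eq. 38).
    """
    c = np.concatenate([m['c'] * p for m, p in zip(models, model_probs)])
    mu = np.concatenate([m['mu'] for m in models])
    var = np.concatenate([m['var'] for m in models])
    c = c / c.sum()    # renormalize so the combined weights form a mixture
    return {'c': c, 'mu': mu, 'var': var}

m1 = {'c': np.array([0.6, 0.4]), 'mu': np.zeros((2, 3)), 'var': np.ones((2, 3))}
m2 = {'c': np.array([0.5, 0.5]), 'mu': np.ones((2, 3)), 'var': np.ones((2, 3))}
merged = combine_state_gmms([m1, m2], model_probs=[0.7, 0.3])
```

Components from the more likely word model end up with proportionally larger weights and therefore contribute more particles at sampling time.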

#### *3.1.2. State sequence estimation*

The recognition performance can be greatly improved if a good estimate of the HMM state sequence $S$ is available, but obtaining this sequence in a noisy operational environment is very challenging. The simplest approach is to use the decoded state sequence obtained with multi-condition trained models in an ASR recognition pass, as shown at the bottom of Figure 6. However, these states often correspond to incorrect models and can deviate significantly from the optimal sequence. Alternatively, we can determine the states (to generate samples from) sequentially during compensation. For left-to-right HMMs, given the state $s_{t-1}$ at time $t-1$, we choose $s_t$ using equation (40) as follows:

$$\begin{aligned} s\_t &\sim a\_{s\_t, s\_{t-1}} \\ s\_t &= \arg\max\_i \{ a\_{ij} \} \end{aligned} \tag{40}$$

where $a$ comes from the state transition matrix of $\lambda'_m$. The mixture indices are subsequently selected from among the mixtures corresponding to the chosen state.
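The sequential state choice of equation (40) can be sketched as follows; the row-indexing convention $a[i, j] = p(s_t = j \mid s_{t-1} = i)$ and the toy left-to-right matrix are assumptions of this sketch:

```python
import numpy as np

def next_state(a, s_prev, greedy=True, rng=None):
    """Choose s_t given s_{t-1} from the transition matrix of the combined model:
    either the most likely successor (arg max) or a random draw from the row."""
    row = a[s_prev]
    if greedy:
        return int(np.argmax(row))
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(row), p=row))

# toy 3-state left-to-right transition matrix; each row sums to one
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
```

Note that with dominant self-loops the greedy arg max tends to stay in the current state, which is why the sampled variant in equation (40) can be the more useful choice in practice.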

#### *3.1.3. Experiments*

To investigate the properties of the proposed approach, we first assume that a decent estimate of the state is available at each frame. Moreover, we assume that the speech boundaries are marked and therefore the silence and speech sections of the utterance are known. To obtain this information, we use a set of digit HMMs (18 states, 3 Gaussian mixtures) trained on clean speech represented by 23-channel mel-scale log spectral features. The speech boundaries and state information for a particular noisy utterance are then obtained through digit recognition performed on the corresponding clean speech utterance. The speech boundary information is critical because the noise statistics have to be estimated from the non-speech section of the utterance. To get the HMM needed for particle filter compensation, $L$ models $\lambda_1, \lambda_2, \dots, \lambda_L$ are selected based on the $N$-best hypothesis list. For our experiments, we set $L = 3$. We combine these models to get $\lambda'$ for the $m$-th word in the utterance. Best results are obtained if the correct word model is present in the pool of models that contribute to $\lambda'$. With this information available, the compensation of the noisy log spectral features is done using sequential importance sampling. To see the efficacy of the compensation process, we consider the noisy, clean, and compensated filter banks (channel 8) for a whole utterance, shown in Figure 7; the SNR in this case is 5 dB. It is clear that the compensated feature matches the clean feature well. It should be noted, however, that such a good restoration of the clean speech signal from the noisy signal is achievable only when a good estimate of the side information about the state and mixture component sequences is available.

**Figure 7.** Fbank channel 8 for noisy, underlying clean, and compensated speech (SNR = 5 dB).


Assuming all such information were given (the ideal oracle case), recognition can be performed on MFCCs (39 coefficients: 13 MFCCs plus their first and second time derivatives) extracted from these compensated log spectral features. The HMMs used for recognition are trained with noisy data that has been compensated in the same way as the testing data. The performance compared to multi-condition (MC) and clean condition training (Columns 5 and 6 in Table 1) is given in Column 2 of Table 1 (Adapted Model I). It is clearly noted that a very significant 67% digit error reduction was attained when the missing information was made available to us.

| SNR | Adapted Models I | Adapted Models II | Adapted Models III | MC Training | Clean Training |
|---|---|---|---|---|---|
| clean | 99.10 | 99.10 | 99.10 | 98.50 | 99.11 |
| 20 dB | 97.75 | 96.46 | 97.38 | 97.66 | 97.21 |
| 15 dB | 97.61 | 95.98 | 96.47 | 96.95 | 92.36 |
| 10 dB | 96.66 | 94.00 | 94.40 | 95.16 | 75.14 |
| 5 dB | 95.20 | 90.64 | 88.02 | 89.14 | 42.42 |
| 0 dB | 92.13 | 82.62 | 68.28 | 64.75 | 22.57 |
| -5 dB | 89.28 | 72.13 | 32.92 | 27.47 | NA |
| 0-20 dB | 95.86 | 90.23 | 88.91 | 88.73 | 65.94 |

**Table 1.** ASR word accuracy (%) comparisons for Aurora-2

In actual operational scenarios, when no side information is available, models were chosen from the N-best list while the states were computed using Viterbi decoding. Of course, the states then correspond to only one model, which might not be correct, and there might be a significant mismatch between the actual and computed states. Moreover, the misalignment of words exacerbates the problem. The results for this case (Adapted Model III, Table 1 Column 4) were only marginally better than those obtained with the multi-condition trained models. To see the effect of better-aligned states, we made use of whatever information we could get: the word boundaries were extracted from the N-best list using exhaustive search, and the states for the words between these boundaries were assigned by splitting the digits into equal-sized segments and assigning one state to each segment. This limited the damage done by state misalignment, and a 13% digit error reduction from MC training was observed (Adapted Model II in Table 1 Column 3).


## **3.2. A clustering approach to obtaining correct HMM information**

HMM states are used to spread the particles at the right locations for subsequent estimation of the underlying clean speech density. If the state is incorrect, the location of particles will be wrong and the density estimate will be erroneous. One solution is to merge the states into clusters. Since the total number of clusters can be much less than the number of states, the problem of choosing the correct information block for sample generation is simplified. A tree structure to group the Gaussian mixtures from clean speech HMMs into clusters can be built with the following distance measure [15]:

$$d(m,n) = \int g\_m(\mathbf{x}) \log \frac{g\_m(\mathbf{x})}{g\_n(\mathbf{x})} d\mathbf{x} + \int g\_n(\mathbf{x}) \log \frac{g\_n(\mathbf{x})}{g\_m(\mathbf{x})} d\mathbf{x} \tag{41}$$

For Gaussians with diagonal covariances, this evaluates to

$$d(m,n) = \sum_i \left( \frac{\sigma_m^2(i) - \sigma_n^2(i) + (\mu_n(i) - \mu_m(i))^2}{\sigma_n^2(i)} + \frac{\sigma_n^2(i) - \sigma_m^2(i) + (\mu_n(i) - \mu_m(i))^2}{\sigma_m^2(i)} \right) \tag{42}$$
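Equation (42) is straightforward to implement for diagonal covariances; this sketch (with toy parameter values) checks the two properties one expects of such a measure, $d(m,m) = 0$ and symmetry in its arguments:

```python
import numpy as np

def sym_kl_distance(mu_m, var_m, mu_n, var_n):
    """Symmetric KL-type distance of eq. (42) between two diagonal Gaussians."""
    dmu2 = (np.asarray(mu_n) - np.asarray(mu_m)) ** 2
    return float(np.sum((var_m - var_n + dmu2) / var_n
                        + (var_n - var_m + dmu2) / var_m))

mu_a, var_a = np.array([0.0, 1.0]), np.array([1.0, 2.0])
mu_b, var_b = np.array([0.5, 0.0]), np.array([2.0, 1.0])
d_ab = sym_kl_distance(mu_a, var_a, mu_b, var_b)
```

Because the two KL directions are summed, the measure is symmetric by construction, which makes it usable for bottom-up agglomerative clustering of the mixtures.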

It is important to emphasize here that the density used in equation (46) is derived from multi-condition speech models and therefore has a different distribution from the one used to generate the samples. The relationship between clean clusters and multi-condition clusters is shown in Figure 8. Clean clusters are obtained using the methods described above; the composition of these clusters is then used to build a corresponding multi-condition cluster set from multi-condition HMMs. A cluster $C_k$ in the clean set represents the statistics of a particular section of clean speech, and its multi-condition counterpart represents the statistics of the noisy version of the same speech section.

**Figure 8.** Clustering of multi-condition trained HMMs

Clean clusters are necessary to track clean speech because we need to generate samples from clean speech distributions. However, they are not the best choice for evaluating equation (46), because the observation is noisy and has a different distribution. The best candidate for computing equation (46) is the multi-condition cluster set, since it is constructed from multi-condition HMMs that match noisy speech more closely. A block diagram of the overall compensation and recognition process is shown in Figure 9. We infer the cluster to be used for observation vector $\mathbf{o}_t$ from the N-best transcripts and equation (46) combined. Samples at frame $t$ are then generated using the pdf of the chosen cluster, the weights of the samples are computed, and the compensated features are obtained using equation (36). Once the compensated features are available for the whole utterance, recognition is performed again using HMMs retrained on compensated features.
where μ_m(i) is the i-th element of the mean vector μ_m and σ_m^2(i) is the i-th diagonal element of the covariance matrix Σ_m. The parameters of the single Gaussian representing cluster k, g_k(x) = N(x; μ_k, Σ_k), are computed as follows:

$$\mu\_k(\mathbf{i}) = \frac{1}{M\_k} \sum\_{m=1}^{M\_k} E(\mathbf{x}\_m^{(k)}(\mathbf{i})) = \frac{1}{M\_k} \sum\_{m=1}^{M\_k} \mu\_m^{(k)}(\mathbf{i}) \tag{43}$$

$$\begin{split} \sigma\_k^2(i) &= \frac{1}{M\_k} \sum\_{m=1}^{M\_k} E\left\{\left(x\_m^{(k)}(i) - \mu\_k(i)\right)^2\right\} \\ &= \frac{1}{M\_k}\left[\sum\_{m=1}^{M\_k} \sigma\_m^{2(k)}(i) + \sum\_{m=1}^{M\_k} \mu\_m^{(k)2}(i) - M\_k \mu\_k^2(i)\right] \end{split} \tag{44}$$
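Equations (43) and (44) can be checked with a short sketch, assuming (as in the text) equally weighted, diagonal-covariance mixture components:

```python
import numpy as np

def merge_gaussians(mus, vars_):
    """Single Gaussian for a cluster of M_k equally weighted diagonal
    Gaussians, following eqs. (43) and (44)."""
    mus = np.asarray(mus, float)
    vars_ = np.asarray(vars_, float)
    M_k = len(mus)
    mu_k = mus.mean(axis=0)                                  # eq. (43)
    var_k = (vars_.sum(axis=0) + (mus ** 2).sum(axis=0)
             - M_k * mu_k ** 2) / M_k                        # eq. (44)
    return mu_k, var_k

# Two 1-D components at 0 and 2, each with unit variance: the merged
# Gaussian has mean 1 and variance 1 + 1 (spread of the means).
mu_k, var_k = merge_gaussians([[0.0], [2.0]], [[1.0], [1.0]])
print(mu_k, var_k)  # -> [1.] [2.]
```

The second moment of the mixture, (1/M_k)·Σ(σ_m² + μ_m²), minus μ_k² gives exactly the bracketed form of (44).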

Alternatively, we can group the components at the state level using the following distance measure [16]:

$$d(m,n) = -\frac{1}{S} \sum\_{s=1}^{S} \frac{1}{P} \sum\_{p=1}^{P} \left\{\log[b\_{ms}(\mu\_{nsp})] + \log[b\_{ns}(\mu\_{msp})]\right\} \tag{45}$$

where S is the total number of states in the cluster, P is the number of mixtures per state and b(.) is the observation probability. This method makes it easy to track the state-level composition of each cluster. In both cases, the clustering algorithm proceeds as follows:

1. Create one cluster for each mixture.
2. While the number of clusters is greater than the desired number, find the pair m and n for which d(m,n) is minimum and merge them.
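A minimal sketch of this greedy merge loop, assuming diagonal-covariance components, the symmetric divergence of equation (42) as d(m,n), and member-count weighting when equations (43)-(44) are applied to already-merged clusters:

```python
import numpy as np

def _dist(mu_m, v_m, mu_n, v_n):
    # symmetric divergence of eq. (42) for diagonal Gaussians
    d2 = (mu_n - mu_m) ** 2
    return np.sum((v_m - v_n + d2) / v_n + (v_n - v_m + d2) / v_m)

def agglomerate(mus, vars_, n_clusters):
    """Bottom-up clustering: one cluster per mixture, then repeatedly
    merge the closest pair until `n_clusters` remain.  Each cluster keeps
    the list of its member mixtures so its composition can be tracked."""
    cl = [([i], np.asarray(m, float), np.asarray(v, float))
          for i, (m, v) in enumerate(zip(mus, vars_))]
    while len(cl) > n_clusters:
        a, b = min(((a, b) for a in range(len(cl))
                    for b in range(a + 1, len(cl))),
                   key=lambda p: _dist(cl[p[0]][1], cl[p[0]][2],
                                       cl[p[1]][1], cl[p[1]][2]))
        ids_a, mu_a, v_a = cl[a]
        ids_b, mu_b, v_b = cl.pop(b)
        wa, wb = len(ids_a), len(ids_b)
        # merge moments, weighting each side by its member count (eqs. 43-44)
        mu = (wa * mu_a + wb * mu_b) / (wa + wb)
        var = (wa * (v_a + mu_a ** 2) + wb * (v_b + mu_b ** 2)) / (wa + wb) - mu ** 2
        cl[a] = (ids_a + ids_b, mu, var)
    return cl

# Four 1-D mixtures in two tight groups collapse into the expected pairs.
clusters = agglomerate([[0.0], [0.1], [5.0], [5.1]], [[1.0]] * 4, 2)
print([c[0] for c in clusters])  # -> [[0, 1], [2, 3]]
```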


Once clustering is complete, it is important to pick the most suitable cluster for feature compensation at each frame. The particle samples are then generated from the representative density of the chosen cluster. Two methods can be explored. The first is to decide the cluster based on the N-best transcripts obtained from recognition using multi-condition trained models. Denote the states obtained from the N-best transcripts for the noisy speech feature vector at time t as s_t^1, s_t^2, ..., s_t^N. If state s_t^j is a member of cluster C_k, we increment c(C_k) by one, where c(C_k) is a count of how many states from the N-best list belong to cluster C_k. We choose the cluster with the largest count, argmax_k c(C_k), and generate samples from it. If more than one cluster satisfies this criterion, we merge their probability density functions. In the second method, we choose the cluster that maximizes the likelihood of the MFCC vector o_t at time t belonging to that cluster, as follows:

$$\tilde{C} = \operatorname\*{arg\,max}\_{k}\, g\_{k}^{mc}(o\_{t} \mid C\_{k}) \tag{46}$$

It is important to emphasize here that g^mc is derived from multi-condition speech models and has a different distribution from the one used to generate the samples. The relationship between clean clusters and multi-condition clusters is shown in Figure 1. Clean clusters are obtained using the methods described in Section 3. The composition information of these clusters is then used to build a corresponding multi-condition cluster set from multi-condition HMMs. A cluster C_k in the clean set represents statistical information of a particular section of clean speech. Its multi-condition counterpart represents statistics of the noisy version of the same speech section.

**Figure 8.** Clustering of multi-condition trained HMMs
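The first, count-based selection method above can be sketched as follows (the state names and the state-to-cluster map are hypothetical):

```python
from collections import Counter

def pick_cluster_by_nbest(nbest_states, state_to_cluster):
    """Vote for a cluster using the states that the N-best transcripts
    assign to the current frame: each state increments the count of the
    cluster it belongs to, and the top-scoring cluster(s) are returned."""
    counts = Counter(state_to_cluster[s] for s in nbest_states)
    top = max(counts.values())
    # several clusters may tie; the text merges their pdfs in that case
    return [c for c, n in counts.items() if n == top]

# hypothetical state->cluster map and N-best states for one frame
mapping = {"s1": 0, "s2": 0, "s3": 1}
print(pick_cluster_by_nbest(["s1", "s2", "s3"], mapping))  # -> [0]
```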



Clean clusters are necessary to track clean speech because we need to generate samples from clean speech distributions. However, they are not the best choice for evaluating equation (46) because the observation is noisy and has a different distribution. The best candidate for computing equation (46) is the multi-condition cluster set, constructed from multi-condition HMMs that match noisy speech more closely. A block diagram of the overall compensation and recognition process is shown in Figure 9. We infer the cluster to be used for observation vector o_t using the N-best transcripts and equation (46) combined. Samples at frame t are then generated using the pdf of the chosen cluster. The weights of the samples are computed using equation (46) and compensated features are obtained using equation (36). Once the compensated features are available for the whole utterance, recognition is performed again using retrained HMMs with compensated features.
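The per-frame loop can be sketched as below. Note the assumptions: the chapter weights samples via equation (46), whereas this sketch uses the generic bootstrap-filter weighting (a Gaussian likelihood of the noisy observation centred on each particle), and it takes equation (36) to be the usual weighted-mean estimate of the clean feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def compensate_frame(y_t, clean_mu, clean_var, obs_var, n_particles=100):
    """One frame of particle filter compensation (generic sketch):
    draw particles from the chosen clean cluster's Gaussian, weight each
    by how well it explains the noisy observation y_t, and return the
    weighted mean as the compensated feature."""
    particles = clean_mu + np.sqrt(clean_var) * rng.standard_normal(
        (n_particles, clean_mu.shape[0]))
    # log-weights under a placeholder Gaussian observation model
    logw = -0.5 * np.sum((y_t - particles) ** 2 / obs_var, axis=1)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ particles  # weighted-mean estimate of the clean feature

x_hat = compensate_frame(np.array([0.5, 0.5]), np.zeros(2), np.ones(2), np.ones(2))
print(x_hat.shape)  # -> (2,)
```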



**Figure 9.** Complete recognition process

## *3.2.1. Experiments*

To evaluate the proposed framework we experimented on the Aurora 2 connected digit task. We extracted features (39 elements with 13 MFCCs and their first and second time derivatives) from test speech as well as 23-channel filter-bank features, thereby forming two streams. The one-best transcript was obtained from the MFCC stream using the multi-condition trained HMMs. PFC is then applied to the filter-bank stream (stream two). We chose two clusters, one based on the 1-best transcript and the other selected with equation (46). The multi-condition clusters used in equation (46) were built from the 23-channel filter-bank features so that the test features from stream two can be used directly to evaluate the likelihood of the observations. For the results in these experiments, clusters were formed using method two, i.e., tracking the state-wise composition of each cluster. The numbers of clusters and particles were varied to evaluate the performance of the algorithm under different settings. From the compensated filter-bank features of stream two, we extracted 39-element MFCC features. Final recognition on these features was done using the retrained HMMs, i.e., HMMs trained on multi-condition training data compensated in the same fashion as described above.
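The conversion from 23-channel log filter-bank features to 39-element MFCCs can be sketched as below. The DCT-II step is standard; the delta computation here (simple time differences via `np.gradient`) is a stand-in for whatever regression window the experiments actually used:

```python
import numpy as np

def fbank_to_mfcc39(log_fbank, n_ceps=13):
    """log_fbank: (T, 23) log filter-bank energies -> (T, 39) features."""
    T, n_ch = log_fbank.shape
    # DCT-II basis, keeping the first n_ceps coefficients
    n = np.arange(n_ch)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_ch)
    ceps = log_fbank @ basis.T
    # first and second time differences as simple delta stand-ins
    delta = np.gradient(ceps, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([ceps, delta, delta2], axis=1)

feats = fbank_to_mfcc39(np.random.default_rng(1).random((10, 23)))
print(feats.shape)  # -> (10, 39)
```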



The results for a fixed number of particles (100) are shown in Table 2. The number of clusters was 20, 25 or 30. To set a specific number of clusters, HMM states were combined and clustering was stopped when the specified number was reached. All HMMs had 18 states, with each state represented by a mixture of 3 Gaussians. For the 11-digit vocabulary, we have a total of approximately 180 states. With 20 clusters, for example, this is a 9-to-1 reduction in the number of information blocks to choose from when plugging into the PF scheme.




| Word Accy | 20 Clust. | 25 Clust. | 30 Clust. | MC Trained | Clean Trained |
|---|---|---|---|---|---|
| clean | 99.11 | 99.11 | 99.11 | 98.50 | 99.11 |
| 20 dB | 97.76 | 98.00 | 97.93 | 97.66 | 97.21 |
| 15 dB | 97.00 | 97.14 | 96.69 | 96.80 | 92.36 |
| 10 dB | 95.21 | 95.41 | 93.88 | 95.32 | 75.14 |
| 5 dB | 89.48 | 89.59 | 87.08 | 89.14 | 42.42 |
| 0 dB | 70.16 | 70.38 | 68.84 | 64.75 | 22.57 |
| -5 dB | 36.30 | 36.63 | 36.94 | 27.47 | NA |
| 0-20 dB | 89.92 | 90.10 | 88.88 | 88.73 | 65.94 |


**Table 2.** Variable number of clusters (100 particles)


It is interesting to note that the best results were obtained for 25 clusters; increasing the number of clusters beyond 25 did not improve the accuracy. The larger the number of clusters, the more specific the speech statistics each cluster contains. Having more specific information in each cluster helps compensation and recognition because the particles can be placed more accurately. However, with a large number of clusters to choose from, it is harder to pick the correct cluster for particle generation. More errors were made in the cluster selection process, degrading the overall performance.

This is further illustrated in Figure 10. If the correct cluster is known, having a large number of clusters, and consequently more specific information per cluster, only improves the performance. The results are for 20, 25 and 30 clusters. In the known-cluster case, one cluster is obtained using equation (46) and the second cluster is the correct one, meaning the one that contains the state to which the observation actually belongs (obtained by performing recognition on the clean version of the noisy utterance using clean HMMs). For the unknown-cluster case, the clusters are obtained using equation (46) and the 1-best transcript. It can readily be observed from the known-cluster case that if the choice of cluster is always correct, the recognition performance improves drastically: the error rate was reduced by 54%, 59% and 61.4% for 20, 25 and 30 clusters, respectively. Moreover, the improvement faithfully follows the number of clusters used. This was corroborated by the fact that if the cluster is specific down to the HMM state level, i.e., the exact HMM state sequence is assumed known and each state is a separate cluster (approximately 180 clusters in total), the error rate was reduced by as much as 67% [10].

For the results in Table 3, we fixed the number of clusters and varied the number of particles. As we increased the number of particles, the accuracy of the algorithm improved for sets A and B combined, i.e., for additive noise. The error reduction is 17% over MC-trained models. Using a large number of particles means more samples are used to construct the predicted density of the underlying clean speech features, which is therefore better approximated. Thus, a gradual improvement in the recognition results was observed as the number of particles increased. In the case of Set C, however, the performance was worse when more particles were used, because the underlying distribution is different due to distortions other than additive noise.
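The 17% figure can be reproduced from the accuracies in Table 3 (1000-particle row and the MC-trained baseline, sets A and B averaged):

```python
def rel_err_reduction(acc_new, acc_base):
    """Relative word-error-rate reduction, accuracies in percent."""
    return 100 * ((100 - acc_base) - (100 - acc_new)) / (100 - acc_base)

pfc_ab = (90.02 + 91.13) / 2   # sets A and B, 1000 particles
mc_ab = (88.41 + 88.82) / 2    # multi-condition baseline
print(round(rel_err_reduction(pfc_ab, mc_ab), 1))  # -> 17.2
```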


|  | Set A | Set B | Set C | Average |
|---|---|---|---|---|
| 100 particles | 90.02 | 91.03 | 89.26 | 90.1 |
| 500 particles | 90.03 | 91.10 | 89.07 | 90.07 |
| 1000 particles | 90.02 | 91.13 | 89.07 | 90.07 |
| MC Trained | 88.41 | 88.82 | 88.97 | 88.73 |
| Clean Trained | 64.00 | 67.46 | 65.39 | 65.73 |

**Table 3.** Variable number of particles (25 clusters)

**Figure 10.** Accuracy when correct cluster known vs. unknown

## **4. Conclusions**

In this chapter, we proposed a particle filter compensation approach to robust speech recognition, and showed that a tight coupling and sharing of information between HMMs and particle filters has a strong potential to improve recognition performance in adverse environments. We note that an accurate alignment is needed between the state and mixture sequences used for compensation with particle filters and the actual HMM state sequences that describe the underlying clean speech features. Although we observed improved performance with the current particle filter compensation implementation, there is still a considerable performance gap between the oracle setup with correct side information and what is achievable in this study with the missing side information estimated from noisy speech. We further developed a scheme to merge statistically similar information in HMM states that enables us to find the right section of the HMMs to dynamically plug into the particle filter algorithm. Results show that if we use information from HMMs that match specifically well with the section of speech being compensated, significant error reduction is possible compared to multi-condition HMMs.

## **Author details**

Aleem Mushtaq

*School of ECE, Georgia Institute of Technology, Atlanta, USA*

## **5. References**

[1] C.-H. Lee and Q. Huo, "On adaptive decision rules and decision parameter adaptation for automatic speech recognition," Proc. IEEE, vol. 88, pp. 1241-1269, 2000.

[2] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.

[3] A. Sankar and C.-H. Lee, "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 190-202, May 1996.

[4] B. Raj, R. Singh, and R. Stern, "On tracking noise with linear dynamical system models," Proc. ICASSP, 2004.

[5] M. Fujimoto and S. Nakamura, "Particle filter based non-stationary noise tracking for robust speech recognition," Proc. ICASSP, 2005.

[6] M. Fujimoto and S. Nakamura, "Sequential non-stationary noise tracking using particle filtering with switching dynamical system," Proc. ICASSP, 2006.

[7] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Processing, 2002.

[8] R. G. Brown and P. Y. C. Hwang, Introduction to Random Signals and Applied Kalman Filtering, 3rd edition, Prentice Hall, 1996.

[9] S. Haykin, Adaptive Filter Theory, 4th edition, Prentice Hall, 2009.

[10] A. Mushtaq, Y. Tsao, and C.-H. Lee, "A particle filter compensation approach to robust speech recognition," Proc. Interspeech, 2009.

[11] A. Mushtaq and C.-H. Lee, "An integrated approach to feature compensation combining particle filters and hidden Markov model for robust speech recognition," Proc. ICASSP, 2012.

[12] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing, Pearson Education, 2007.

[13] A. Doucet and A. M. Johansen, "A tutorial on particle filtering and smoothing: fifteen years later," Tech. Rep., 2008. [Online]. Available: http://www.cs.ubc.ca/~arnaud/doucet_johansen_tutorialPF.pdf

[14] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, "HMM adaptation using vector Taylor series for noisy speech recognition," Proc. ICSLP, pp. 869-872, 2002.

[15] T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, "Speech recognition using tree-structured probability density function," Proc. Int. Conf. Speech Language Processing '94, 1994, pp. 223-226.

[16] S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Human Language Technology Workshop, pp. 307-312, 1994.

**Chapter 4** 

**Robust Distributed Speech Recognition Using Auditory Modelling**

Ronan Flynn and Edward Jones

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/49954

© 2012 Flynn and Jones, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**1. Introduction**

The use of the Internet for accessing information has expanded dramatically over the past few years, while the availability and use of mobile hand-held devices for communication and Internet access has greatly increased in parallel. Industry has reacted to this trend for information access by developing services and applications that can be accessed by users on the move. These trends have highlighted a need for alternatives to the traditional methods of user data input, such as keypad entry, which is difficult on small form-factor mobile devices. One alternative is to make use of automatic speech recognition (ASR) systems that act on speech input from the user. An ASR system has two main elements. The first element is a front-end processor that extracts parameters, or features, that represent the speech signal. These features are processed by a back-end classifier, which makes the decision as to what has been spoken.

In a fully embedded ASR system [1], the feature extraction and the speech classification are carried out on the mobile device. However, due to the computational complexity of high-performance speech recognition systems, such an embedded architecture can be impractical on mobile hand-held terminals due to limitations in processing and memory resources. On the other hand, fully centralised (server-based) ASR systems have fewer computational constraints, can be used to share the computational burden between mobile users, and can also allow for the easy upgrade of speech recognition technologies and services that are provided. However, in a centralised ASR system the recognition accuracy can be compromised as a result of the speech signal being distorted by low bit-rate encoding at the codec and a poor quality transmission channel [2, 3].

A distributed speech recognition (DSR) system is designed to overcome some of the difficulties described above. In DSR, the terminal (the mobile device) includes a local front-end processor that extracts, directly from the speech, the features to be sent to the remote
