**Meet the editor**

Dr. S. Ramakrishnan is a Professor and the Head of the Department of Information Technology, Dr. Mahalingam College of Engineering and Technology, Pollachi, India. He is a reviewer for 14 international journals, including IEEE Transactions on Image Processing and journals from IET, ACM and Elsevier Science, and serves on the editorial boards of five international journals. He is also a Guest Editor-in-Chief for special issues of three international journals, including the Telecommunication Systems Journal. Dr. Ramakrishnan has published 77 papers, along with a book for LAP, Germany. He was the convener of the IT Board of Studies (BoS) at Anna University of Technology, Coimbatore, and his biography was included in the 2011 edition of Marquis Who's Who in the World.

Contents

**Preface IX**

**Section 1 Speech Recognition 1**

Chapter 1 **Robust Speech Recognition for Adverse Environments 3**
Chung-Hsien Wu and Chao-Hong Liu

Chapter 2 **Speech Recognition for Agglutinative Languages 37**
R. Thangarajan

Chapter 3 **A Particle Filter Compensation Approach to Robust Speech Recognition 57**
Aleem Mushtaq

Chapter 4 **Robust Distributed Speech Recognition Using Auditory Modelling 79**
Ronan Flynn and Edward Jones

Chapter 5 **Improvement Techniques for Automatic Speech Recognition 105**
Santiago Omar Caballero Morales

Chapter 6 **Linear Feature Transformations in Slovak Phoneme-Based Continuous Speech Recognition 131**
Jozef Juhár and Peter Viszlay

Chapter 7 **Dereverberation Based on Spectral Subtraction by Multi-Channel LMS Algorithm for Hands-Free Speech Recognition 155**
Longbiao Wang, Kyohei Odani, Atsuhiko Kai, Norihide Kitaoka and Seiichi Nakagawa

**Section 2 Speech Enhancement 175**

Chapter 8 **Improvement on Sound Quality of the Body Conducted Speech from Optical Fiber Bragg Grating Microphone 177**
Masashi Nakayama, Shunsuke Ishimitsu and Seiji Nakagawa

**Section 3 Speech Modelling 255**

## Preface

Speech processing has become a vital area of research in engineering due to the development of human-computer interaction (HCI). Among the various modes available in HCI, speech is the most convenient and natural mode through which users can interact with machines, including computers. Hence, many scholars have been carrying out research on automatic speech recognition (ASR) to suit modern HCI techniques, which turns out to be a challenging task. Naturally spoken words cannot be easily recognized without pre-processing because of disturbances such as environmental noise, instrument noise, inter- and intra-speaker variations, etc.

This book focuses on speech recognition and its associated tasks, namely speech enhancement and modelling. It comprises thirteen chapters and is divided into three sections, one each for speech recognition, enhancement and modelling. Section 1 on speech recognition consists of seven chapters, and sections 2 and 3 on speech enhancement and speech modelling have three chapters each to supplement section 1.

The first chapter, by Chung-Hsien Wu and Chao-Hong Liu, provides techniques for speech recognition in adverse environments, such as noisy, disfluent and multilingual environments, using the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM). The authors have carried out extensive experiments using the English Across Taiwan (EAT) project database.

In the second chapter, R. Thangarajan presents two automatic speech recognition approaches for Tamil, an agglutinative language. In the first approach, he uses a bigram/trigram morpheme-based language model to reduce the vocabulary size and to predict word strings. The second approach leverages the syllabic structure of the word and builds a syllable-based acoustic model. In addition, he presents the basics of speech units and of Tamil as an agglutinative language.

In chapter 3, Aleem Mushtaq presents an interesting particle filter compensation approach to robust speech recognition. The author shows, both mathematically and experimentally, that tight coupling and sharing of information between Hidden Markov Models (HMMs) and particle filters has a strong potential to improve recognition performance in adverse environments.


In chapter 4, Ronan Flynn and Edward Jones address speech recognition in a distributed framework, considering background noise and packet loss, using an auditory model as an alternative to the commonly used Mel-frequency cepstral coefficients. They present several speech enhancement techniques and report extensive experimental results in support of their claims.


Chapter 5 by Santiago Omar Caballero Morales focuses on various techniques for improving speech recognition accuracy. Techniques such as meta-models, a genetic algorithm-based method and non-negative matrix factorization are suggested by the author. All these techniques are experimentally tested and validated.

In chapter 6, Jozef Juhár and Peter Viszlay introduce linear feature transformations for speech recognition in Slovak. The authors use three popular dimensionality reduction techniques, namely Linear Discriminant Analysis (LDA), Two-dimensional LDA (2DLDA) and Principal Component Analysis (PCA). The chapter is well balanced in its mathematical, theoretical and experimental treatment. The authors conclude by clearly stating which linear transformation best suits each type of feature.

Chapter 7 by Longbiao Wang, Kyohei Odani, Atsuhiko Kai, Norihide Kitaoka and Seiichi Nakagawa elaborates on speech recognition in distant-talking environments. The chapter outlines blind dereverberation, and the authors use the LMS algorithm and its variants to estimate the power spectrum of the impulse response. They conducted hands-free speech recognition experiments in both simulated and real reverberant environments and present the results.

Chapter 8 by Masashi Nakayama, Shunsuke Ishimitsu and Seiji Nakagawa focuses on enhancement of body-conducted speech (BCS). The authors use a BCS microphone with an optical fiber Bragg grating (OFBG microphone) to improve speech quality. Experimental results for body-conducted speech enhanced with the proposed method, captured with an accelerometer and an OFBG microphone under noisy environments, are presented in this chapter.

In chapter 9, Alfredo Victor Mantilla Caeiros and Hector Manuel Pérez Meana present techniques for esophageal speech enhancement. People who suffer from throat cancer require rehabilitation in order to recover their voice. To this end, the authors propose an esophageal speech enhancement technique based on wavelet transforms and artificial neural networks.

In chapter 10, Komal Arora discusses the impact of different cochlear implant stimulation rates on psychophysical abilities and speech perception performance. The author presents details about cochlear implants and the advanced combination encoder (ACE), along with a number of interesting studies using multi-channel stimuli to measure modulation detection.

Chapter 11 by Ján Staš, Daniel Hládek and Jozef Juhár focuses on the development of language models using grammatical features for the Slovak language. The authors give a detailed presentation of class-based language models that exploit grammatical features to deal with a sparse training corpus, and support it with extensive experimental studies.


In chapter 12, Dia AbuZeina, Husni Al-Muhtaseb and Moustafa Elshafei address pronunciation variation modeling using part-of-speech tagging for Arabic speech recognition, employing language models. The proposed method was investigated on a speaker-independent modern standard Arabic speech recognition system built with the Carnegie Mellon University Sphinx speech recognition engine. The authors conclude that the proposed knowledge-based approach to modeling cross-word pronunciation variations improves recognition performance.

The final chapter by Nelson Neto, Pedro Batista and Aldebaro Klautau highlights the use of the Internet as a collaborative network to develop speech science and technology for a language such as Brazilian Portuguese (BP). The authors present the required background on speech recognition and synthesis. They name the proposed framework VOICECONET, a comprehensive web-based platform built on open-source tools such as HTK, Julius, MARY, etc.

I would like to express my sincere thanks to all the authors for their contributions and efforts in bringing out this wonderful book. My gratitude and appreciation go to InTech, in particular Ms. Ana Nikolic and Mr. Dimitri Jelovcan, who drew together the authors to publish this book. I would also like to express my heartfelt thanks to the Management, Secretary, Director and Principal of my Institute.

> **S. Ramakrishnan**
> Professor and Head
> Department of Information Technology
> Dr. Mahalingam College of Engineering and Technology
> India

**Section 1** 

**Speech Recognition** 


**Chapter 1** 

**Robust Speech Recognition for Adverse Environments**

Chung-Hsien Wu and Chao-Hong Liu

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/47843

© 2012 Wu and Liu, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**1. Introduction**

Although state-of-the-art speech recognizers can achieve very high recognition rates for clean speech, recognition performance generally degrades drastically in noisy environments. Noise-robust speech recognition has therefore become an important task for speech recognition in adverse environments. Recent research on noise-robust speech recognition has mostly focused on two directions: (1) removing the noise from the corrupted noisy signal in signal space or feature space, either by noise filtering, such as spectral subtraction (Boll 1979), Wiener filtering (Macho et al. 2002) and RASTA filtering (Hermansky et al. 1994), or by speech or feature enhancement using model-based approaches such as SPLICE (Deng et al. 2003) and stochastic vector mapping (Wu et al. 2002); (2) compensating for the noise effect in the acoustic models in model space so that the training environment matches the test environment, such as PMC (Wu et al. 2004) or multi-condition/multi-style training (Deng et al. 2000). The noise filtering approaches require some prior information, such as the spectral characteristics of the noise, and their performance degrades when the noisy environment varies drastically or is unknown. Furthermore, (Deng et al. 2000; Deng et al. 2003) have shown that the use of denoising or preprocessing is superior to retraining the recognizers under matched noise conditions with no preprocessing.
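To make the noise-filtering direction concrete, the following is a minimal Python/NumPy sketch of magnitude spectral subtraction in the spirit of Boll (1979). The function name, the framing parameters, the over-subtraction factor, and the assumption that the first few frames of the recording are noise-only are our own illustrative choices, not taken from this chapter.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, frame_len=256, hop=128,
                         over_sub=2.0, floor=0.02):
    """Illustrative magnitude spectral subtraction (a sketch, not the chapter's method).

    Assumes the first `noise_frames` frames of `noisy` are noise-only and
    uses them to estimate the noise magnitude spectrum.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop

    # Short-time analysis: overlapping, windowed frames -> magnitude and phase.
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Noise spectrum estimated from the assumed noise-only leading frames.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Over-subtract the noise estimate and apply a spectral floor so that
    # magnitudes never go negative (the floor limits "musical noise").
    clean_mag = np.maximum(mag - over_sub * noise_mag, floor * noise_mag)

    # Overlap-add resynthesis, reusing the noisy phase.
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i, frame in enumerate(clean):
        out[i * hop:i * hop + frame_len] += frame * window
    return out
```

The spectral floor keeps subtracted magnitudes from going negative; the residual "musical noise" such a simple scheme leaves behind is one reason the model-based and model-space approaches mentioned above were developed.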

Stochastic vector mapping (SVM) (Deng et al. 2003; Wu et al. 2002) and sequential noise estimation (Benveniste et al. 1990; Deng et al. 2003; Gales et al. 1996) for noise normalization have been proposed and have achieved significant improvements in noisy speech recognition. However, there still exist some drawbacks and limitations. First, the performance of sequential noise estimation decreases when the noisy environment varies drastically. Second, the environment mismatch between training data and test data still exists and results in performance degradation. Third, the maximum-likelihood-based stochastic vector

