**Meet the editor**

Dr S. Ramakrishnan has 11 years of teaching experience and one year of industry experience. He is a Professor and the Head of the Department of Information Technology at the Dr Mahalingam College of Engineering and Technology, Pollachi, India. Dr Ramakrishnan is a reviewer for 14 international journals, including IEEE Transactions on Image Processing, IET Journals (formerly IEE), ACM Computing Reviews, and Elsevier Science. He is on the editorial board of four international journals and is a guest editor of special issues in three international journals. He has published 54 papers in international and national journals and conference proceedings. Dr Ramakrishnan has published a book with LAP, Germany. His areas of research include digital image processing, soft computing, human-computer interaction and speech processing.

### Contents

**Preface**

S. Ramakrishnan

Chapter 1 **A Real-Time Speech Enhancement Front-End for Multi-Talker Reverberated Scenarios**

Rudy Rotili, Emanuele Principi, Stefano Squartini and Francesco Piazza

Chapter 2 **Real-Time Dual-Microphone Speech Enhancement**

Trabelsi Abdelaziz, Boyer François-Raymond and Savaria Yvon

Chapter 3 **Mathematical Modeling of Speech Production and Its Application to Noise Cancellation**

N.R. Raajan, T.R. Sivaramakrishnan and Y. Venkatramani

Chapter 4 **Multi-Resolution Spectral Analysis of Vowels in Tunisian Context**

Nefissa Annabi-Elkadri, Atef Hamouda and Khaled Bsaies

Chapter 5 **Voice Conversion**

Jani Nurminen, Hanna Silén, Victor Popa, Elina Helander and Moncef Gabbouj

Chapter 6 **Automatic Visual Speech Recognition**

Alin Chiţu and Léon J.M. Rothkrantz

Chapter 7 **Recognition of Emotion from Speech: A Review**

S. Ramakrishnan


### Preface

Speech processing is the process by which speech signals are interpreted, understood, and acted upon. Interpretation and production of coherent speech are both important in the processing of speech. It is done by automated systems such as voice recognition software or voice-to-text programs. Speech processing includes speech recognition, speaker recognition, speech coding, voice analysis, speech synthesis and speech enhancement.

Speech recognition is one of the most important aspects of speech processing because the overall aim of processing speech is to comprehend the speech and act on its linguistic part. One commonly used application of speech recognition is simple speech-to-text conversion, which is used in many word processing programs. Speaker recognition, another element of speech processing, is also highly important. While speech recognition refers specifically to understanding what is said, speaker recognition is only concerned with who is speaking. It validates a user's claimed identity using characteristics extracted from their voice. Validating the identity of the speaker can be an important security feature to prevent unauthorized access to or use of a computer system. Another component of speech processing is voice recognition, which is essentially a combination of speech and speaker recognition. Voice recognition occurs when speech recognition programs process the speech of a known speaker; such programs can generally interpret the speech of a known speaker with much greater accuracy than that of a random speaker. Another topic of study in the area of speech processing is voice analysis. Voice analysis differs from other topics in speech processing because it is not concerned with the linguistic content of speech. It is primarily concerned with speech patterns and sounds. Voice analysis could be used to diagnose problems with the vocal cords or other organs related to speech by noting sounds that are indicative of disease or damage. Sound and stress patterns could also be used to determine if an individual is telling the truth, though this use of voice analysis is highly controversial.

This book comprises seven chapters written by leading scientists from around the globe. It will be useful to researchers, graduate students and practicing engineers.

In Chapter 1 the authors Rudy Rotili, Emanuele Principi, Stefano Squartini and Francesco Piazza present a real-time speech enhancement front-end for multi-talker reverberated scenarios. The focus of this chapter is on the speech enhancement stage of the speech processing unit and in particular on the set of algorithms constituting the front-end of the automatic speech recognition (ASR) system. The acquired voices of the users are more or less susceptible to the presence of noise. Several solutions are available to alleviate these problems; two popular techniques among them are blind source separation (BSS) and speech dereverberation. A two-stage approach leading to sequential source separation and speech dereverberation based on blind channel identification (BCI) is proposed by the authors. This is accomplished by converting the multiple-input multiple-output (MIMO) system into several single-input multiple-output (SIMO) systems free of any interference from the other sources. The major drawback of such an implementation is that the BCI stage needs to know "who speaks when" in order to estimate the impulse response related to the right speaker. To overcome this problem, the chapter proposes a solution which exploits a speaker diarization system. Speaker diarization steers the BCI and the ASR, thus allowing the identification task to be accomplished directly on the microphone mixture. The ASR system was successfully enhanced by an advanced multi-channel front-end to recognize the speech content coming from multiple speakers in reverberated acoustic conditions. The overall architecture is able to blindly identify the impulse responses, to separate the existing multiple overlapping sources, to dereverberate them and to recognize the information contained within the original speech.


Chapter 2 on real-time dual-microphone speech enhancement was written by Trabelsi Abdelaziz, Boyer François-Raymond and Savaria Yvon. Single-microphone speech enhancement approaches often fail to yield satisfactory performance, in particular when the interfering noise statistics are time-varying. In contrast, multiple-microphone systems provide superior performance over single-microphone schemes at the expense of a substantial increase in implementation complexity and computational cost. This chapter addresses the problem of enhancing a speech signal corrupted with additive noise when observations from two microphones are available. The main advantage of using two microphones is the spatial discrimination of the array, which helps separate speech from noise. This spatial information is exploited in the development of a dual-microphone beamforming algorithm that assumes a spatially uncorrelated noise field. A cross-power spectral density (CPSD) noise reduction approach was used initially. In this chapter the authors propose a modified CPSD approach (MCPSD). Based on minimum statistics, the noise power spectrum estimator seeks to provide a good tradeoff between the amount of noise reduction and the speech distortion, while attenuating the high-energy correlated noise components, especially in the low frequency ranges. The best noise reduction was obtained in the case of multi-talker babble noise.

In Chapter 3 the authors N.R. Raajan, T.R. Sivaramakrishnan and Y. Venkatramani introduce the mathematical modeling of speech production to remove noise from the speech signal. Speech is produced by the human vocal apparatus. Cancellation of noise is an important aspect of speech production. In order to reduce the noise level, an active noise cancellation technique is proposed by the authors. A mathematical model of the vocal fold is introduced as part of a new approach for noise cancellation. The mathematical model of the vocal fold recognizes only the voice and does not create a signal opposite to the noise. It feeds only the vocal output and not the noise, since it uses the shape and characteristics of speech. The chapter also presents the representation of the shape and characteristics of speech using an acoustic tube model.


Chapter 4 by Nefissa Annabi-Elkadri, Atef Hamouda and Khaled Bsaies deals with the concept of multi-resolution spectral analysis (MRS) of vowels in Tunisian words and in French words under the Tunisian context. The suggested method is composed of two parts. The first part applies the MRS method to the signal; MRS is calculated by combining several FFTs of different lengths. The second part is formant detection by applying multi-resolution linear predictive coding (LPC). The authors use a linear prediction method for analysis. Linear prediction models the signal as if it were generated by a signal of minimum energy being passed through a purely recursive IIR filter. Multi-resolution LPC (MR LPC) is calculated as the LPC of the average of the convolution of several windows with the signal. The authors observe that the Tunisian speakers pronounce vowels in the same way for both the French language and Tunisian dialects. The results obtained by the authors show that, due to the influence of the French language on the Tunisian dialect, the vowels are, in some contexts, similarly pronounced.

In Chapter 5 the authors Jani Nurminen, Hanna Silén, Victor Popa, Elina Helander and Moncef Gabbouj focus on voice conversion (VC). This is an area of speech processing in which the speech signal uttered by a source speaker is modified to sound as if it were spoken by a target speaker. According to the authors, it is essential to determine the factors in a speech signal that the speaker's identity relies upon. In this chapter a training phase is employed to convert the source features to target features: a conversion function is estimated between the source and target features. Voice conversion is of two types depending on the training data, which can be either parallel or non-parallel. The extreme case of speaker-independent voice conversion is cross-lingual conversion, in which the source and target speakers speak different languages. Numerous VC approaches are proposed and surveyed in this chapter. The VC techniques are categorized into two groups: methods used for stand-alone voice conversion and adaptation techniques used in HMM-based speech synthesis. In stand-alone voice conversion, there are two approaches according to the authors: Gaussian mixture model-based conversion and codebook-based methods. A number of algorithms used in codebook-based methods to change the characteristics of the voice signal appropriately are surveyed. Speaker adaptation techniques help to change the voice characteristics of the signal to match the targeted speech signal. More realistic mimicking of human speech production is also briefly discussed in this chapter using various approaches.


Chapter 6 by Alin Chiţu and Léon J.M. Rothkrantz deals with visual speech recognition. Extensive lip reading research was primarily done in order to improve the teaching methodology for hearing-impaired people and to increase their chances of integration in society. Lip reading is part of our multi-sensory speech perception process and is also referred to as visual speech recognition. Lip reading is an artificial form of communication and a neural mechanism, one that enables humans to achieve high literacy skills with relative ease. In this chapter the authors employ active appearance models (AAM), which combine active shape models with texture-based information to accurately detect the shape of the mouth or the face. According to the authors, the teeth, tongue and cavity are of great importance to lip reading by humans. The authors identify four major areas of the speaker's attention during communication, including the mouth, the eyes and the centre of the face, depending on the task and the noise level.

The last chapter, by S. Ramakrishnan, provides a comprehensive review of speech emotion recognition (SER). Speech emotions are an important constituent of human-computer interaction. Several recent surveys are devoted to the analysis and synthesis of speech emotions from the point of view of pattern recognition and machine learning as well as psychology. The main problem in speech emotion recognition is how reliable the correct classification rate achieved by a classifier is. In this chapter the author focuses on (1) the framework and databases used for SER; (2) acoustic characteristics of typical emotions; (3) various acoustic features and classifiers employed for recognition of emotions from speech; and (4) applications of emotion recognition.

I would like to express my sincere thanks to all contributing authors, for their effort in bringing their insights on current open questions in speech processing research. I offer my deepest appreciation and gratitude to the Intech Publishers who gathered the authors and published this book. I would like to express my deepest gratitude to The Management, Secretary, Director and Principal of my Institute.

> **S. Ramakrishnan** Professor and Head Department of Electronics and Communication Engineering Dr Mahalingam College of Engineering and Technology India

## **A Real-Time Speech Enhancement Front-End for Multi-Talker Reverberated Scenarios**

Rudy Rotili, Emanuele Principi, Stefano Squartini and Francesco Piazza *Università Politecnica delle Marche Italy*

#### **1. Introduction**

In direct human interaction, the verbal and nonverbal communication modes play a fundamental role by jointly cooperating in assigning semantic and pragmatic contents to the conveyed message and by manipulating and interpreting the participants' cognitive and emotional states from the interactional contextual instance. In order to understand, model, analyse, and automatize such behaviours, converging competences from social and cognitive psychology, linguistics, philosophy, and computer science are needed.

The exchange of information (more or less conscious) that takes place during interactions builds up new knowledge that often needs to be recalled in order to be re-used, but sometimes it also needs to be appropriately supported as it occurs. Currently, international scientific research is strongly committed to the realization of intelligent instruments able to recognize, process and store relevant interactional signals: the goal is not only to allow efficient use of the data retrospectively but also to assist and dynamically optimize the experience of interaction itself while it is taking place. To this end, both verbal and nonverbal (gestures, facial expressions, gaze, etc.) communication modes can be exploited. Nevertheless, voice is still a popular choice due to the informative content it carries: words, emotions and dominance can all be detected by means of different kinds of speech processing techniques. Examples of projects exploiting this idea are CHIL (Waibel et al. (2004)), AMI-AMIDA (Renals (2005)) and CALO (Tur et al. (2010)).

The application scenario taken here as reference is a professional meeting, where the system can readily assist the participants and where the participants themselves do not have particular expectations about the forms of support provided by the system. In this scenario, it is assumed that people are sitting around a table, and the system supports and enriches the conversation experience by projecting graphical information and keywords on a screen.

A complete architecture of such a system has been proposed and validated in (Principi et al. (2009); Rocchi et al. (2009)). It consists of three logical layers: Perception, Interpretation and Presentation. The Perception layer aims to achieve situational awareness in the workplace and is composed of two essential elements: the Presence Detector and the Speech Processing Unit. The first determines the operating states of the system: presence (the system checks if there are people around the table) and conversation (the system senses that a conversation is ongoing). The Speech Processing Unit processes the captured audio signals and identifies the keywords that are exploited by the system in order to decide which stimuli to project. It consists of two main components: the multi-channel front-end (speech enhancement) and the automatic speech recognizer (ASR).

The Interpretation module is responsible for the recognition of the ongoing conversation. At this level, semantic representation techniques are adopted in order to structure both the content of the conversation and how the discussion is linked to the speakers present around the table. Closely related to this module is the Presentation module which, based on the conversational analysis just made, dynamically decides which stimuli have to be proposed and sent. The stimuli are classified in terms of conversation topics and, on the basis of their recognition, they are selected and projected on the table.

The focus of this chapter is on the speech enhancement stage of the Speech Processing Unit and in particular on the set of algorithms constituting the front-end of the ASR. In a typical meeting scenario, participants' voices can be acquired through different types of microphones. Depending on the choice made, the microphone signals are more or less susceptible to the presence of noise, the interference from other co-existing sources and the reverberation produced by multiple acoustic paths. The usage of close-talking microphones can mitigate the aforementioned problems, but they are invasive and the meeting participants can feel uncomfortable in such a situation. A less invasive and more flexible solution is the choice of far-field microphone arrays. In this situation, the extraction of a desired speech signal can be a difficult task since noise, interference and reverberation are more relevant.

In the literature, several solutions have been proposed in order to alleviate these problems (Naylor & Gaubitch (2010); Woelfel & McDonough (2009)). Here, the attention is on two popular techniques among them, namely blind source separation (BSS) and speech dereverberation. In (Huang et al. (2005)), a two-stage approach leading to sequential source separation and speech dereverberation based on blind channel identification (BCI) is proposed. This can be accomplished by converting the multiple-input multiple-output (MIMO) system into several single-input multiple-output (SIMO) systems free of any interference from the other sources. Since each SIMO system is blindly identified at a different time, the BSS algorithm does not suffer from the annoying permutation ambiguity problem. Finally, if the room impulse responses (RIRs) of the obtained SIMO systems do not share common zeros, dereverberation can be performed by using the Multiple-Input/Output Inverse Theorem (MINT) (Miyoshi & Kaneda (1988)).

A real-time implementation of this approach has been presented in (Rotili et al. (2010)), where the optimum inverse filtering approach is substituted by an iterative technique, which is computationally more efficient and allows the inversion of long RIRs in real-time applications (Rotili et al. (2008)). Iterative inversion is based on the well-known steepest-descent algorithm, where a regularization parameter, taking into account the presence of disturbances, makes the dereverberation more robust to RIR fluctuations or estimation errors due to the BCI algorithm (Hikichi et al. (2007)).
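To make the idea of regularized iterative inversion concrete, the following is a minimal sketch in Python with NumPy. It is not the implementation of the cited works: it assumes a toy two-channel SIMO system with random impulse responses, a simple cost of the form ‖Hg − d‖² + δ‖g‖², and a fixed step size chosen from the spectral norm of H. The names `conv_matrix`, `iterative_inverse`, `delta` and `n_iter` are hypothetical and introduced only for illustration.

```python
import numpy as np

def conv_matrix(h, Lg):
    """Convolution matrix C of shape (len(h)+Lg-1, Lg) with C @ g == np.convolve(h, g)."""
    Lh = len(h)
    C = np.zeros((Lh + Lg - 1, Lg))
    for j in range(Lg):
        C[j:j + Lh, j] = h
    return C

def iterative_inverse(rirs, Lg, delta=1e-3, n_iter=2000):
    """Steepest descent on J(g) = ||H g - d||^2 + delta * ||g||^2 (illustrative only).

    rirs : array of shape (N, Lh), one impulse response per microphone
    Lg   : length of each inverse filter
    """
    # Stack the per-channel convolution matrices side by side: H = [H_1 ... H_N]
    H = np.hstack([conv_matrix(h, Lg) for h in rirs])
    d = np.zeros(H.shape[0])
    d[0] = 1.0                      # target equalized response: a unit impulse
    g = np.zeros(H.shape[1])        # stacked inverse filters, initialised to zero
    # Step size from the largest eigenvalue of H^T H + delta*I, which
    # guarantees a non-increasing cost for this quadratic problem.
    mu = 1.0 / (np.linalg.norm(H, 2) ** 2 + delta)
    costs = []
    for _ in range(n_iter):
        e = H @ g - d
        costs.append(e @ e + delta * (g @ g))
        g = g - mu * (H.T @ e + delta * g)   # regularized gradient step
    return g.reshape(len(rirs), Lg), H, d, costs

# Toy example: two random 16-tap "RIRs" (an assumption, not measured data)
rng = np.random.default_rng(0)
rirs = rng.standard_normal((2, 16))
g, H, d, costs = iterative_inverse(rirs, Lg=15)

# Equalized response: sum over channels of h_n * g_n (ideally close to an impulse)
equalized = sum(np.convolve(rirs[n], g[n]) for n in range(2))
print("initial cost:", costs[0], "final cost:", costs[-1])
```

The regularization term δ‖g‖² keeps the inverse filters from growing too large when the estimated responses are inaccurate, at the price of a less exact equalization; how δ and the step size are chosen in the cited works is not covered by this sketch.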

The major drawback of such an implementation is that the BCI stage needs to know "who speaks when" in order to estimate the RIRs related to the right speaker. To overcome this problem, this chapter proposes a solution which exploits a speaker diarization system. Speaker diarization steers the BCI and the ASR, thus allowing the identification task to be accomplished directly on the microphone mixture.

The proposed framework is developed on the NU-Tech platform (Squartini et al. (2005)), a freeware application which allows efficient management of the audio stream by means of the ASIO interface. NU-Tech provides a useful plug-in architecture which has been exploited for the C++ implementation. Experiments performed under synthetic conditions at a 16 kHz sampling rate confirm the real-time capabilities of the implemented architecture and its effectiveness as a multi-channel front-end for the subsequent speech recognition engine. The chapter is organized as follows. Sec. 2 describes the speech enhancement front-end, aimed at separating and dereverberating the speech sources, whereas Sec. 3 details the ASR engine and its parametrization. Sec. 4 discusses the simulation setup and the experiments performed. Conclusions are drawn in Sec. 5.

#### **2. Speech enhancement front-end**

Let *M* be the number of independent speech sources and *N* the number of microphones. The relationship between them is described by an *M* × *N* MIMO FIR (finite impulse response) system. According to such a model, the *n*-th microphone signal at the *k*-th sample time is:

$$x\_{n}(k) = \sum\_{m=1}^{M} \mathbf{h}\_{nm}^{T} \mathbf{s}\_{m}(k, L\_{h}), \qquad k = 1, 2, \dots, K, \quad n = 1, 2, \dots, N \tag{1}$$

where (·)<sup>*T*</sup> denotes the transpose operator and

$$\mathbf{s}\_{m}(k, L\_{h}) = \left[s\_{m}(k) \ \ s\_{m}(k-1) \ \ \cdots \ \ s\_{m}(k-L\_{h}+1)\right]^{T} \tag{2}$$

is the vector of the *L<sub>h</sub>* most recent samples of the *m*-th source. The term

$$\mathbf{h}\_{nm} = [h\_{nm,0} \ \ h\_{nm,1} \ \ \cdots \ \ h\_{nm,L\_h - 1}]^T, \quad n = 1,2,...,N, \quad m = 1,2,...,M \tag{3}$$

is the *L<sub>h</sub>*-tap RIR between the *n*-th microphone and the *m*-th source. Applying the *z*-transform, Eq. 1 can be rewritten as:

$$X\_{n}(z) = \sum\_{m=1}^{M} H\_{nm}(z) S\_{m}(z), \qquad n = 1, 2, \dots, N \tag{4}$$

where


$$H\_{nm}(z) = \sum\_{l=0}^{L\_h - 1} h\_{nm,l} z^{-l}.\tag{5}$$
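As an illustration of the signal model in Eq. 1, the following minimal NumPy sketch generates the microphone signals of an *M* × *N* MIMO FIR system. The function name `mimo_fir_mix`, the array layout and the random test data are assumptions made for this example only; they are not part of the NU-Tech implementation described in this chapter. Indices are 0-based in the code, and samples before the start of the signal are taken as zero.

```python
import numpy as np

def mimo_fir_mix(sources, rirs):
    """Apply Eq. 1: x_n(k) = sum_m h_nm^T s_m(k, L_h).

    sources: array of shape (M, K), one row per source s_m(k).
    rirs:    array of shape (N, M, L_h), rirs[n, m] holds h_nm.
    Returns an array of shape (N, K) with the microphone signals.
    """
    M, K = sources.shape
    N = rirs.shape[0]
    mics = np.zeros((N, K))
    for n in range(N):
        for m in range(M):
            # A linear convolution truncated to K samples equals the inner
            # product h_nm^T s_m(k, L_h) with a zero initial state.
            mics[n] += np.convolve(sources[m], rirs[n, m])[:K]
    return mics

# Hypothetical test data: M = 2 sources, N = 3 microphones, 8-tap RIRs.
rng = np.random.default_rng(0)
M, N, L_h, K = 2, 3, 8, 64
sources = rng.standard_normal((M, K))
rirs = rng.standard_normal((N, M, L_h))
mics = mimo_fir_mix(sources, rirs)
```

Each row of `mics` is one microphone mixture; in a real room the RIRs would be several thousand taps long rather than 8.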

The objective is to recover the original clean speech sources *s<sub>m</sub>* by means of a speech dereverberation approach. To this end, it is necessary to automatically identify who is speaking, estimate the unknown RIRs accordingly, and then apply a separation and dereverberation process to restore the original speech quality.

The reference framework proposed in (Huang et al. (2005); Rotili et al. (2010)) consists of three main stages: source separation, speech dereverberation and BCI. First, source separation is accomplished by transforming the original MIMO system into a number of SIMO systems; second, the separated (but still reverberated) sources pass through the dereverberation process, yielding the final cleaned-up speech signals. For the two procedures to work properly, the MIMO RIRs of the audio channels between the speech sources and the microphones must be estimated by the BCI stage. As mentioned in the introductory section, this approach suffers from the inability of the BCI stage to estimate the RIRs without knowledge of the speakers' activities. To overcome this disadvantage, a speaker diarization system can be introduced to steer the BCI stage. The block diagram of the proposed framework is shown in Fig. 1, where *N* = 3 and *M* = 2 have been considered. Speaker Diarization takes as input the central microphone mixture and, for each frame, the output P<sub>*m*</sub> is "1" if the *m*-th source is the only active one, and "0" otherwise. In this way, the front-end is able to detect when to perform the required operation. Using the information provided by the Speaker Diarization stage, the BCI estimates the RIRs and the speech recognition engine performs recognition only if the corresponding source is the only active one.

Fig. 1. Block diagram of the proposed framework.

#### **2.1 Blind channel identification**

Considering a SIMO system for a specific source *s*<sub>*m*∗</sub>, a BCI algorithm aims to find the RIR vector

$$\mathbf{h}\_{nm^\*} = \left[\mathbf{h}\_{1m^\*}^T \ \ \mathbf{h}\_{2m^\*}^T \ \ \cdots \ \ \mathbf{h}\_{Nm^\*}^T\right]^T$$

by using only the microphone signals *x<sub>n</sub>*(*k*). To ensure this, two identifiability conditions are assumed to be satisfied (Xu et al. (1995)):

1. The polynomials formed from **h**<sub>*nm*∗</sub> are co-prime, i.e. the room transfer functions (RTFs) *H*<sub>*nm*∗</sub>(*z*) do not share any common zeros (channel diversity);
2. C{*s*(*k*)} ≥ 2*L<sub>h</sub>* + 1, where C{*s*(*k*)} denotes the linear complexity of the sequence *s*(*k*).

This stage performs the BCI through the unconstrained normalized multi-channel frequency-domain least mean square (UNMCFLMS) algorithm (Huang & Benesty (2003)). It is an adaptive technique well suited to the real-time constraints imposed by the case study, since it offers a good compromise among fast convergence, adaptivity, and low computational complexity.

Here, we briefly review the UNMCFLMS in order to motivate its choice in the proposed front-end. Refer to (Huang & Benesty (2003)) for details. The derivation of UNMCFLMS is based on cross relation criteria (Xu et al. (1995)) using the overlap-save technique (Oppenheim et al. (1999)).

The frequency-domain cost function for the *q*-th frame is defined as

$$J\_f = \sum\_{n=1}^{N-1} \sum\_{i=n+1}^{N} \mathbf{e}\_{ni}^H(q) \mathbf{e}\_{ni}(q) \tag{6}$$

where **e**<sub>*ni*</sub>(*q*) is the frequency-domain block error signal between the *n*-th and *i*-th channels and (·)<sup>*H*</sup> denotes the Hermitian transpose operator. The update equation of the UNMCFLMS is expressed as

$$\begin{split} \widehat{\mathbf{h}}\_{nm^\*} (q+1) &= \widehat{\mathbf{h}}\_{nm^\*} (q) - \rho [\mathbf{P}\_{nm^\*} (q) + \delta \mathbf{I}\_{2L\_h \times 2L\_h}]^{-1} \\ &\times \sum\_{n=1}^N \mathbf{D}\_{\mathbf{x}\_n}^H (q) \mathbf{e}\_{ni} (q), \quad i = 1, \ldots, N \end{split} \tag{7}$$

where 0 < *ρ* < 2 is the step-size, *δ* is a small positive number and

$$
\widehat{\mathbf{h}}\_{nm^\*} (q) = \mathbf{F}\_{2L\_h \times 2L\_h} \left[ \widehat{\mathbf{h}}\_{nm^\*}^T (q) \, \mathbf{0}\_{1 \times L\_h} \right]^T,
$$

$$
\mathbf{e}\_{n i} (q) = \mathbf{F}\_{2L\_h \times 2L\_h} \left[ \mathbf{0}\_{1 \times L\_h} \left\{ \mathbf{F}\_{L\_h \times L\_h}^{-1} \mathbf{e}\_{n i} (q) \right\}^T \right]^T,
$$

$$
\mathbf{P}\_{nm^\*} (q) = \sum\_{n=1, n \neq i}^N \mathbf{D}\_{\mathbf{x}\_n}^H (q) \mathbf{D}\_{\mathbf{x}\_n} (q) \tag{8}
$$

while **F** denotes the discrete Fourier transform (DFT) matrix. The frequency-domain error function **e**<sub>*ni*</sub>(*q*) is given by

$$\mathbf{e}\_{ni}(q) = \mathbf{D}\_{\mathbf{x}\_{n}}(q)\widehat{\mathbf{h}}\_{nm^\*}(q) - \mathbf{D}\_{\mathbf{x}\_{i}}(q)\widehat{\mathbf{h}}\_{im^\*}(q) \tag{9}$$

where the diagonal matrix

$$\mathbf{D}\_{\mathbf{x}\_{n}}(q) = \text{diag}\left(\mathbf{F}\left\{ \left[ x\_{n}(qL\_{h} - L\_{h}) \ \ x\_{n}(qL\_{h} - L\_{h} + 1) \ \ \cdots \ \ x\_{n}(qL\_{h} + L\_{h} - 1) \right]^{T} \right\} \right) \tag{10}$$

is the DFT of the *q*-th frame input signal block for the *n*-th channel. From a computational point of view, the UNMCFLMS algorithm ensures an efficient execution of the circular convolution by means of the fast Fourier transform (FFT). In addition, it can be easily implemented in a real-time application, since the normalization matrix **P**<sub>*nm*∗</sub>(*q*) + *δ***I**<sub>2*L<sub>h</sub>*×2*L<sub>h</sub>*</sub> is diagonal and its inverse is straightforward to compute.
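The remark on the diagonal normalization matrix can be made concrete with a short NumPy sketch. It builds the diagonal of **D** from Eq. 10 for each channel, accumulates the power term of Eq. 8, and inverts the regularized matrix element by element. The helper name `block_spectrum`, the choice *i* = 0, the value of *δ* and the random input are assumptions made for illustration; this is not the full UNMCFLMS update.

```python
import numpy as np

def block_spectrum(x, q, L_h):
    """Diagonal of D_{x_n}(q) in Eq. 10: the 2*L_h-point DFT of the
    samples x(q*L_h - L_h), ..., x(q*L_h + L_h - 1)."""
    start = q * L_h - L_h
    return np.fft.fft(x[start:start + 2 * L_h])

# Hypothetical data: N = 3 channels, L_h = 8 taps, frame index q = 2.
rng = np.random.default_rng(2)
N, L_h, q, delta = 3, 8, 2, 1e-6
mics = rng.standard_normal((N, 64))

# One diagonal (stored as a vector) per channel.
D = np.array([block_spectrum(mics[n], q, L_h) for n in range(N)])

# Eq. 8 for i = 0: the sum over n != i of D^H D. For diagonal matrices
# this reduces to an element-wise power spectrum.
i = 0
P_diag = sum(np.abs(D[n]) ** 2 for n in range(N) if n != i)

# Inverting the diagonal matrix P + delta*I is an element-wise reciprocal.
inv_diag = 1.0 / (P_diag + delta)
```

Storing only the diagonals keeps the cost of the normalization at one division per frequency bin, which is what makes the step cheap enough for frame-by-frame processing.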

Though UNMCFLMS allows the estimation of long RIRs, it requires a high input signal-to-noise ratio. In this chapter, the presence of noise is not taken into account, and therefore UNMCFLMS remains an appropriate choice. Different solutions have been proposed in the literature to alleviate the misconvergence problem of UNMCFLMS in the presence of noise. Among them, the algorithms presented in (Haque et al. (2007); Haque & Hasan (2008); Yu & Er (2004)) offer significant robustness against noise, and they could be used to improve our front-end.
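The cross relation on which the derivation of UNMCFLMS rests (Xu et al. (1995)) can be checked numerically. For a noise-free SIMO system, filtering the signal of one microphone with the RIR of another gives the same result in either order, so the difference vanishes for the true channels and not for wrong ones. The sketch below is a time-domain check under assumed random channels and signals, not the frequency-domain adaptive algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)
L_h, K = 16, 256
s = rng.standard_normal(K)        # single active source
h1 = rng.standard_normal(L_h)     # RIR to microphone 1 (hypothetical)
h2 = rng.standard_normal(L_h)     # RIR to microphone 2 (hypothetical)

x1 = np.convolve(s, h1)           # noise-free microphone signals
x2 = np.convolve(s, h2)

# Cross relation: x1 * h2 = s * h1 * h2 = x2 * h1.
error_true = np.convolve(x1, h2) - np.convolve(x2, h1)

# A perturbed channel estimate leaves a non-zero error, which is the
# quantity an adaptive BCI algorithm drives towards zero.
h2_wrong = h2 + 0.1 * rng.standard_normal(L_h)
error_wrong = np.convolve(x1, h2_wrong) - np.convolve(x2, h1)

residual_true = np.max(np.abs(error_true))
residual_wrong = np.max(np.abs(error_wrong))
```

With additive noise on `x1` and `x2` the residual no longer vanishes for the true channels, which is consistent with the requirement of a high input signal-to-noise ratio stated above.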


#### **2.2 Source separation**

Here we briefly review the procedure already described in (Huang et al. (2005)), according to which it is possible to transform an *M* × *N* MIMO system (with *M* < *N*) into *M* 1 × *N* SIMO systems free of interference, as described by the following relation:

$$Y\_{s\_{m},p}(z) = F\_{s\_{m},p}(z)S\_{m}(z) + B\_{s\_{m},p}(z), \quad m = 1,2,\ldots,M, \quad p = 1,2,\ldots,P \tag{11}$$

where *P* = *C*<sub>*N*</sub><sup>*M*</sup> is the number of combinations. Note that the SIMO system outputs are reverberated, likely more than the microphone signals, due to the long impulse responses of the equivalent channels *F*<sub>*s<sub>m</sub>*,*p*</sub>(*z*). The related formulas and a detailed description of the algorithm can be found in (Huang et al. (2005)).

Fig. 2. Conversion of a 2 × 3 MIMO system in two 1 × 3 SIMO systems.

Different choices can be made in order to calculate the equivalent SIMO system. The block scheme of Fig. 2, representing the MIMO-SIMO conversion, depicts a possible solution for *M* = 2 and *N* = 3. With this choice, the first SIMO system, corresponding to the source *s*<sub>1</sub>, is

$$\begin{aligned} F\_{\mathbf{s}\_1,1}(z) &= H\_{32}(z)H\_{21}(z) - H\_{22}(z)H\_{31}(z), \\ F\_{\mathbf{s}\_1,2}(z) &= H\_{32}(z)H\_{11}(z) - H\_{12}(z)H\_{31}(z), \\ F\_{\mathbf{s}\_1,3}(z) &= H\_{22}(z)H\_{11}(z) - H\_{12}(z)H\_{21}(z). \end{aligned} \tag{12}$$

The second SIMO system, corresponding to the source *s*<sub>2</sub>, can be found in a similar way; it results that *F*<sub>*s*<sub>1</sub>,*p*</sub>(*z*) = *F*<sub>*s*<sub>2</sub>,*p*</sub>(*z*) with *p* = 1, 2, 3. As stated in the previous section, the presence of additive noise is not taken into account in this contribution, and therefore all the terms *B*<sub>*s<sub>m</sub>*,*p*</sub>(*z*) of Eq. 11 are equal to zero. Finally, it is important to highlight that this separation algorithm has a lower computational complexity than traditional independent component analysis techniques and that, since the MIMO system is decomposed into a number of SIMO systems which are blindly identified at different times, the permutation ambiguity problem is avoided.
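The first line of Eq. 12 can be verified with a few convolutions. In the sketch below, microphone 2 is filtered with *H*<sub>32</sub> and microphone 3 with *H*<sub>22</sub>; after subtraction the contribution of *s*<sub>2</sub> cancels and only *s*<sub>1</sub> filtered by the equivalent channel remains. The RIRs are assumed to be known and random here, whereas in the framework they are supplied by the BCI stage; array names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
L_h, K = 8, 128
s1, s2 = rng.standard_normal((2, K))
# H[n, m] is the RIR from source m+1 to microphone n+1 (0-based indices).
H = rng.standard_normal((3, 2, L_h))

conv = np.convolve
# Noise-free microphone signals x1, x2, x3.
x = [conv(s1, H[n, 0]) + conv(s2, H[n, 1]) for n in range(3)]

# Output p = 1 for source s1: filter x2 with H32 and x3 with H22,
# then subtract so that the s2 contributions cancel.
y_s1_1 = conv(x[1], H[2, 1]) - conv(x[2], H[1, 1])

# Equivalent channel F_{s1,1}(z) = H32(z)H21(z) - H22(z)H31(z) from Eq. 12.
F_s1_1 = conv(H[2, 1], H[1, 0]) - conv(H[1, 1], H[2, 0])
expected = conv(s1, F_s1_1)
max_dev = np.max(np.abs(y_s1_1 - expected))
```

The equivalent channel `F_s1_1` has 2*L<sub>h</sub>* − 1 taps, which illustrates why the separated outputs are more reverberated than the microphone signals.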

#### **2.3 Speech dereverberation**


Given the equivalent SIMO system *F*<sub>*s*<sub>*m*∗</sub>,*p*</sub>(*z*) related to the specific source *s*<sub>*m*∗</sub>, a set of inverse filters *G*<sub>*s*<sub>*m*∗</sub>,*p*</sub>(*z*) can be found by using the MINT theorem such that

$$\sum\_{p=1}^{P} F\_{\mathbf{s}\_{m^\*},p}(z) G\_{\mathbf{s}\_{m^\*},p}(z) = 1,\tag{13}$$

assuming that the polynomials *F*<sub>*s*<sub>*m*∗</sub>,*p*</sub>(*z*) have no common zeros. In the time domain, the inverse filter vector, denoted as **g**<sub>*s*<sub>*m*∗</sub></sub>, is calculated by minimizing the following cost function:

$$\mathbf{C} = \left\| \mathbf{F}\_{\mathbf{s}\_{m^\*}} \mathbf{g}\_{\mathbf{s}\_{m^\*}} - \mathbf{v} \right\|^2,\tag{14}$$

where ‖·‖ denotes the *l*<sub>2</sub>-norm operator and

$$\mathbf{g}\_{\mathbf{s}\_{m^\*}} = \begin{bmatrix} \mathbf{g}\_{\mathbf{s}\_{m^\*},1}^T \ \mathbf{g}\_{\mathbf{s}\_{m^\*},2}^T \ \cdots \ \mathbf{g}\_{\mathbf{s}\_{m^\*},P}^T \end{bmatrix}^T \tag{15}$$

$$\mathbf{g}\_{\mathbf{s}\_{m^\*},p} = \begin{bmatrix} g\_{\mathbf{s}\_{m^\*},p}(1) \ \ g\_{\mathbf{s}\_{m^\*},p}(2) \ \ \cdots \ \ g\_{\mathbf{s}\_{m^\*},p}(L\_{g}) \end{bmatrix}^T,\tag{16}$$

$$\mathbf{v} = [\underbrace{0, \dots, 0}\_{d}, 1, \dots, 0]^T,\tag{17}$$

with *p* = 1, 2, ···, *P*. The vector **v** is the target vector, i.e. the Kronecker delta shifted by an appropriate modeling delay (0 ≤ *d* ≤ *PL<sub>g</sub>*), while

$$\mathbf{F}\_{\mathbf{s}\_{m^\*}} = \left[ \mathbf{F}\_{\mathbf{s}\_{m^\*},1} \ \ \mathbf{F}\_{\mathbf{s}\_{m^\*},2} \ \ \cdots \ \ \mathbf{F}\_{\mathbf{s}\_{m^\*},P} \right]$$

where **F**<sub>*s*<sub>*m*∗</sub>,*p*</sub> is the convolution matrix of the equivalent FIR filter

$$\mathbf{f}\_{\mathbf{s}\_{m^\*},p} = \left[ f\_{\mathbf{s}\_{m^\*},p}(1) \ \ f\_{\mathbf{s}\_{m^\*},p}(2) \ \ \cdots \ \ f\_{\mathbf{s}\_{m^\*},p}(L\_{f}) \right]$$

of length *L<sub>f</sub>*. When the matrix **F**<sub>*s*<sub>*m*∗</sub></sub> is obtained as shown in the previous section, the inverse filter set can be calculated as

$$\mathbf{g}\_{\mathbf{s}\_{m^\*}} = \mathbf{F}\_{\mathbf{s}\_{m^\*}}^\dagger \mathbf{v} \tag{18}$$

where (·)<sup>†</sup> denotes the Moore-Penrose pseudoinverse. In order to have a unique solution, *L<sub>g</sub>* must be chosen in such a way that **F**<sub>*s*<sub>*m*∗</sub></sub> is square, i.e.

$$L\_{g} = \frac{L\_f - 1}{P - 1}.\tag{19}$$
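Eqs. 13 to 19 can be exercised on a toy example. The sketch below draws *P* = 3 random equivalent channels of length *L<sub>f</sub>* = 9, so that Eq. 19 gives *L<sub>g</sub>* = 4 and the stacked convolution matrix is 12 × 12, and then computes the inverse filters with the pseudoinverse of Eq. 18. The helper `convolution_matrix`, the delay *d* = 2 and the random channels are assumptions for illustration; random channels have no common zeros with probability one, whereas measured RIRs may be far less well conditioned.

```python
import numpy as np

def convolution_matrix(f, L_g):
    """(L_f + L_g - 1) x L_g matrix C such that C @ g == np.convolve(f, g)."""
    L_f = len(f)
    C = np.zeros((L_f + L_g - 1, L_g))
    for col in range(L_g):
        C[col:col + L_f, col] = f
    return C

rng = np.random.default_rng(4)
P, L_f = 3, 9
L_g = (L_f - 1) // (P - 1)             # Eq. 19
f = rng.standard_normal((P, L_f))      # hypothetical equivalent channels

# Stack the P convolution matrices side by side.
F = np.hstack([convolution_matrix(f[p], L_g) for p in range(P)])

d = 2                                  # hypothetical modeling delay
v = np.zeros(F.shape[0])
v[d] = 1.0                             # Eq. 17

g = np.linalg.pinv(F) @ v              # Eq. 18
g = g.reshape(P, L_g)                  # one inverse filter per row

# Eq. 13 in the time domain: the filtered channels sum to a delayed impulse.
equalized = sum(np.convolve(f[p], g[p]) for p in range(P))
```

In this noise-free, exactly known setting `equalized` reproduces the target vector **v** up to rounding error.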

Considering the presence of disturbances, i.e. additive noise or RTF fluctuations, the cost function of Eq. 14 is modified as follows (Hikichi et al. (2007)):

$$\mathbf{C} = \left\| \mathbf{F}\_{\mathbf{s}\_{m^\*}} \mathbf{g}\_{\mathbf{s}\_{m^\*}} - \mathbf{v} \right\|^2 + \gamma \left\| \mathbf{g}\_{\mathbf{s}\_{m^\*}} \right\|^2,\tag{20}$$

where the parameter *γ* (≥ 0), called the regularization parameter, is a scalar coefficient representing the weight assigned to the disturbance term. Note that Eq. 20 has the same form as Tikhonov regularization for ill-posed problems (Egger & Engl (2005)).

Let the RTF for the fluctuation case be given by the sum of two terms, the mean RTF (**F̄**<sub>*s*<sub>*m*∗</sub></sub>) and the fluctuation from the mean RTF (**F̃**<sub>*s*<sub>*m*∗</sub></sub>), and let *E*⟨**F̃**<sup>*T*</sup><sub>*s*<sub>*m*∗</sub></sub> **F̃**<sub>*s*<sub>*m*∗</sub></sub>⟩ = *γ***I**. In this case a general cost function, covering both the noise and the fluctuation case, can be derived:

$$\mathbf{C} = \mathbf{g}\_{\mathbf{s}\_{m^\*}}^T \mathcal{F}^T \mathcal{F} \mathbf{g}\_{\mathbf{s}\_{m^\*}} - \mathbf{g}\_{\mathbf{s}\_{m^\*}}^T \mathcal{F}^T \mathbf{v} - \mathbf{v}^T \mathcal{F} \mathbf{g}\_{\mathbf{s}\_{m^\*}} + \mathbf{v}^T \mathbf{v} + \gamma \mathbf{g}\_{\mathbf{s}\_{m^\*}}^T \mathbf{g}\_{\mathbf{s}\_{m^\*}} \tag{21}$$

where

$$\mathcal{F} = \begin{cases} \mathbf{F}\_{\mathbf{s}\_{m^\*}} & \text{(noise case)}\\ \overline{\mathbf{F}}\_{\mathbf{s}\_{m^\*}} & \text{(fluctuation case)}. \end{cases} \tag{22}$$


The filter that minimizes the cost function in Eq. 21 is obtained by taking the derivatives with respect to $\mathbf{g}\_{\mathbf{s}\_{m^\*}}$ and setting them equal to zero. The required solution is

$$\mathbf{g}\_{\mathbf{s}\_{m^\*}} = \left(\boldsymbol{\mathcal{F}}^T \boldsymbol{\mathcal{F}} + \gamma \mathbf{I}\right)^{-1} \boldsymbol{\mathcal{F}}^T \mathbf{v}.\tag{23}$$

The usage of Eq. 23 to calculate the inverse filters requires a matrix inversion that, in the case of long RIRs, can result in a high computational burden. Instead, an adaptive algorithm (Rotili et al. (2008)) has been here adopted to satisfy the real-time constraint. It is based on the steepest-descent technique, whose recursive estimator has the form

$$\mathbf{g}\_{\mathbf{s}\_{m^\*}}(q+1) = \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q) - \frac{\mu(q)}{2} \nabla C.\tag{24}$$

Starting from Eq. 21, simple algebraic calculations yield the following expression for the gradient:

$$\nabla C = -2[\mathcal{F}^T(\mathbf{v} - \mathcal{F}\mathbf{g}\_{\mathbf{s}\_{m^\*}}(q)) - \gamma \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q)].\tag{25}$$

Substituting Eq. 25 into Eq. 24 gives

$$\mathbf{g}\_{\mathbf{s}\_{m^\*}}(q+1) = \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q) + \mu(q)[\mathcal{F}^T(\mathbf{v} - \mathcal{F}\mathbf{g}\_{\mathbf{s}\_{m^\*}}(q)) - \gamma \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q)],\tag{26}$$

where *μ*(*q*) is the step-size. Convergence to the optimal solution is guaranteed if the usual conditions on the step-size, expressed in terms of the eigenvalues of the autocorrelation matrix $\mathcal{F}^T\mathcal{F}$, hold. However, convergence can be slow if a fixed step-size is chosen. The convergence speed can be increased by following the approach in (Guillaume et al. (2005)), where the step-size is chosen so as to minimize the cost function at the next iteration. The resulting analytical expression for the step-size is:

$$\mu(q) = \frac{\mathbf{e}^T(q)\mathbf{e}(q)}{\mathbf{e}^T(q)\left(\mathcal{F}^T\mathcal{F} + \gamma \mathbf{I}\right)\mathbf{e}(q)}\tag{27}$$

where

$$\mathbf{e}(q) = \mathcal{F}^T \left[ \mathbf{v} - \mathcal{F} \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q) \right] - \gamma \mathbf{g}\_{\mathbf{s}\_{m^\*}}(q).$$

The algorithm illustrated above offers several advantages. The regularization parameter, which takes into account the presence of disturbances, makes the dereverberation process more robust to estimation errors introduced by the BCI algorithm (Hikichi et al. (2007)). The real-time constraint can be met even in the case of long RIRs, since no matrix inversion is required. Finally, the complexity of the algorithm has been reduced by computing the required operations in the frequency domain by means of FFTs.
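The recursion of Eqs. 26 and 27 can be sketched in a few lines of plain Python. This is only an illustrative time-domain version on a tiny dense matrix, not the frequency-domain implementation used in the chapter; the helper names (`matvec`, `dot`, `error_vector`, `inverse_filter`), the zero initialization, the stopping tolerance, and the toy values of `F`, `v` and `gamma` are all assumptions made for the example.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def error_vector(F, v, g, gamma):
    """e(q) = F^T (v - F g) - gamma g, i.e. minus half the gradient of Eq. 25."""
    residual = [vi - pi for vi, pi in zip(v, matvec(F, g))]
    Ft = [list(col) for col in zip(*F)]
    return [b - gamma * gi for b, gi in zip(matvec(Ft, residual), g)]


def inverse_filter(F, v, gamma, max_iter=500, tol=1e-12):
    """Iterate Eq. 26 with the step-size of Eq. 27, starting from g = 0."""
    g = [0.0] * len(F[0])
    for _ in range(max_iter):
        e = error_vector(F, v, g, gamma)
        ee = dot(e, e)
        if ee < tol:  # gradient numerically zero: stop
            break
        Fe = matvec(F, e)
        # e^T (F^T F + gamma I) e computed as ||F e||^2 + gamma ||e||^2
        mu = ee / (dot(Fe, Fe) + gamma * ee)
        g = [gi + mu * ei for gi, ei in zip(g, e)]
    return g


# Toy problem (hypothetical values, chosen only to keep the example small).
F = [[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]]
v = [1.0, 2.0, 3.0]
g = inverse_filter(F, v, gamma=0.1)
print(g)
```

At convergence the vector returned by `error_vector` vanishes, which is the same stationarity condition that leads to the closed-form solution of Eq. 23, so the sketch reaches that solution without ever inverting a matrix.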

#### **2.4 Speaker diarization**


The speaker diarization stage drives the BCI and the ASRs so that they can operate on speaker-homogeneous regions. Current state-of-the-art speaker diarization systems are based on clustering approaches, usually combining hidden Markov models (HMMs) and the Bayesian information criterion metric (Fredouille et al. (2009); Wooters & Huijbregts (2008)). Despite their state-of-the-art performance, such systems have the drawback of operating on the entire signal, which makes them unsuitable for the online operation required by the proposed framework.

The approach taken here as reference was proposed in (Vinyals & Friedland (2008)), and its block scheme for *M* = 2 and *N* = 3 is shown in Fig. 3. The algorithm operates in two phases, training and recognition. In the first, the acquired signals, after manual removal of the silence periods, are transformed into feature vectors composed of 19 mel-frequency cepstral coefficients (MFCCs) plus their first and second derivatives. Cepstral mean normalization is applied to deal with stationary channel effects. Speaker models are represented by mixtures of Gaussians trained by means of the expectation-maximization algorithm. The number of Gaussians and the end accuracy at convergence have been empirically determined and set to 100 and 10<sup>−4</sup>, respectively. In this phase the voice activity detector (VAD) is also trained. The adopted VAD is based on a bi-Gaussian model of the frame log-energy. During training, a two-Gaussian model is estimated from the input sequence: the Gaussian with the smallest mean models the silence frames, whereas the other Gaussian corresponds to frames of speech activity.

Fig. 3. The speaker diarization block scheme: "SPK1" and "SPK2" are the speaker identity labels assigned to each chunk.

In the recognition phase, the first operation is voice activity detection, which removes the silence periods: each frame is tagged as silence or speech based on the bi-Gaussian model, using a maximum likelihood criterion.
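As an illustration of the bi-Gaussian VAD, the sketch below fits a two-component one-dimensional Gaussian mixture to frame log-energies with a basic EM loop and then tags frames by maximum likelihood. It is a minimal sketch under stated assumptions: the function names, the min/max initialization, the variance floor, the fixed number of iterations and the sample log-energy values are hypothetical and are not taken from the chapter.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def train_bigaussian(log_energies, iterations=50, var_floor=1e-3):
    """Fit a two-Gaussian model with EM.

    Returns (silence, speech), each a (weight, mean, variance) tuple; the
    component with the smaller mean is taken as the silence model.
    """
    lo, hi = min(log_energies), max(log_energies)
    spread = max((hi - lo) ** 2 / 4.0, var_floor)
    comps = [[0.5, lo, spread], [0.5, hi, spread]]  # [weight, mean, variance]
    n = len(log_energies)
    for _ in range(iterations):
        # E-step: posterior of each component for every frame
        resp = []
        for x in log_energies:
            logp = [math.log(w) + gaussian_logpdf(x, m, s) for w, m, s in comps]
            top = max(logp)
            p = [math.exp(value - top) for value in logp]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-8:  # component lost all its frames: keep old values
                continue
            mean = sum(r[k] * x for r, x in zip(resp, log_energies)) / nk
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, log_energies)) / nk
            comps[k] = [nk / n, mean, max(var, var_floor)]
    silence, speech = sorted(comps, key=lambda c: c[1])
    return tuple(silence), tuple(speech)


def is_speech(x, silence, speech):
    """Maximum-likelihood decision between the speech and silence Gaussians."""
    return gaussian_logpdf(x, speech[1], speech[2]) > gaussian_logpdf(x, silence[1], silence[2])


# Hypothetical frame log-energies: five silence-like and five speech-like frames.
frames = [-8.2, -7.9, -8.0, -8.1, -7.8, -2.1, -1.9, -2.3, -1.7, -2.0]
silence_model, speech_model = train_bigaussian(frames)
print([is_speech(x, silence_model, speech_model) for x in frames])
```

The decision compares the two component likelihoods only, in keeping with the maximum likelihood criterion mentioned above; using the mixture weights as priors would be a different (maximum a posteriori) rule.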

After voice activity detection, the signals are divided into non-overlapping chunks, and the same feature extraction pipeline used in the training phase extracts the feature vectors. The decision is then taken by majority vote on the likelihoods: every feature vector in the current segment is assigned to one of the known speaker models according to the maximum likelihood criterion. The model to which the majority of vectors is assigned determines the speaker identity for the current segment. The Demultiplexer block associates each speaker label with a distinct output and sets it to "1" if that speaker is the only active one, and to "0" otherwise.
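The majority-vote decision and the Demultiplexer output can be sketched as follows. The per-frame log-likelihoods are assumed to have been computed already by the speaker GMMs; the function names, the dictionary layout and the numeric values are hypothetical, and ties are resolved in favour of the speaker that received its first vote earliest (an arbitrary choice that the chapter does not specify).

```python
from collections import Counter


def identify_speaker(frame_loglikes):
    """Return the speaker whose model wins the most frames in a chunk.

    frame_loglikes holds one dict {speaker_label: log_likelihood} per
    feature vector of the chunk.
    """
    votes = Counter(max(ll, key=ll.get) for ll in frame_loglikes)
    return votes.most_common(1)[0][0]


def demultiplex(label, speakers):
    """Set the output of the identified speaker to 1 and all others to 0."""
    return {spk: 1 if spk == label else 0 for spk in speakers}


# Hypothetical log-likelihoods for a three-frame chunk.
chunk = [
    {"SPK1": -10.0, "SPK2": -12.0},
    {"SPK1": -11.0, "SPK2": -9.0},
    {"SPK1": -8.0, "SPK2": -9.5},
]
label = identify_speaker(chunk)
print(label, demultiplex(label, ["SPK1", "SPK2"]))
```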

It is worth pointing out that the speaker diarization algorithm is not able to detect overlapped speech; an oracle overlap detector is used to overcome this limitation.


#### **2.5 Speech enhancement front-end operation**

The proposed front-end requires an initial training phase where each speaker is asked to talk for 60 s. During this period, the speaker diarization stage trains both the VAD and the speakers' models.

In the testing phase, the input signal is divided into non-overlapping chunks of 2 s, and the speaker diarization stage provides as output the speakers' activity P<sub>*m*</sub>. This information is employed in both the BCI stage and the ASR engines: only when the *m*-th source is the only active one are the related RIRs updated and the dereverberated speech recognized. In all other situations the BCI stage outputs the RIRs estimated at the previous step, while the ASRs are idle.

The Separation stage takes the microphone signals as input and outputs the interference-free signals, which are subsequently processed by the Dereverberation stage. Both stages perform their operations using the RIR vector provided by the BCI stage.

The front-end performance is strictly related to the speaker diarization errors. In particular, the BCI stage is sensitive to false alarms (speaker in hypothesis but not in reference) and speaker errors (mapped reference is not the same as hypothesis speaker). If one of these occurs, the BCI stage adapts the RIRs using an inappropriate input frame and outputs an incorrect estimate. A further error that produces the same behaviour is a missed speaker overlap detection.

The sensitivity to false alarms and speaker errors could be reduced by imposing a constraint on the estimation procedure and updating the RIRs only when the cost function decreases. A solution to missed overlap errors would be to add an overlap detector and to skip the estimation whenever more than one speaker is simultaneously active. On the other hand, missed speaker errors (speaker in reference but not in hypothesis) do not negatively affect the RIR estimation procedure, since the BCI stage does not perform the adaptation in such frames. Only a reduced convergence rate can be noticed in this case.

The real-time capabilities of the proposed front-end have been evaluated by calculating the real-time factor on an Intel® Core™ i7 machine running at 3 GHz with 4 GB of RAM. The value obtained for the speaker diarization stage is 0.03, meaning that a new result is output every 2.06 s. The real-time factor for the other stages is 0.04, resulting in a total value of 0.07 for the entire front-end.
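The arithmetic behind these figures can be made explicit with a short sketch. It takes the real-time factor to be processing time divided by audio duration, and it assumes that a chunk is processed only after it has been fully acquired and that the stages run one after the other, so that their real-time factors add; the constant and function names are hypothetical, and the ASR value of 0.33 is the one reported in Sec. 3.

```python
def output_period(chunk_s, rtf):
    """Seconds between results: chunk acquisition time plus processing time."""
    return chunk_s * (1.0 + rtf)


CHUNK_S = 2.0            # chunk length used in the testing phase
RTF_DIARIZATION = 0.03   # speaker diarization stage
RTF_OTHER_STAGES = 0.04  # remaining front-end stages
RTF_ASR = 0.33           # ASR engine, as reported in Sec. 3

front_end_rtf = RTF_DIARIZATION + RTF_OTHER_STAGES
joint_rtf = front_end_rtf + RTF_ASR  # assumes sequential execution

print(round(output_period(CHUNK_S, RTF_DIARIZATION), 2))  # diarization output period in seconds
print(round(front_end_rtf, 2), round(joint_rtf, 2))
```

Under these assumptions the joint real-time factor stays well below 1, which is what allows the front-end and the recognizer to keep up with the incoming audio.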

#### **3. ASR engine**

Automatic speech recognition has been performed by means of the Hidden Markov Model Toolkit (HTK) (Young et al. (2006)) using HDecode, which has been specifically designed for large-vocabulary speech recognition tasks. Features have been extracted with the HCopy tool and are composed of 13 MFCCs plus deltas and double deltas, resulting in a 39-dimensional feature vector. Cepstral mean normalization is included in the feature extraction pipeline. Recognition has been performed based on the acoustic models available in (Vertanen (2006)).

The models differ with respect to the amount of training data, the use of word-internal or cross-word triphones, the number of tied states, the number of Gaussians per state, and the initialization strategy. The main focus of this work is to achieve real-time execution of the complete framework, thus an acoustic model offering both adequate accuracy and real-time capability was required. The computational cost strongly depends on the number of Gaussians per state, and in (Vertanen (2006)) it has been shown that real-time execution can be obtained using 16 Gaussians per state. The main parameters of the selected acoustic model are summarized in Table 1.


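| Parameter | Value |
|---|---|
| Training data | WSJ0 & WSJ1 |
| Initialization strategy | TIMIT bootstrap |
| Triphone model | cross-word |
| # of tied states (approx.) | 8000 |
| # of Gaussians per state | 16 |
| # of silence Gaussians | 32 |

Table 1. Characteristics of the selected acoustic model.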

The language model is the 5k-word bigram model included in the Wall Street Journal (WSJ) corpus. Recognizer parameters are the same as in (Vertanen (2006)): with these values, the word accuracy obtained on the November '92 test set is 94.30% with a real-time factor of 0.33 on the hardware platform mentioned above. It is worth pointing out that the ASR engine and the front-end can jointly operate in real time.

#### **4. Experiments**


#### **4.1 Corpus description**

The acoustic scenario under study consists of an array of three microphones and two speech sources located in a small office. The room arrangement is depicted in Fig. 4. The data set used for the speech recognition experiments has been constructed from the WSJ November '92 speech recognition evaluation set. It consists of 330 sentences (about 40 minutes of speech), uttered by eight different speakers, both male and female. The data set is recorded at 16 kHz and does not contain any additive noise or reverberation.

Fig. 4. Room setup. The floor plan is labelled 4.00 m × 3.00 m; the sources are located at S1 (0.70 m, 1.25 m, 1.40 m) and S2 (3.30 m, 1.25 m, 1.40 m), and the microphones at M1 (1.65 m, 2.00 m, 1.40 m), M2 (2.35 m, 2.00 m, 1.40 m) and M3 (2.00 m, 1.65 m, 1.40 m).

A suitable database representing the described scenario has been artificially created using the following procedure. The 330 clean sentences are first reduced to 320 in order to have the same number of sentences for each speaker. These are then convolved with RIRs generated using the RIR Generator tool (Habets (2008)). No background noise has been added. Two different reverberation conditions have been taken into account, a low and a high reverberant one, corresponding to *T*<sub>60</sub> = 120 ms and *T*<sub>60</sub> = 240 ms respectively (with RIRs 1024 taps long).

For each channel, the final overlapped and reverberated sentences have been obtained by coupling the sentences of two speakers. Following the WSJ November '92 notation, speaker 440 has been paired with 441, 442 with 443, etc. This choice makes it possible to cover all the combinations of male and female speakers, resulting in 40 sentences per pair of speakers. The mean value of overlap has been fixed to 15% of the speech frames for the overall dataset. For each sentence the amount of overlap is a random value drawn from the uniform distribution on the interval [12, 18]. This allows the artificial database to reflect the frequency of overlapped speech in real-life scenarios such as two-party telephone conversations or meetings (Shriberg et al. (2000)).

#### **4.2 Front-end evaluation**

As stated in Sec. 2, the proposed speech enhancement front-end consists of four different stages. Here we focus on the evaluation of the Speaker Diarization and BCI stages, which represent the most crucial parts of the entire system. An extensive evaluation of the Separation and Dereverberation stages can be found in (Huang et al. (2005)) and (Rotili et al. (2008)) respectively.

The performance of the speaker diarization algorithm is measured by the diarization error rate<sup>1</sup> (DER), defined by the following expression:

$$\text{DER} = \frac{\sum\_{s=1}^{S} \text{dur}(s) \left( \max(N\_{\text{ref}}(s), N\_{\text{hyp}}(s)) - N\_{\text{correct}}(s) \right)}{\sum\_{s=1}^{S} \text{dur}(s) N\_{\text{ref}}(s)} \tag{28}$$

where dur(*s*) is the duration of segment *s*, *S* is the total number of segments in which no speaker change occurs, *N*<sub>ref</sub>(*s*) and *N*<sub>hyp</sub>(*s*) indicate respectively the number of speakers in the reference and in the hypothesis, and *N*<sub>correct</sub>(*s*) indicates the number of speakers that speak in segment *s* and have been correctly matched between the reference and the hypothesis. As recommended by the National Institute of Standards and Technology (NIST), evaluation has been performed by means of the "md-eval" tool with a collar of 0.25 s around each segment to take into account timing errors in the reference. The same metric and tool are used to evaluate the VAD performance<sup>2</sup>.
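Eq. 28 translates directly into code. The sketch below is not the "md-eval" tool and does not apply any collar; it only evaluates the formula on a list of segments. The function name, the tuple layout and the example segments are hypothetical.

```python
def diarization_error_rate(segments):
    """Evaluate Eq. 28.

    segments is an iterable of (duration_s, n_ref, n_hyp, n_correct) tuples,
    one per segment in which no speaker change occurs.
    """
    error_time = sum(
        dur * (max(n_ref, n_hyp) - n_correct)
        for dur, n_ref, n_hyp, n_correct in segments
    )
    reference_time = sum(dur * n_ref for dur, n_ref, _, _ in segments)
    return error_time / reference_time


# Hypothetical segments illustrating the error types discussed in Sec. 2.5.
segments = [
    (10.0, 1, 1, 1),  # correctly labelled speech
    (2.0, 1, 1, 0),   # speaker error
    (3.0, 0, 1, 0),   # false alarm
    (5.0, 2, 1, 1),   # missed speaker in an overlap region
]
print(100.0 * diarization_error_rate(segments))  # DER in percent
```

In this example the error time is 2 + 3 + 5 = 10 s against 22 s of reference speaker time, so each error type contributes in proportion to the duration of the segments it affects.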

The performance of the VAD alone is reported in Table 2. Table 3 shows the results obtained by testing the speaker diarization algorithm on the clean signals, as well as on the two reverberated scenarios, in the previously illustrated configurations. For the sake of comparison, two different configurations have been considered:

• REAL-SD w/ ORACLE-VAD: The speaker diarization system uses an "Oracle" VAD;

• REAL-SD w/ REAL-VAD: The system described in Sec. 2.4.

The performance across the three scenarios is similar owing to the matching of the training and testing conditions, and is consistent with (Vinyals & Friedland (2008)).

| | Clean | *T*<sub>60</sub> = 120 ms | *T*<sub>60</sub> = 240 ms |
|---|---|---|---|
| REAL-VAD | 1.85 | 1.96 | 1.68 |

Table 2. VAD error rate (%).

| | Clean | *T*<sub>60</sub> = 120 ms | *T*<sub>60</sub> = 240 ms |
|---|---|---|---|
| REAL-SD w/ ORACLE-VAD | 13.57 | 13.30 | 13.24 |
| REAL-SD w/ REAL-VAD | 15.20 | 15.20 | 14.73 |

Table 3. Speaker diarization error rate (%).

<sup>1</sup> http://www.itl.nist.gov/iad/mig/tests/rt/2004-fall/

<sup>2</sup> Details can be found in "*Spring 2005 (RT-05S) Rich Transcription Meeting Recognition Evaluation Plan*". The "md-eval" tool is available at http://www.itl.nist.gov/iad/mig//tools/

The performance of the BCI stage is evaluated by means of a channel-based measure called Normalized Projection Misalignment (NPM) (Morgan et al. (1998)), defined as

$$\text{NPM}\left(q\right) = 20\log\_{10}\left(\frac{||\boldsymbol{\epsilon}(q)||}{||\mathbf{h}||}\right),\tag{29}$$

where


$$\boldsymbol{\epsilon}(q) = \mathbf{h} - \frac{\mathbf{h}^T \widehat{\mathbf{h}}(q)}{\widehat{\mathbf{h}}^T(q)\widehat{\mathbf{h}}(q)} \widehat{\mathbf{h}}(q) \tag{30}$$

is the projection misalignment vector, **h** is the real RIR vector, and $\widehat{\mathbf{h}}(q)$ is the one estimated at the *q*-th iteration, i.e. at frame index *q*.
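Eqs. 29 and 30 can be evaluated with a few lines of plain Python. The function names and the two-tap vectors below are hypothetical and serve only to show that the measure ignores a scaling of the estimate, since the estimated RIR is first projected onto the true one.

```python
import math


def dot(x, y):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))


def npm_db(h_true, h_est):
    """Normalized Projection Misalignment in dB (Eqs. 29 and 30)."""
    scale = dot(h_true, h_est) / dot(h_est, h_est)
    eps = [a - scale * b for a, b in zip(h_true, h_est)]
    return 20.0 * math.log10(math.sqrt(dot(eps, eps)) / math.sqrt(dot(h_true, h_true)))


# Hypothetical two-tap responses: the second estimate is a scaled copy of the first.
print(npm_db([1.0, 0.0], [1.0, 1.0]))
print(npm_db([1.0, 0.0], [2.0, 2.0]))
```

An estimate that matches the true response up to a scale factor yields a zero misalignment vector, for which the logarithm is undefined; a practical implementation would need to guard against that case.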

Fig. 5. NPM curves for the "Real" and "Oracle" speaker diarization systems.

Fig. 5 shows the NPM curve for the identification of the RIRs relative to source *s*<sub>1</sub> at *T*<sub>60</sub> = 240 ms for an input signal of 40 s. In order to understand how the performance of the Speaker Diarization stage affects the RIR identification, we compare the curves obtained for the ORACLE-SD case, where the speaker diarization operates in an "Oracle" fashion, i.e. at 100% of its possibilities, and for the REAL-SD case. As expected, the REAL-SD NPM is always above the ORACLE-SD NPM. Parts where the curves are flat indicate speech segments in which source *s*<sub>1</sub> is not the only active source, i.e. it overlaps with *s*<sub>2</sub>, or in which there is silence.

#### **4.3 Full system evaluation**

In this section, the objective is to evaluate the recognition capabilities of the ASR engine fed by speech signals coming from the multichannel DSP front-end; the performance metric employed is therefore the word recognition accuracy.

The word recognition accuracy obtained assuming ideal source separation and dereverberation is 93.60%. This situation will be denoted as "Reference" in the remainder of the section.

Four different setups have been addressed:

• Unprocessed: The recognition is performed on the reverberant speech mixture acquired from Mic2 (see Fig. 4);

• ASR w/o SD: The ASRs do not exploit the speaker diarization output;

• ASR w/ ORACLE-SD: The ASRs exploit the "Oracle" speaker diarization output;

• ASR w/ REAL-SD: The ASRs exploit the "Real" speaker diarization output.

Fig. 6 reports the word accuracy for both the low and high reverberant conditions when the complete test file is processed by the multi-channel DSP front-end and recognition is performed on the separated and dereverberated streams (*Overall*) for all three setups. Fig. 7 shows the word accuracy values attained when the recognition is performed starting from the first silence frame after the BCI and Dereverberation stages converge<sup>3</sup> (*Convergence*).

<sup>3</sup> Additional experiments have demonstrated that this is reached after 20–25 s of speech activity.

Fig. 6. Word accuracy for the *Overall* case. (Bar chart: word accuracy (%) of the Unprocessed, ASR w/o SD, ASR w/ ORACLE-SD and ASR w/ REAL-SD setups at *T*<sub>60</sub> = 120 ms and *T*<sub>60</sub> = 240 ms, compared with the Reference.)

Fig. 7. Word accuracy for the *Convergence* case. (Same layout as Fig. 6.)

Observing the results of Fig. 6, it can be immediately stated that feeding the ASR engine with unprocessed audio files leads to very poor performance. The missing source separation and the resulting mismatch between each speaker and the corresponding word transcriptions produce a significant number of insertions, which explains the occurrence of negative word accuracy values.

Conversely, when the audio streams are processed, the ASRs are able to recognize most of the spoken words, especially once the front-end algorithms have reached convergence. The use of speaker diarization information to drive the ASRs' activity significantly increases the performance. As expected, using the "Real" speaker diarization instead of the "Oracle" one leads to a decrease in performance of about 15% for the low reverberant condition and of about 10% for the high reverberant condition. Despite this, the word accuracy is still higher than the one obtained without speaker diarization, providing an average increase of about 20% for both reverberation times.

In the *Convergence* evaluation case study, when *T*<sub>60</sub> = 120 ms and the "Oracle" speaker diarization is employed, a word accuracy of 86.49% is obtained, which is about 7% less than the result attainable in the "Reference" conditions. In this case, the use of the "Real" speaker diarization leads to a decrease of only 8%. As expected, the reverberation effect has a negative impact on the recognition performance, especially in the presence of high reverberation, i.e. *T*<sub>60</sub> = 240 ms. However, it must be observed that the convergence margin is even more significant w.r.t. the low reverberant scenario, further highlighting the effectiveness of the proposed algorithmic framework as a multichannel front-end.
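The negative word accuracy values mentioned for the unprocessed case follow directly from how the metric is usually computed. The sketch below assumes the HTK-style definition, in which insertions are subtracted from the number of correctly recognized words; the function name and the error counts are hypothetical and are not taken from the experiments reported here.

```python
def word_accuracy(n_ref, substitutions, deletions, insertions):
    """HTK-style word accuracy in percent: 100 * (N - S - D - I) / N.

    n_ref is the number of words in the reference transcription.
    Since insertions are unbounded, the value can drop below zero.
    """
    hits = n_ref - substitutions - deletions
    return 100.0 * (hits - insertions) / n_ref

# Hypothetical counts for a well-separated stream.
print(word_accuracy(100, 10, 5, 0))     # 85.0

# Hypothetical counts for an unseparated mixture: words spoken by the
# other talker are recognized as insertions and push the score below zero.
print(word_accuracy(100, 30, 10, 120))  # -60.0
```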

#### **5. Conclusion**

In this paper, an ASR system was successfully enhanced by an advanced multi-channel front-end to recognize the speech content coming from multiple speakers in reverberated acoustic conditions. The overall architecture is able to blindly identify the impulse responses, to separate the existing multiple overlapping sources, to dereverberate them and to recognize the information contained within the original utterances. A speaker diarization system able to steer the BCI stage and the ASRs has also been included in the overall framework. All the algorithms work in real-time and a PC-based implementation of them has been discussed in this contribution. Performed simulations, based on an existing large vocabulary database (WSJ) and suitably addressing the acoustic scenario under test, have shown the effectiveness of the developed system, making it appealing in real-life human-machine interaction scenarios. As future work, an overlap detector will be integrated in the speaker diarization system and its impact in terms of final recognition accuracy will be evaluated. In addition, other applications different from ASR, such as emotion recognition (Schuller et al. (2011)), dominance detection (Hung et al. (2011)) or keyword spotting (Wöllmer et al. (2011)), will be considered in order to assess the effectiveness of the front-end in other recognition tasks.

#### **6. References**

Egger, H. & Engl, H. (2005). Tikhonov regularization applied to the inverse problem of option pricing: convergence analysis and rates, *Inverse Problems* 21(3): 1027–1045.

Fredouille, C., Bozonnet, S. & Evans, N. (2009). The LIA-EURECOM RT'09 Speaker Diarization System, *RT'09, NIST Rich Transcription Workshop*, Melbourne, Florida, USA.

Guillaume, M., Grenier, Y. & Richard, G. (2005). Iterative algorithms for multichannel equalization in sound reproduction systems, *Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing*, Vol. 3, pp. iii/269–iii/272.

Habets, E. (2008). Room impulse response (RIR) generator. URL: *http://home.tiscali.nl/ehabets/rirgenerator.html*

Haque, M., Bashar, M. S., Naylor, P., Hirose, K. & Hasan, M. K. (2007). Energy constrained frequency-domain normalized LMS algorithm for blind channel identification, *Signal, Image and Video Processing* 1(3): 203–213.

Haque, M. & Hasan, M. K. (2008). Noise robust multichannel frequency-domain LMS algorithms for blind channel identification, *IEEE Signal Processing Letters* 15: 305–308.

Hikichi, T., Delcroix, M. & Miyoshi, M. (2007). Inverse filtering for speech dereverberation less sensitive to noise and room transfer function fluctuations, *EURASIP Journal on Advances in Signal Processing* 2007(1).

Huang, Y. & Benesty, J. (2003). A class of frequency-domain adaptive approaches to blind multichannel identification, *IEEE Transactions on Speech and Audio Processing* 51(1): 11–24.

Huang, Y., Benesty, J. & Chen, J. (2005). A Blind Channel Identification-Based Two-Stage Approach to Separation and Dereverberation of Speech Signals in a Reverberant Environment, *IEEE Transactions on Speech and Audio Processing* 13(5): 882–895.

Hung, H., Huang, Y., Friedland, G. & Gatica-Perez, D. (2011). Estimating dominance in multi-party meetings using speaker diarization, *IEEE Transactions on Audio, Speech, and Language Processing* 19(4): 847–860.

Miyoshi, M. & Kaneda, Y. (1988). Inverse filtering of room acoustics, *IEEE Transactions on Signal Processing* 36(2): 145–152.

Morgan, D., Benesty, J. & Sondhi, M. (1998). On the evaluation of estimated impulse responses, *IEEE Signal Processing Letters* 5(7): 174–176.

Naylor, P. & Gaubitch, N. (2010). *Speech Dereverberation*, Signals and Communication Technology, Springer.

Oppenheim, A. V., Schafer, R. W. & Buck, J. R. (1999). *Discrete-Time Signal Processing*, 2 edn, Prentice Hall, Upper Saddle River, NJ.

Principi, E., Cifani, S., Rocchi, C., Squartini, S. & Piazza, F. (2009). Keyword spotting based system for conversation fostering in tabletop scenarios: Preliminary evaluation, *Proc. of 2nd Conference on Human System Interactions*, pp. 216–219.

Renals, S. (2005). AMI: Augmented Multiparty Interaction, *Proc. NIST Meeting Transcription Workshop*.

Rocchi, C., Principi, E., Cifani, S., Rotili, R., Squartini, S. & Piazza, F. (2009). A real-time speech-interfaced system for group conversation modeling, *19th Italian Workshop on Neural Networks*, pp. 70–80.

Rotili, R., Cifani, S., Principi, E., Squartini, S. & Piazza, F. (2008). A robust iterative inverse filtering approach for speech dereverberation in presence of disturbances, *Proceedings of IEEE Asia Pacific Conference on Circuits and Systems*, pp. 434–437.

Rotili, R., De Simone, C., Perelli, A., Cifani, A. & Squartini, S. (2010). Joint multichannel blind speech separation and dereverberation: A real-time algorithmic implementation, *Proceedings of 6th International Conference on Intelligent Computing*, pp. 85–93.

Schuller, B., Batliner, A., Steidl, S. & Seppi, D. (2011). Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge, *Speech Communication*.

Shriberg, E., Stolcke, A. & Baron, D. (2000). Observations on Overlap: Findings and Implications for Automatic Processing of Multi-Party Conversation, *Word Journal Of The International Linguistic Association* pp. 1–4.

Squartini, S., Ciavattini, E., Lattanzi, A., Zallocco, D., Bettarelli, F. & Piazza, F. (2005). NU-Tech: implementing DSP algorithms in a plug-in based software platform for real time audio applications, *Proceedings of 118th Convention of the Audio Engineering Society*.

Tur, G., Stolcke, A., Voss, L., Peters, S., Hakkani-Tur, D., Dowding, J., Favre, B., Fernandez, R., Frampton, M., Frandsen, M., Frederickson, C., Graciarena, M., Kintzing, D., Leveque, K., Mason, S., Niekrasz, J., Purver, M., Riedhammer, K., Shriberg, E., Tien, J., Vergyri, D. & Yang, F. (2010). The CALO meeting assistant system, *IEEE Trans. on Audio, Speech, and Lang. Process.* 18(6): 1601–1611.

Vertanen, K. (2006). Baseline WSJ acoustic models for HTK and Sphinx: Training recipes and recognition experiments, *Technical report*, Cavendish Laboratory, University of Cambridge. URL: *http://www.keithv.com/software/htk/us/*

Vinyals, O. & Friedland, G. (2008). Towards semantic analysis of conversations: A system for the live identification of speakers in meetings, *Proceedings of IEEE International Conference on Semantic Computing*, pp. 426–431.

Waibel, A., Steusloff, H., Stiefelhagen, R. & the CHIL Project Consortium (2004). CHIL: Computers in the Human Interaction Loop, *International Workshop on Image Analysis for Multimedia Interactive Services*.

Woelfel, M. & McDonough, J. (2009). *Distant Speech Recognition*, 1st edn, Wiley, New York.

Wöllmer, M., Marchi, E., Squartini, S. & Schuller, B. (2011). Multi-stream lstm-hmm decoding and histogram equalization for noise robust keyword spotting, *Cognitive Neurodynamics* 5: 253–264.

Wooters, C. & Huijbregts, M. (2008). The ICSI RT07s Speaker Diarization System, *in* R. Stiefelhagen, R. Bowers & J. Fiscus (eds), *Multimodal Technologies for Perception of Humans, Lecture Notes in Computer Science*, Springer-Verlag, Berlin, Heidelberg, pp. 509–519.

Xu, G., Liu, H., Tong, L. & Kailath, T. (1995). A Least-Squares Approach to Blind Channel Identification, *IEEE Transactions On Signal Processing* 43(12): 2982–2993.

Young, S., Everman, G., Kershaw, D., Moore, G. & Odell, J. (2006). *The HTK Book*, Cambridge University Engineering.

Yu, Z. & Er, M. (2004). A robust adaptive blind multichannel identification algorithm for acoustic applications, *Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing*, Vol. 2, pp. ii/25–ii/28.

**2**

## **Real-Time Dual-Microphone Speech Enhancement**

Trabelsi Abdelaziz, Boyer François-Raymond and Savaria Yvon *École Polytechnique de Montréal Canada* 

#### **1. Introduction**

In various applications such as mobile communications and digital hearing aids, the presence of interfering noise may cause serious deterioration in the perceived quality of speech signals. Thus, there exists considerable interest in developing speech enhancement algorithms that solve the problem of noise reduction in order to make the compensated speech more pleasant to a human listener. The noise reduction problem in single and multiple microphone environments has been extensively studied (Benesty et al., 2005; Ephraim & Malah, 1984). Single microphone speech enhancement approaches often fail to yield satisfactory performance, in particular when the interfering noise statistics are time-varying. In contrast, multiple microphone systems provide superior performance over the single microphone schemes at the expense of a substantial increase in implementation complexity and computational cost.

This chapter addresses the problem of enhancing a speech signal corrupted with additive noise when observations from two microphones are available. It is organized as follows. The next section presents different well-known and state-of-the-art noise reduction methods for speech enhancement. Section 3 surveys the spatial cross-power spectral density (CPSD) based noise reduction approach in the case of a dual-microphone arrangement, and also covers the well-known problems associated with the use of the CPSD-based approach. Section 4 describes the single channel noise spectrum estimation algorithm used to cope with the shortcomings of the CPSD-based approach, and uses this algorithm in conjunction with a soft-decision scheme to arrive at the proposed method, which we call the modified CPSD (MCPSD) based approach. Based on minimum statistics, the noise power spectrum estimator seeks to provide a good tradeoff between the amount of noise reduction and the speech distortion, while attenuating the high energy correlated noise components (i.e., coherent direct path noise), especially in the low frequency range. Section 5 provides objective measures, speech spectrograms and subjective listening test results from experiments comparing the performance of the MCPSD-based method with the cross-spectral subtraction (CSS) based approach, which is a dual-microphone method previously reported in the literature. Finally, Section 6 concludes the chapter.

#### **2. State of the art**

There have been several approaches proposed in the literature to deal with the noise reduction problem in speech processing, with varying degrees of success. These approaches can generally be divided into two main categories. The first category uses a single microphone system and exploits information about the speech and noise signal statistics for enhancement. The most often used single microphone noise reduction approaches are the spectral subtraction method and its variants (O'Shaughnessy, 2000).
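As a minimal illustration of the single microphone category, the sketch below applies power spectral subtraction to the magnitude spectrum of one frame. It is one common textbook variant rather than a method taken from this chapter: the over-subtraction factor `alpha`, the spectral floor `beta` and the function name are assumptions, and the noise power estimate is assumed to have been obtained beforehand (for example, during speech pauses). The phase of the noisy frame would normally be reused for resynthesis.

```python
import math

def spectral_subtraction(noisy_mag, noise_power, alpha=1.0, beta=0.01):
    """Power spectral subtraction for a single frame.

    noisy_mag   : per-bin magnitude of the noisy frame
    noise_power : per-bin estimate of the noise power
    alpha       : over-subtraction factor (assumed parameter)
    beta        : spectral floor relative to the noisy power (assumed)
    Returns the per-bin magnitude of the enhanced frame.
    """
    enhanced = []
    for mag, n_pow in zip(noisy_mag, noise_power):
        power = mag * mag
        # Subtract the noise power, but never go below the spectral floor.
        clean_power = max(power - alpha * n_pow, beta * power)
        enhanced.append(math.sqrt(clean_power))
    return enhanced

# Hypothetical 3-bin frame with a flat noise power estimate of 1.0 per bin.
print(spectral_subtraction([2.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```

The floor keeps bins dominated by noise from being set to zero, which is one of the usual ways of limiting the residual musical noise of the basic method.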

The second category of signal processing methods applicable to that situation involves using a microphone array system. These methods take advantage of the spatial discrimination of an array to separate speech from noise. The spatial information was exploited in (Kaneda & Tohyama, 1984) to develop a dual-microphone beamforming algorithm, which assumes a spatially uncorrelated noise field. This method was extended to an arbitrary number of microphones and combined with adaptive Wiener filtering in (Zelinski, 1988, 1990) to further improve the output of the beamformer. The authors in (McCowan & Bourlard, 2003) have replaced the spatially uncorrelated noise field assumption by a more accurate model based on an assumed knowledge of the noise field coherence function, and extended the CPSD-based approach to develop a more appropriate postfiltering scheme. However, both methods overestimate the noise power spectral density at the beamformer's output and, thus, they are suboptimal in the Wiener sense (Simmer & Wasiljeff, 1992). In (Lefkimmiatis & Maragos, 2007), the authors have obtained a more accurate estimation of the noise power spectral density at the output of the beamformer proposed in (Simmer & Wasiljeff, 1992) by taking into account the noise reduction performed by the minimum variance distortionless response (MVDR) beamformer.

The generalized sidelobe canceller (GSC) method, initially introduced in (Griffiths & Jim, 1982), was considered for the implementation of adaptive beamformers in various applications. It was found that this method performs well in enhancing the signal-to-noise ratio (SNR) at the beamformer's output without introducing further distortion to the desired signal components (Guerin et al., 2003). However, the achievable noise reduction performance is limited by the amount of incoherent noise. To cope with the spatially incoherent noise components, a GSC-based method that incorporates an adaptive Wiener filter in the look direction was proposed in (Fischer & Simmer, 1996). The authors in (Bitzer et al., 1999) have investigated the theoretical noise reduction limits of the GSC. They have shown that this structure performs well in anechoic rooms, but it does not work well in diffuse noise fields. By using a broadband array beamformer partitioned into several harmonically nested linear subarrays, the authors in (Fischer & Kammeyer, 1997) have shown that the resulting noise reduction system performance is nearly independent of the correlation properties of the noise field (i.e., the system is suitable for diffuse as well as for coherent noise fields). The GSC array structure was further investigated in (Marro et al., 1998). In (Cohen, 2004), the author proposed to incorporate into the GSC beamformer a multichannel postfilter suited to nonstationary noise environments. To discriminate desired speech transients from interfering transients, he used both the GSC beamformer primary output and the reference noise signals. To obtain a real-time implementation of the method, the author suggested, in an earlier paper (Cohen, 2003a), feeding back to the beamformer the discrimination decisions made by the postfilter.

In the dual-microphone noise reduction context, the authors in (Le Bouquin-Jannès et al., 1997) have proposed to modify both the Wiener and the coherence-magnitude based filters by including a cross-power spectrum estimation to take some correlated noise components into account. In this method, the cross-power spectral density of the two input signals was averaged during speech pauses and subtracted from the estimated CPSD in the presence of speech. In (Guerin et al., 2003), the authors have suggested an adaptive smoothing parameter estimator to determine the noise CPSD that should be used in the coherence-magnitude based filter. By evaluating the required overestimation for the noise CPSD, the authors showed that the musical noise (resulting from large fluctuations of the smoothing parameter between speech and non-speech periods) could be carefully controlled, especially during speech activity. A simple soft-decision scheme based on minimum statistics to accurately estimate the noise CPSD was proposed in (Zhang & Jia, 2005).
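The pause-averaging and subtraction idea behind cross-spectral subtraction can be sketched as follows. This is only an illustration of the principle, not the filter of (Le Bouquin-Jannès et al., 1997): the first-order recursive smoothing with constant `lam`, the use of the magnitude of the difference, and both function names are assumptions, and the speech/pause decision is taken as given.

```python
def update_cpsd(prev, x1, x2, lam=0.9):
    """Recursively smoothed cross-power spectrum of two channels.

    prev   : previous per-bin CPSD estimate (complex values)
    x1, x2 : per-bin STFT coefficients of the two microphone signals
    lam    : smoothing constant in [0, 1) (assumed parameter)
    """
    return [lam * p + (1.0 - lam) * a * b.conjugate()
            for p, a, b in zip(prev, x1, x2)]

def css_speech_cpsd(cpsd_noisy, cpsd_noise):
    """Subtract the noise CPSD learned in pauses from the noisy CPSD.

    The magnitude of the difference is returned per bin; taking the
    magnitude is a simplifying assumption of this sketch.
    """
    return [abs(p_x - p_n) for p_x, p_n in zip(cpsd_noisy, cpsd_noise)]

# Hypothetical single-bin example.
noise_cpsd = update_cpsd([0j], [1 + 1j], [1 - 1j], lam=0.5)  # pause frame
noisy_cpsd = [3 + 5j]                                        # speech frame
print(css_speech_cpsd(noisy_cpsd, noise_cpsd))
```

In a complete system, `update_cpsd` would be applied to the noise estimate only in frames classified as pauses and to the noisy estimate in every frame.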

Considering their ease of implementation and lower computational cost compared with approaches requiring arrays of more than two microphones, dual-microphone solutions remain a promising class of speech enhancement systems. Their simpler array processing is expected to lead to lower power consumption while still maintaining sufficiently good performance, in particular for compact portable applications (e.g., digital hearing aids and hands-free telephones). The CPSD-based approach (Zelinski, 1988, 1990), the adaptive noise canceller (ANC) approach (Maj et al., 2006; Berghe & Wooters, 1998), and the CSS-based approach (Guerin et al., 2003; Le Bouquin-Jannès et al., 1997; Zhang & Jia, 2005) are well-known examples. The CPSD-based approach lacks robustness in a number of practical noise fields (e.g., coherent noise). The standard ANC method introduces high speech distortion in the presence of crosstalk interference between the two microphones. The CSS-based approach, as previously reported in the literature, provides interesting performance in a variety of noise fields. However, it lacks efficiency in dealing with highly nonstationary noises such as multitalker babble. This issue will be further discussed later in this chapter.

#### **3. CPSD-based noise reduction approach**

This section introduces the signal model and gives a brief review of the CPSD-based approach in the case of a dual-microphone arrangement. Let *s*(*t*) be a speech signal of interest, and let the signal vector $n(t) = [n\_1(t) \ n\_2(t)]^T$ denote two-channel noise signals at the output of two spatially separated microphones. The sampled noisy signal $x\_m(i)$ observed at the *m*th microphone can then be modeled as

$$x\_m(i) = s(i) + n\_m(i), \quad m = 1, 2 \tag{1}$$

where *i* is the sampling time index. The observed noisy signals are segmented into overlapping time frames by applying a window function and they are transformed into the frequency domain using the short-time Fourier transform (STFT). Thus, we have for a given time frame:

$$X(k,l) = S(k,l) + N(k,l) \tag{2a}$$

where *k* is the frequency bin index, and *l* is the time index, and where

$$X(k,l) = \begin{bmatrix} X\_1(k,l) & X\_2(k,l) \end{bmatrix}^T \tag{2b}$$

$$N(k,l) = \begin{bmatrix} N\_1(k,l) & N\_2(k,l) \end{bmatrix}^T \tag{2c}$$


The CPSD-based noise reduction approach is derived from Wiener's theory, which solves the problem of optimal signal estimation in the mean-square error sense. The Wiener filter weights the spectral components of the noisy signal according to the signal-to-noise power spectral density ratio at individual frequencies given by:

$$W(k,l) = \frac{\Phi\_{SS}(k,l)}{\Phi\_{X\_mX\_m}(k,l)} \tag{3}$$

where $\Phi\_{SS}(k,l)$ and $\Phi\_{X\_mX\_m}(k,l)$ are respectively the power spectral densities (PSDs) of the desired signal and the input signal to the *m*th microphone.

For the formulation of the CPSD-based noise reduction approach, the following assumptions are made:

1. The noise signals are spatially uncorrelated, $E\{N\_1^{\ast}(k,l) \cdot N\_2(k,l)\} = 0$;
2. The desired signal $S(k,l)$ and the noise signal $N\_m(k,l)$ are statistically independent random processes, $E\{S^{\ast}(k,l) \cdot N\_m(k,l)\} = 0$, $m = 1, 2$;
3. The noise PSDs are the same on the two microphones.

Under those assumptions, the unknown PSD $\Phi\_{SS}(k,l)$ in (3) can be obtained from the estimated spatial CPSD $\hat{\Phi}\_{X\_1X\_2}(k,l)$ between the noisy microphone signals. To improve the estimation, the estimated PSDs are averaged over the microphone pair, leading to the following transfer function:

$$\hat{W}(k,l) = \frac{\Re\{\hat{\Phi}\_{X\_1X\_2}(k,l)\}}{(\hat{\Phi}\_{X\_1X\_1}(k,l) + \hat{\Phi}\_{X\_2X\_2}(k,l))/2} \tag{4}$$

where $\Re\{\cdot\}$ is the real operator, and the hat symbol denotes an estimated value. It should be noted that only the real part of the estimated CPSD in the numerator of equation (4) is used, based on the fact that both the auto-power spectral density of the speech signal and the spatial cross-power spectral density of a diffuse noise field are real functions.

There are three well-known drawbacks associated with the use of the CPSD-based approach. First, the noise signals on different microphones often hold correlated components, especially in the low frequency range, as is the case in a diffuse noise field (Simmer et al., 1994). Second, such an approach usually gives rise to an audible residual noise that has a cosine-shaped power spectrum that is not pleasant to a human listener (Le Bouquin-Jannès et al., 1997). Third, applying the derived transfer function to the output signal of a conventional beamformer yields an effective reduction of the remaining noise components but at the expense of an increased noise bias, especially when the number of microphones is too large (Simmer & Wasiljeff, 1992). In the next section, we will focus our attention on estimating and discarding the residual and coherent noise components resulting from the use of the CPSD-based approach in the case of a dual-microphone arrangement. For such a system, the overestimation of the noise power spectral density should not be a problem.

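As a rough illustration of the transfer function in equation (4), the sketch below estimates the auto- and cross-power spectral densities of a single frequency bin by first-order recursive averaging over frames and then forms the gain. The function name `cpsd_gain`, the smoothing constant `beta`, the small regularizer `eps`, and the use of recursive averaging as the PSD estimator are assumptions made for this example only; the chapter does not prescribe them.

```python
def cpsd_gain(x1_frames, x2_frames, beta=0.8, eps=1e-12):
    """Per-frame gain of equation (4) for one frequency bin.

    x1_frames, x2_frames: complex STFT coefficients X1(k,l), X2(k,l) over frames l.
    beta: assumed smoothing constant of the recursive PSD estimates.
    eps: assumed regularizer that avoids a division by zero.
    """
    phi11 = 0.0   # estimated auto-PSD of microphone 1
    phi22 = 0.0   # estimated auto-PSD of microphone 2
    phi12 = 0j    # estimated cross-PSD between the two microphones
    gains = []
    for x1, x2 in zip(x1_frames, x2_frames):
        phi11 = beta * phi11 + (1 - beta) * abs(x1) ** 2
        phi22 = beta * phi22 + (1 - beta) * abs(x2) ** 2
        phi12 = beta * phi12 + (1 - beta) * x1 * x2.conjugate()
        # Real part of the CPSD over the averaged auto-PSDs, as in (4).
        gains.append(phi12.real / ((phi11 + phi22) / 2 + eps))
    return gains
```

With identical inputs on both channels the gain approaches one, while inputs in phase quadrature give a gain near zero, which is the behavior expected from the real part of the cross-spectrum.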

#### **4. Dual-microphone speech enhancement system**

In this section, we review the basic concepts of the noise power spectrum estimator algorithm on which the MCPSD method, presented later, is based. Then, we use a variation of this algorithm in conjunction with a soft-decision scheme to cope with the shortcomings of the CPSD-based approach.

#### **4.1 Noise power spectrum estimation**

For highly nonstationary environments, such as multitalker babble, the noise spectrum needs to be estimated and updated continuously to allow effective noise reduction. A variety of methods were recently reported that continuously update the noise spectrum estimate while avoiding the need for explicit speech pause detection. In (Martin, 2001), a method known as minimum statistics (MS) was proposed for estimating the noise spectrum by tracking the minimum of the noisy speech over a finite window. The authors in (Cohen & Berdugo, 2002) suggested a minima controlled recursive algorithm (MCRA) which updates the noise spectrum estimate by tracking the noise-only periods of the noisy speech. These periods were found by comparing the ratio of the noisy speech to the local minimum against a fixed threshold. In the improved MCRA approach (Cohen, 2003b), the noise-only periods of the noisy signal were instead tracked based on the estimated speech-presence probability. Because its simplicity facilitates an affordable real-time implementation in terms of hardware, power and energy, the MS method was considered for estimating the noise power spectrum.

The MS algorithm tracks the minima of a short-term power estimate of the noisy signal within a time window of about 1 s. Let $\hat{P}(k,l)$ denote the smoothed spectrum of the squared magnitude of the noisy signal $X(k,l)$, estimated at frequency *k* and frame *l* according to the following first-order recursive averaging:

$$
\hat{P}(k,l) = \hat{\alpha}(k,l) \cdot \hat{P}(k,l-1) + (1 - \hat{\alpha}(k,l)) \cdot |X(k,l)|^2 \tag{5}
$$

where $\hat{\alpha}(k,l)$, which takes values between 0 and 1, is a time- and frequency-dependent smoothing parameter. The spectral minimum at each time and frequency index is obtained by tracking the minimum of *D* successive estimates of $\hat{P}(k,l)$, regardless of whether speech is present or not, and is given by the following equation:

$$
\hat{P}\_{\min}(k,l) = \min(\hat{P}\_{\min}(k,l-1), \hat{P}(k,l))\tag{6}
$$

Because the minimum value of a set of random variables is smaller than their average, the noise spectrum estimate is usually biased. Let $B\_{\min}(k,l)$ denote the factor by which the minimum is smaller than the mean. This bias compensation factor is determined as a function of the minimum search window length *D* and the inverse normalized variance $Q\_{eq}(k,l)$ of the smoothed spectrum estimate $\hat{P}(k,l)$. The resulting unbiased estimator of the noise spectrum $\hat{\sigma}\_n^2(k,l)$ is then given by:

$$
\hat{\sigma}\_n^2(k,l) = B\_{\text{min}}(k,l) \cdot \hat{P}\_{\text{min}}(k,l) \tag{7}
$$


To make the adaptation of the minimum estimate faster, the search window of *D* samples is subdivided into *U* subwindows of *V* samples (*D* = *U*·*V*) and the noise PSD estimate is updated every *V* subsequent PSD estimates $\hat{P}(k,l)$. In case of a sudden increase in the noise floor, the noise PSD estimate is updated when a local minimum with amplitude in the vicinity of the overall minimum is detected. The minimum estimate, however, lags behind by at most *D* + *V* samples when the noise power increases abruptly. It should be noted that the noise power estimator in (Martin, 2001) tends to underestimate the noise power, in particular when frame-wise processing with considerable frame overlap is performed. This underestimation problem is known, and further investigation on the adjustment of the bias of the spectral minimum can be found in (Martin, 2006) and (Mauler & Martin, 2006).
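The recursion in equations (5) to (7) can be sketched for a single frequency bin as follows. This is a simplified illustration, not the estimator of (Martin, 2001): the smoothing parameter `alpha` and the bias factor `b_min` are held constant here, whereas the original method adapts both over time and frequency, and the names `ms_noise_estimate` and `window_len` are chosen only for this example.

```python
from collections import deque


def ms_noise_estimate(power_frames, alpha=0.85, window_len=100, b_min=1.0):
    """Minimum-statistics style noise estimate for one frequency bin.

    power_frames: values of |X(k,l)|^2 over frames l.
    alpha: assumed constant smoothing parameter (adaptive in the original method).
    window_len: search window length D, in frames.
    b_min: assumed constant bias factor; 1.0 is a placeholder (no compensation).
    """
    smoothed = None
    recent = deque(maxlen=window_len)  # last D smoothed power values
    estimates = []
    for power in power_frames:
        # Equation (5): first-order recursive averaging.
        if smoothed is None:
            smoothed = power
        else:
            smoothed = alpha * smoothed + (1 - alpha) * power
        recent.append(smoothed)
        # Equations (6) and (7): windowed minimum scaled by the bias factor.
        estimates.append(b_min * min(recent))
    return estimates
```

Because the estimate follows the windowed minimum, a short burst of speech energy raises the smoothed power but leaves the noise estimate unchanged as long as a noise-only frame remains inside the window.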

#### **4.2 Dual-microphone noise reduction system**

Although the CPSD-based method has shown its effectiveness in various practical noise fields, its performance could be increased if the residual and coherent noise components were estimated and discarded from the output spectrum. In the MCPSD-based method, this is done by adding a noise power estimator in conjunction with a soft-decision scheme to achieve a good tradeoff between noise reduction and speech distortion, while still guaranteeing real-time behavior. Fig. 1 shows an overview of the MCPSD-based system, which is described in detail in this section.

We consider the case in which the average of the STFT magnitude spectra of the noisy observations received by the two microphones, $|Y(k,l)| = (|X\_1(k,l)| + |X\_2(k,l)|)/2$, is multiplied by a spectral gain function *G*(*k*,*l*) to approximate the magnitude spectrum of the sound signal of interest, that is

$$|\hat{S}(k,l)| = G(k,l) \cdot |Y(k,l)| \tag{8}$$

The gain function *G*(*k*,*l*) is obtained by using equation (4), and can be expressed in the following extended form as

$$G(k,l) = \frac{(|X\_1(k,l)| \cdot |X\_2(k,l)|) \cdot \cos(\Delta\varphi(k,l))}{(|X\_1(k,l)|^2 + |X\_2(k,l)|^2)/2} \tag{9a}$$

where

$$
\Delta\varphi(k,l) = \varphi\_{X\_1}(k,l) - \varphi\_{X\_2}(k,l) \tag{9b}
$$

and where $\varphi\_{X\_1}(k,l)$ and $\varphi\_{X\_2}(k,l)$ denote the phase spectra of the STFTs $X\_1(k,l)$ and $X\_2(k,l)$, respectively, that satisfy a relationship bounding $|\varphi\_{X\_1}(k,l) - \varphi\_{X\_2}(k,l)|$ by $\pi/2$. In the implementation of the MCPSD-based approach, any negative values of the gain function *G*(*k*,*l*) are reset to a minimum spectral floor, on the assumption that such frequencies cannot be recovered. Moreover, good results can be obtained when the gain function *G*(*k*,*l*) is squared, which improves the selectivity toward signals coming from the direct path.
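The gain in equations (9a) and (9b), together with the flooring and squaring steps just described, might be sketched for one time-frequency bin as below. The function name `mcpsd_gain`, the floor value of 0.01, and the choice to apply the floor before squaring are assumptions of this example; the chapter does not state the floor used at this stage or the order of the two operations.

```python
import cmath
import math


def mcpsd_gain(x1, x2, floor=0.01, squared=True):
    """Gain of equation (9a) for one bin, with flooring and optional squaring.

    x1, x2: complex STFT coefficients X1(k,l) and X2(k,l).
    floor: assumed minimum spectral floor for negative gains.
    squared: whether to square the non-negative gain.
    """
    # Equation (9b): phase difference between the two channels.
    delta_phi = cmath.phase(x1) - cmath.phase(x2)
    numerator = abs(x1) * abs(x2) * math.cos(delta_phi)
    denominator = (abs(x1) ** 2 + abs(x2) ** 2) / 2
    if denominator == 0.0:
        return floor  # silent bin: nothing to recover
    gain = numerator / denominator
    if gain < 0.0:
        return floor  # negative gains are reset to the spectral floor
    return gain * gain if squared else gain
```

For two equal-magnitude coefficients the unsquared gain reduces to the cosine of the phase difference, so in-phase components are passed and components close to phase opposition are suppressed.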


Fig. 1. The proposed dual-microphone noise reduction system for speech enhancement, where "| |" denotes the magnitude spectrum.

To track the residual and coherent noise components that are often present in the estimated spectrum in (8), a variation of the MS algorithm was implemented as follows. In performing the running spectral minima search, the *D* subsequent noise PSD estimates were divided into two sliding data subwindows of *D*/2 samples. Whenever *D*/2 samples were processed, the minimum of the current subwindow was stored for later use. The sub-band noise power estimate $\hat{\sigma}\_n^2(k,l)$ was obtained by picking the minimum value of the current signal PSD estimate and the latest *D*/2 PSD values. The sub-band noise power was updated at each time step. As a result, a fast update of the minimum estimate was achieved in response to a falling noise power. In case of a rising noise power, the update of the minimum estimate was delayed by *D* samples. For accurate power estimates, the bias correction factor introduced in (Martin, 2001) was scaled by a constant decided empirically. This constant was obtained by performing the MS algorithm on a white noise signal so that the estimated output power had to match exactly that of the driving noise in the mean sense.
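One possible reading of the two-subwindow minimum search described above is sketched here for a single frequency bin. It keeps the stored minimum of the last completed subwindow and the running minimum of the subwindow being filled, and returns the smaller of the two at every frame. The function name `track_minimum` and this exact bookkeeping are assumptions, the bias correction is omitted, and the authors' implementation may differ in detail.

```python
def track_minimum(psd_frames, window_len=100):
    """Running minimum over two subwindows of window_len // 2 frames each.

    psd_frames: PSD estimates of one frequency bin over frames.
    window_len: search window length D, in frames (100 in the experiments).
    """
    half = window_len // 2
    prev_min = float("inf")  # stored minimum of the last completed subwindow
    cur_min = float("inf")   # running minimum of the subwindow being filled
    count = 0
    tracked = []
    for psd in psd_frames:
        cur_min = min(cur_min, psd)
        count += 1
        # Updated at every frame, so a falling noise power is followed at once.
        tracked.append(min(prev_min, cur_min))
        if count == half:
            # Subwindow complete: store its minimum and start a new one.
            prev_min, cur_min, count = cur_min, float("inf"), 0
    return tracked
```

In this sketch a drop in the input is reflected in the same frame, whereas a rise only appears once the low values have left both subwindows, which takes no more than `window_len` frames.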

To discard the estimated residual and coherent noise components, a soft-decision scheme was implemented. For each frequency bin *k* and frame index *l*, the signal to noise ratio was estimated. The signal power was estimated from equation (8) and the noise power was the latest estimated value from equation (7). This ratio, called difference in level (*DL*), was calculated as follows:

$$DL = 10 \cdot \log\_{10} \left( \frac{|\hat{S}(k, l)|^2}{\hat{\sigma}\_n^2(k, l)} \right) \tag{10}$$

The estimated *DL* value was then compared to a fixed threshold $Th\_s$ decided empirically. Based on that comparison, a running decision was taken by preserving the sound frequency bins of interest and reducing the noise bins to a minimum spectral floor. That is,

$$|\hat{S}(k,l)| = \begin{cases} |\tilde{S}(k,l)| \cdot \lambda & \text{if } DL < 0\\ |\tilde{S}(k,l)| \cdot \left( \left(\frac{DL}{Th\_s}\right)^2 \cdot (1-\lambda) + \lambda \right) & \text{if } DL < Th\_s\\ |\tilde{S}(k,l)| & \text{otherwise} \end{cases} \tag{11a}$$

where

$$|\tilde{S}(k,l)| = \sqrt{|\hat{S}(k,l)|^2 - \hat{\sigma}\_n^2(k,l)}\tag{11b}$$
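The soft-decision rule of equations (10), (11a) and (11b) can be illustrated for one time-frequency bin with the sketch below. The threshold of 5 dB and the floor of −40 dB follow the values reported in this chapter; the function name `soft_decision`, the clamping of the square-root argument at zero, and the handling of zero-valued inputs are assumptions made for the example.

```python
import math


def soft_decision(s_hat_mag, noise_var, th_s=5.0, floor_db=-40.0):
    """Apply equations (10), (11a) and (11b) to one time-frequency bin.

    s_hat_mag: estimated magnitude |S^(k,l)| from equation (8).
    noise_var: noise power estimate from equation (7).
    th_s: decision threshold in dB.
    floor_db: spectral floor in dB, so that 20*log10(lam) = floor_db.
    """
    lam = 10 ** (floor_db / 20)
    # Equation (11b), with the argument clamped so the root stays real.
    s_tilde = math.sqrt(max(s_hat_mag ** 2 - noise_var, 0.0))
    if noise_var <= 0.0:
        return s_tilde  # assumed: no noise estimate, keep the bin
    if s_hat_mag == 0.0:
        return 0.0      # assumed: empty bin, DL would be minus infinity
    # Equation (10): difference in level, in dB.
    dl = 10 * math.log10(s_hat_mag ** 2 / noise_var)
    if dl < 0.0:
        return s_tilde * lam
    if dl < th_s:
        return s_tilde * ((dl / th_s) ** 2 * (1 - lam) + lam)
    return s_tilde
```

The quadratic weight rises smoothly from the floor at 0 dB to unity at the threshold, so bins just above the noise level are attenuated gradually rather than switched off.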

The floor factor $\lambda$ in equation (11a) was chosen such that $20 \cdot \log\_{10} \lambda = -40$ dB. The argument of the square-root function in equation (11b) was restricted to positive values in order to guarantee real-valued results. When the estimated *DL* value is lower than the threshold $Th\_s$, the quadratic function $(DL/Th\_s)^2 \cdot (1-\lambda) + \lambda$ allows the estimated spectrum to be smoothed during noise reduction. It should be noted that *DL* is expected to take positive values during speech activity and negative values during speech pauses.

Finally, the estimated magnitude spectrum in (11) was combined with the average of the phase spectra of the two received signals prior to estimating the time signal of interest. In addition to the 6 dB reduction in phase noise, the time waveform resulting from this combination provided a better match to the sound signal of interest coming from the direct path. After an inverse DFT of the enhanced spectrum, the resultant time waveform was half-overlapped and added to adjacent processed segments to produce an approximation of the sound signal of interest (see Fig. 1).

#### **5. Performance evaluation and results**

This section presents the performance evaluation of the MCPSD-based method, as well as the results of experiments comparing this method with the CSS-based approach. In all the experiments, the analysis frame length was set to 1024 data samples (23 ms at 44.1 kHz sampling rate) with 50% overlap. The analysis and synthesis windows (Hann windows) thus had a perfect reconstruction property. The sliding window length of *D* subsequent PSD estimates was set to 100 samples. The threshold $Th\_s$ was fixed to 5 dB. The recordings were made using a Presonus Firepod recording interface and two Shure KSM137 cardioid microphones placed approximately 20 cm apart. The experimental environment of the MCPSD is depicted in Fig. 2. The room, with dimensions of 5.5 x 3.5 x 3 m, enclosed a speech source situated at a distance of 0.5 m directly in front (0 degrees azimuth) of the input microphones, and a masking source of noise located at a distance of 0.5 m from the speech source.

Fig. 2. Overhead view of the experimental environment.

Designed to be equally intelligible in noise, five sentences taken from the Hearing in Noise Test (HINT) database (Nilsson et al., 1994) were recorded at a sampling frequency of 44.1 kHz. They are:

1. Sentence 1 (male talker): "Flowers grow in the garden".
2. Sentence 2 (female talker): "She looked in her mirror".
3. Sentence 3 (male talker): "The shop closes for lunch".
4. Sentence 4 (female talker): "The police helped the driver".
5. Sentence 5 (male talker): "A boy ran down the path".

Four different noise types, namely white Gaussian noise, helicopter rotor noise, impulsive noise and multitalker babble noise, were recorded at the same sampling rate and used throughout the experiments. The noise was scaled in power level and added acoustically to the above sentences with a varying SNR. A global SNR estimation of the input data was used. It was computed by averaging power over the whole length of the two observed signals with:

$$\text{SNR} = 10 \cdot \log_{10} \left( \frac{\sum_{m=1}^{2} \sum_{i=1}^{I} s^2(i)}{\sum_{m=1}^{2} \sum_{i=1}^{I} \left(x_m(i) - s(i)\right)^2} \right) \tag{12}$$

where *I* is the number of data samples of the signal observed at the *m*th microphone. Throughout the experiments, the average of the two clean signals, $s(i) = \left(s_1(i) + s_2(i)\right)/2$, was used as the clean speech signal. Objective measures, speech spectrograms and subjective listening tests were used to demonstrate the performance improvement achieved with the MCPSD-based method over the CSS-based approach.
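As an illustration, the global SNR of equation (12) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function name `global_snr_db` and its argument names are hypothetical, and the signals are assumed to be equal-length sequences of samples that are already time-aligned.

```python
import math


def global_snr_db(x1, x2, s1, s2):
    """Global input SNR in dB, following Eq. (12).

    x1, x2: noisy signals observed at microphones 1 and 2.
    s1, s2: clean signals recorded at the same microphones.
    (Hypothetical names; assumes equal-length, time-aligned sequences.)
    """
    if not (len(x1) == len(x2) == len(s1) == len(s2)):
        raise ValueError("all four signals must have the same length I")

    # Clean reference: average of the two clean recordings.
    s = [(a + b) / 2.0 for a, b in zip(s1, s2)]

    # Numerator: s(i)^2 summed over all samples and both microphones.
    signal_energy = 2.0 * sum(v * v for v in s)

    # Denominator: (x_m(i) - s(i))^2 summed over all samples and both microphones.
    noise_energy = sum((x - v) ** 2 for xm in (x1, x2) for x, v in zip(xm, s))
    if noise_energy == 0.0:
        raise ValueError("observed signals equal the clean reference; SNR is unbounded")

    return 10.0 * math.log10(signal_energy / noise_energy)


# Toy data (not from the chapter): signal and noise with equal energy.
clean = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, 1.0, -1.0, -1.0]
noisy = [c + n for c, n in zip(clean, noise)]
print(global_snr_db(noisy, noisy, clean, clean))  # 0.0
```

Because the powers are accumulated over the whole recording, speech pauses contribute to the denominator but hardly to the numerator, so this global figure is generally lower than the SNR measured during speech activity alone.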

#### **5.1 Objective measures**

The Itakura-Saito (IS) distance (Itakura, 1975) and the log spectral distortion (LSD) (Mittal & Phamdo, 2000) were chosen to measure the differences between the clean and the test spectra. The IS distance has a correlation of 0.59 with subjective quality measures (Quakenbush et al., 1988). A typical range for the IS distance is 0–10, where lower values indicate better speech quality. The LSD provides a reasonable degree of correlation with subjective results. A range of 0–15 dB was considered for the selected LSD, where the minimum value of LSD corresponds to the best speech quality. In addition to the IS and LSD measures, a frame-based segmental SNR was used, which takes into consideration both speech distortion and noise reduction. In order to compute these measures, an utterance of sentence 1 was processed through the two methods (i.e., the MCPSD and CSS). The input SNR was varied from −8 dB to 8 dB in 4 dB steps.
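The chapter does not state the formula used for the frame-based segmental SNR. The sketch below assumes the commonly used definition, namely the average over frames of the per-frame SNR in dB between the clean reference and the signal under test. The function name, the clamping limits of −10 dB and 35 dB, and the skipping of silent frames are assumptions made for illustration; the default frame length of 1024 samples with 50% overlap matches the analysis settings reported for the experiments.

```python
import math


def segmental_snr_db(clean, test, frame_len=1024, hop=512,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB between a clean reference and a test signal.

    floor_db and ceil_db clamp each frame's SNR (assumed limits, not from
    the chapter) so that a few extreme frames do not dominate the average.
    """
    n = min(len(clean), len(test))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, hop):
        s = clean[start:start + frame_len]
        y = test[start:start + frame_len]
        signal_energy = sum(v * v for v in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(s, y))
        if signal_energy == 0.0:
            continue  # frame SNR is undefined for a silent reference frame
        if error_energy == 0.0:
            snr = ceil_db  # identical frames: cap at the upper limit
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)
```

Under this definition, a segmental SNR improvement would be obtained as `segmental_snr_db(clean, enhanced) - segmental_snr_db(clean, noisy)`, so that both residual noise and speech distortion in the enhanced signal lower the score.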

Values of the IS distance measure for various noise types and different input SNRs are presented in Tables 1 and 2 for signals processed by the different methods. Results in these tables were obtained by averaging the IS distance values over the length of sentence 1. They indicate that the CSS-based approach yielded more speech distortion than the MCPSD-based method, particularly in helicopter and impulsive noise environments. Fig. 3 illustrates the comparative results in terms of LSD measures between both methods for various noise types and different input SNRs. From this figure, it can be observed that, whereas the two methods showed comparable improvement in the case of impulsive noise, the estimated LSD values provided by the MCPSD-based method were the lowest in all noise conditions. In terms of segmental SNR, the MCPSD-based method provided a performance improvement of about 2 dB on average over the CSS-based approach. The largest improvement was achieved in the case of multitalker babble noise, while for impulsive noise the improvement was smaller. This is shown in Fig. 4.


| Input SNR (dB) | White noise: CSS | White noise: MCPSD | White noise: Noisy | Helicopter noise: CSS | Helicopter noise: MCPSD | Helicopter noise: Noisy |
|---|---|---|---|---|---|---|
| −8 | 1.88 | 0.62 | 3.29 | 2.81 | 1.92 | 3.28 |
| −4 | 1.4 | 0.43 | 2.82 | 2.18 | 1.29 | 2.62 |
| 0 | 0.78 | 0.3 | 2.23 | 1.72 | 0.95 | 2.18 |
| 4 | 0.51 | 0.24 | 1.64 | 1.28 | 0.71 | 1.7 |
| 8 | 0.34 | 0.25 | 1.18 | 0.87 | 0.47 | 1.24 |

Table 1. Comparative performance in terms of mean Itakura-Saito distance measure for white and helicopter noises and different input SNRs.

| Input SNR (dB) | Impulsive noise: CSS | Impulsive noise: MCPSD | Impulsive noise: Noisy | Babble noise: CSS | Babble noise: MCPSD | Babble noise: Noisy |
|---|---|---|---|---|---|---|
| −8 | 2.71 | 2.03 | 3.23 | 2.38 | 1.26 | 3.1 |
| −4 | 2.21 | 1.67 | 2.65 | 1.7 | 0.85 | 2.62 |
| 0 | 1.7 | 1.21 | 2.06 | 1.28 | 0.59 | 2.12 |
| 4 | 1.34 | 0.93 | 1.56 | 0.92 | 0.46 | 1.73 |
| 8 | 0.99 | 0.69 | 1.09 | 0.67 | 0.32 | 1.27 |

Table 2. Comparative performance in terms of mean Itakura-Saito distance measure for impulsive and babble noises and different input SNRs.

Fig. 3. Log spectral distortion measure for various noise types and levels, obtained using (○) CSS approach, and (□) the MCPSD-based method.

Fig. 4. Segmental SNR improvement for various noise types and levels, obtained using (○) CSS approach and (□) the MCPSD-based method.

#### **5.2 Speech spectrograms**

Objective measures alone do not provide an adequate evaluation of system performance. Speech spectrograms constitute a well-suited tool for analyzing the time-frequency behavior of any speech enhancement system. All the speech spectrograms presented in this section (Figs. 5–8) use sentence 1 corrupted with different background noises at SNR = 0 dB.

In the case of white Gaussian noise (Fig. 5), whereas both the MCPSD-based method and the CSS-based approach provided a sufficient amount of noise reduction, the spectrum of the former better preserved the desired speech components. In the case of helicopter rotor noise (Fig. 6), large residual noise components were observed in the spectrograms of the signals processed by the CSS-based approach. By contrast, the spectrogram of the signal processed by the MCPSD-based method indicated that the noise between the speech periods was noticeably reduced, while the shape of the speech periods was nearly unchanged. In the case of impulsive noise (Fig. 7), it can be observed that the CSS-based approach was less effective for this type of noise. In contrast, the spectrogram of the signal processed by the MCPSD-based method shows that the impulsive noise was moderately reduced in both the speech and noise periods. In the case of multitalker babble noise (Fig. 8), it can be seen that the CSS-based approach provided limited noise reduction, particularly in the noise-only periods. By contrast, good noise reduction was achieved by the MCPSD-based method over the entire spectrum.

We can conclude that, while the CSS-based approach afforded limited noise reduction, especially for highly nonstationary noise such as multitalker babble, the MCPSD-based method can deal efficiently with both stationary and transient noises, with less spectral distortion even in severely noisy environments.


Fig. 5. Speech spectrograms obtained with white Gaussian noise added at SNR=0 dB. (a) Clean speech (b) Noisy signal (c) CSS output (d) MCPSD output.

Fig. 6. Speech spectrograms obtained with helicopter rotor noise added at SNR=0 dB. (a) Clean speech (b) Noisy signal (c) CSS output (d) MCPSD output.


Fig. 7. Speech spectrograms obtained with impulsive noise added at SNR=0 dB. (a) Clean speech (b) Noisy signal (c) CSS output (d) MCPSD output.

Fig. 8. Speech spectrograms obtained with multitalker babble noise added at SNR=0 dB. (a) Clean speech (b) Noisy signal (c) CSS output (d) MCPSD output.

#### **5.3 Subjective listening tests**

In order to validate the objective performance evaluation, subjective listening tests were conducted with the MCPSD-based and CSS-based approaches. The different noise types considered in this study were added to utterances of the five sentences listed before with SNRs of −5, 0, and 5 dB. The test signals were recorded on a portable computer, and headphones were used during the experiments. The seven-grade comparison category rating (CCR) was used (ITU-T, Recommendation P.800, 1996). The two methods were scored by a panel of twelve subjects asked to rate every sequence of two test signals between −3 and 3. A negative score was given whenever the former test signal sounded more pleasant and natural to the listener than the latter. Zero was selected if there was no difference between the two test signals. For each subject, the following procedure was applied: 1) each sequence of two test signals was played with brief pauses in between tracks and repeated twice in a random order; 2) the listener was then asked whether they wished to hear the current sequence once more or skip to the next. This led to 60 scores for each test session, which took about 25 minutes per subject. The results, averaged over the 12 listeners' scores and the 5 test sentences, are shown in Fig. 9. For the considered background noises, CCRs ranging from 0.33 to 1.27 were achieved over the alternative approach. The maximum improvement of CCR was obtained in the case of helicopter noise (1.1) and multitalker babble noise (1.27), while the worst score was achieved for additive white noise (0.33). The reason behind the roughly similar performance of the two methods in the case of white noise can be understood by recognizing that the minimum statistics noise PSD estimator performs better in the presence of stationary noise as opposed to nonstationary noise.

Fig. 9. CCR improvement against CSS for various noise types and different SNRs.

#### **6. Conclusion**

Given two received signals corrupted by additive noise, adding a noise power spectrum estimator after the CPSD-based noise reduction system can substantially reduce the residual and coherent noise components that would otherwise be present at the output spectrum. The added noise power estimator seeks to provide a good tradeoff between the amount of noise reduction and the speech distortion, while attenuating the high energy correlated noise components, especially in the low frequency ranges. The performance of the modified CPSD-based method, referred to as MCPSD in this chapter, was evaluated against the CSS-based approach, a dual-microphone method previously reported in the literature. Objective evaluation results show that a performance improvement in terms of segmental SNR of about 2 dB on average can be achieved by the MCPSD-based method over the CSS-based approach. The best noise reduction was obtained in the case of multitalker babble noise, while the improvement was lower for impulsive noise. Subjective listening tests performed on a limited data set revealed that CCRs ranging from 0.33 to 1.27 can be achieved over the CSS-based approach. The maximum improvement of CCR was obtained in the case of helicopter and multitalker babble noises, while the worst score was achieved when white noise was added. A fruitful direction of further research would therefore be to extend the MCPSD-based method to multiple microphones as well as to investigate the benefits of such extension on the overall system performance.

#### **7. References**

Benesty, J. et al. (2005). Speech Enhancement, Springer, ISBN 978-3540240396, New York, USA.

Berghe, J.V. & Wooters, J. (1998). An adaptive noise canceller for hearing aids using two nearby microphones. *Journal of the Acoustical Society of America*, vol. 103, no. 6, pp. 3621–3626.

Bitzer, J. et al. (1999). Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement. *24th IEEE International Conference on Acoustics, Speech & Signal Processing*, vol. 5, pp. 2965–2968, Phoenix, USA, March 1999.

Cohen, I. & Berdugo, B. (2002). Noise estimation by minima controlled recursive averaging for robust speech enhancement. *IEEE Transaction on Signal & Audio Processing*, vol. 9, no. 1, pp. 12–15.

Cohen, I. et al. (2003a). An integrated real-time beamforming and postfiltering system for nonstationary noise environments. *EURASIP Journal on Applied Signal Processing*, pp. 1064–1073.

Cohen, I. (2003b). Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. *IEEE Transaction on Speech & Audio Processing*, vol. 11, no. 5, pp. 466–475.

Cohen, I. (2004). Multichannel post-filtering in nonstationary noise environments. *IEEE Transaction on Signal Processing*, vol. 52, no. 5, pp. 1149–1160.

Ephraim, Y. & Malah, D. (1984). Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. *IEEE Transaction on Audio, Speech & Signal Processing*, vol. 32, no. 6, pp. 1109–1121.

Fischer, S. & Simmer, K.U. (1996). Beamforming microphone arrays for speech acquisition in noisy environments. *Speech Communication*, vol. 20, no. 3–4, pp. 215–227.

Fischer, S. & Kammeyer, K.D. (1997). Broadband beamforming with adaptive postfiltering for speech acquisition in noisy environments. *22nd IEEE International Conference on Acoustics, Speech & Signal Processing*, vol. 1, pp. 359–362, Munich, Germany, April 1997.

Griffiths, L.J. & Jim, C.W. (1982). An alternative approach to linearly constrained adaptive beamforming. *IEEE Transaction on Antennas & Propagation*, vol. 30, no. 1, pp. 27–34.

Guerin, A. et al. (2003). A two-sensor noise reduction system: applications for hands-free car kit. *EURASIP Journal on Applied Signal Processing*, pp. 1125–1134.




Itakura, F. (1975). Minimum prediction residual principle applied to speech recognition. *IEEE Transaction on Audio Speech & Signal Processing*, vol. 23, pp. 67–72.

ITU-T, Recommendation P.800 (1996). Methods for subjective determination of transmission quality. *International Telecommunication Union Radiocommunication Assembly*.

Kaneda, Y. & Tohyama, M. (1984). Noise suppression signal processing using 2-point received signal. *Electronics and Communications in Japan*, vol. 67A, pp. 19–28.

Le Bouquin-Jannès, R. et al. (1997). Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator. *IEEE Transaction on Speech & Audio Processing*, vol. 5, pp. 484–487.

Lefkimmiatis, S. & Maragos, P. (2007). A generalized estimation approach for linear and nonlinear microphone array post-filters. *Speech Communication*, vol. 49, pp. 657–666.

Maj, J.B. et al. (2006). Comparison of adaptive noise reduction algorithms in dual microphone hearing aids. *Speech Communication*, vol. 48, no. 8, pp. 957–970.

Marro, C. et al. (1998). Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. *IEEE Transaction on Speech & Audio Processing*, vol. 6, no. 3, pp. 240–259.

Martin, R. (2001). Noise power spectral estimation based on optimal smoothing and minimum statistics. *IEEE Transaction on Signal & Audio Processing*, vol. 9, pp. 504–512.

Martin, R. (2006). Bias compensation methods for minimum statistics noise power spectral density estimation. *Signal Processing*, vol. 86, no. 6, pp. 1215–1229.

Mauler, D. & Martin, R. (2006). Noise power spectral density estimation on highly correlated data. *10th International Workshop on Acoustic, Echo & Noise Control*, Paris, France, September 2006.

McCowan, I.A. & Bourlard, H. (2003). Microphone array post-filter based on noise field coherence. *IEEE Transaction on Speech & Audio Processing*, vol. 11, no. 6, pp. 709–716.

Mittal, U. & Phamdo, N. (2000). Signal/Noise KLT based approach for enhancing speech degraded by colored noise. *IEEE Transaction on Speech & Audio Processing*, vol. 8, no. 2, pp. 159–167.

Nilsson, M. et al. (1994). Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. *Journal of the Acoustical Society of America*, vol. 95, no. 2, pp. 1085–1099.

O'Shaughnessy, D. (2000). *Speech Communications, Human and Machine*, IEEE Press, ISBN 0-7803-3449-3, New York, USA.

Quakenbush, S. et al. (1988). Objective Measures of Speech Quality. Englewood Cliffs, Prentice-Hall, ISBN/ISSN 0136290566, 9780136290568.

Simmer, K.U. & Wasiljeff, A. (1992). Adaptive microphone arrays for noise suppression in the frequency domain. *Second COST229 Workshop on Adaptive Algorithms in Communications*, pp. 185–194, Bordeaux, France, October 1992.

Simmer, K.U. et al. (1994). Suppression of coherent and incoherent noise using a microphone array. *Annales des Télécommunications*, vol. 49, pp. 439–446.

Zelinski, R. (1988). A microphone array with adaptive post-filtering for noise reduction in reverberant rooms. *13th IEEE International Conference on Acoustics, Speech & Signal Processing*, vol. 5, pp. 2578–2581, NY, USA, April 1988.

Zelinski, R. (1990). Noise reduction based on microphone array with LMS adaptive post-filtering. *Electronic Letters*, vol. 26, no. 24, pp. 2036–2581.

Zhang, X. & Jia, Y. (2005). A soft decision based noise cross power spectral density estimation for two-microphone speech enhancement systems. *IEEE International Conference on Acoustics, Speech & Signal Processing*, vol. 1, pp. I/813–16, Philadelphia, USA, March 2005.

**3**

## **Mathematical Modeling of Speech Production and Its Application to Noise Cancellation**

N. R. Raajan<sup>1</sup>, T. R. Sivaramakrishnan<sup>1</sup> and Y. Venkatramani<sup>2</sup>

*<sup>1</sup>School of Electrical and Electronics Engineering, SASTRA University, Thanjore*

*<sup>2</sup>Saranathan College of Engineering, Trichy*

*India*

#### **1. Introduction**

Sound emanates from three processes: the twisting of nerves or wires, the beating of membranes, or the blowing of air through holes. The human voice mechanism is different, as it produces speech in different languages and with different feelings under a control mechanism, the brain. According to Indian thought, the soul (Atma) associates with the brain (budhi), which in turn directs the heart (manas). The heart (manas), under the influence of the brain (budhi), then stimulates the jathagani (simulator). The jathagani stimulates the udana vata, and finally the udana vata (intuition) produces speech. The voice with which we speak has two components, namely **1) Dhwanyaatmaka sabdas** and **2) VarNaatmaka sabdas**.

Dhwanyaatmaka sabdas (fricative sounds) are produced as sounds without modification. These sounds are modified after they come out of the vocal cords into the pharynx and mouth. Here, through different types of movements of the pharynx, palate, tongue, cheeks and lips, various syllables and words are produced. The production of speech is effected by the action of three areas of the cerebral cortex, viz. 1) audio-sensory, 2) audio-psychic and 3) audio-motor. Simply put, **Dhwanyaatmaka** (fricative sound) is, for example, the sound produced by the beat of a drum or the ringing of a bell. **VarNaatmaka** (plosive sound) is the sound produced by the vocal organs, namely the throat, palate, etc., for example the sound of the letters ka, kha, etc.

dhvani visheSasahakrta kanThataalva | bhighaata janyashca varNaatmaka || shabdaartha Ratnaakara ||

The block diagram in Fig. 1 shows the complete process of producing and perceiving speech, from the formulation of a message in the brain of a talker, to the creation of the speech signal, and finally to the understanding of the message by a listener. The process starts in the upper left as a message represented somehow in the brain of the speaker. The message information can be thought of as having a number of different representations during the process of speech production.

For example the message could be represented initially as English text. In order to *speak* the message, the talker implicitly converts the text into a symbolic representation of the sequence


of sounds corresponding to the spoken version of the text. This step, called the language code generator in the block diagram (it is carried out by the bhudhi (brain), which converts text to speech), converts text symbols to phonetic symbols (along with stress and durational information) that describe the basic sounds of a spoken version of the message and the manner (i.e., the speed and emphasis) in which the sounds are intended to be produced. The sounds are labeled with phonetic symbols using a computer-keyboard-friendly code called ARPAbet. Thus, the text *should we chase* is represented phonetically (in ARPAbet symbols) as [SH UH D W IY CH EY S]. The third step in the speech production process is the conversion to *neuromuscular controls*, i.e., the set of control signals that direct the neuromuscular system to move the speech articulators, namely the tongue, lips, teeth, jaw and velum, in a manner that is consistent with the sounds of the desired spoken message and with the desired degree of emphasis. The end result of the neuromuscular controls step is a set of articulatory motions (continuous control) that cause the vocal tract articulators to move in a prescribed manner in order to create the desired sounds. Finally, the last step in the speech production process is the *vocal tract system*, which physically creates the necessary sound sources and the appropriate vocal tract shapes over time so as to create an acoustic waveform that encodes the information in the desired message into the speech signal.

To determine the rate of information flow during speech production, assume that there are about 32 symbols (letters) in the language (in English there are 26 letters, but if we include simple punctuation we get a count closer to 32 = 2<sup>5</sup> symbols). Furthermore, the rate of speaking for most people is about 10 symbols per second (somewhat on the high side, but still acceptable for a rough information rate estimate). Hence, assuming independent letters as a simple approximation, we estimate the base information rate of the text message as about 50 bps (5 bits per symbol times 10 symbols per second). At the second stage of the process, where the text representation is converted into phonemes and prosody (e.g., pitch and stress) markers, the information rate is estimated to increase by a factor of 4 to about 200 bps. For example, the ARPAbet phonetic symbol set used to label the speech sounds contains approximately 64 = 2<sup>6</sup> symbols, or about 6 bits/phoneme (again a rough approximation assuming independence of phonemes). In the example above there are 8 phonemes in approximately 600 ms. This leads to an estimate of 8 × 6/0.6 = 80 bps. Additional information required to describe prosodic features of the signal (e.g., duration, pitch, loudness) could easily add 100 bps to the total information rate for a message encoded as a speech signal.
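The rough estimates above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming equally likely and independent symbols as the text does; the function names are illustrative and not part of the original chapter.

```python
import math


def bits_per_symbol(alphabet_size: int) -> float:
    """Bits per symbol for equally likely, independent symbols."""
    return math.log2(alphabet_size)


def information_rate(alphabet_size: int, symbols_per_second: float) -> float:
    """Information rate in bits per second (bps)."""
    return bits_per_symbol(alphabet_size) * symbols_per_second


# Text stage: about 32 symbols at about 10 symbols per second.
text_bps = information_rate(32, 10.0)

# Phoneme stage: about 64 ARPAbet symbols, 8 phonemes in roughly 0.6 s.
phoneme_bps = information_rate(64, 8 / 0.6)

# Prosody is estimated in the text to add roughly 100 bps.
total_bps = phoneme_bps + 100.0

print(f"text: {text_bps:.0f} bps")
print(f"phonemes: {phoneme_bps:.0f} bps")
print(f"phonemes + prosody: {total_bps:.0f} bps")
```

The last figure, about 180 bps, is what the text rounds to "about 200 bps".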


Fig. 1. Block Diagram

The information representations for the first two stages in the speech signal are discrete so we can readily estimate the rate of information flow with some simple assumptions. For the next stage in the speech production part of the speech chain, the representation becomes continuous (in the form of control signals for articulatory motion). If they could be measured, we could estimate the spectral bandwidth of these control signals and appropriately sample and quantize these signals to obtain equivalent digital signals for which the data rate could be estimated. The articulators move relatively slowly compared to the time variation of the resulting acoustic waveform. Estimates of bandwidth and required accuracy suggest that the total data rate of the sampled articulatory control signals is about 2000 bps. Thus, the original text message is represented by a set of continuously varying signals whose digital representation requires a much higher data rate than the information rate that we estimated for transmission of the message as a speech signal.

Finally, as we will see later, the data rate of the digitized speech waveform at the end of the speech production part of the speech chain can be anywhere from 64,000 to more than 700,000 bps. We arrive at such numbers by examining the sampling rate and quantization required to represent the speech signal with a desired perceptual fidelity. For example, *telephone quality* requires that a bandwidth of 0 to 4 kHz be preserved, implying a sampling rate of 8000 samples/s. Each sample can be quantized with 8 bits on a log scale, resulting in a bit rate of 64,000 bps. This representation is highly intelligible (i.e., humans can readily extract the message from it), but to most listeners it will sound different from the original speech signal uttered by the talker. On the other hand, the speech waveform can be represented with *CD quality* using a sampling rate of 44,100 samples/s with 16-bit samples, or a data rate of 705,600 bps. In this case, the reproduced acoustic signal will be virtually indistinguishable from the original speech signal. As we move from text to speech waveform through the speech chain, the result is an encoding of the message that can be effectively transmitted by acoustic wave propagation and robustly decoded by the hearing mechanism of a listener. The above analysis of data rates shows that as we move from text to sampled speech waveform, the data rate can increase by a factor of 10,000. Part of this extra information represents characteristics of the talker such as emotional state, speech mannerisms, accent, etc., but much of it is due to the inefficiency of simply sampling and finely quantizing analog signals. Thus, motivated by an awareness of the low intrinsic information rate of speech, a central theme of much of digital speech processing is to obtain a digital representation with a lower data rate than that of the sampled waveform.
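The bit rates quoted above follow directly from the sampling rate and the number of bits per sample. The sketch below is illustrative only; the function names are assumptions, and the 50 bps text rate is the rough estimate made earlier in the chapter.

```python
def nyquist_rate(bandwidth_hz: float) -> float:
    """Minimum sampling rate (samples/s) that preserves the given bandwidth."""
    return 2.0 * bandwidth_hz


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate (bps) of a single-channel sampled and quantized waveform."""
    return sample_rate_hz * bits_per_sample


TEXT_BPS = 50  # rough information rate of the underlying text message

telephone_bps = pcm_bit_rate(int(nyquist_rate(4000)), 8)  # 8000 samples/s, 8 bits
cd_bps = pcm_bit_rate(44100, 16)                          # 44,100 samples/s, 16 bits

print(f"telephone quality: {telephone_bps} bps, {telephone_bps // TEXT_BPS}x the text rate")
print(f"CD quality: {cd_bps} bps, {cd_bps // TEXT_BPS}x the text rate")
```

With these numbers the CD-quality waveform carries roughly four orders of magnitude more bits per second than the text it encodes, which is the factor of about 10,000 mentioned in the text.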

One of the features which has bothered researchers in the area of speech synthesis in the past has been voicing. We discuss this here because it is a good example of how failure to understand the differences between abstract and physical modeling can lead to disproportionate problems (Keating 1984). The difficulty has arisen because of the nonlinearity of the correlation between the cognitive phonological voicing and how the feature is rendered phonetically. Phonological voicing is a distinctive feature in that it is a parameter of phonological segments, the presence or absence of which is able to change one underlying segment into another. For example, the English alveolar stop /d/ is [+voice] (has voicing) and differs on this feature from the alveolar stop /t/, which is [-voice] (does not have voicing). Like all phonological distinctive features, the representation is binary, meaning in this case that [voice] is either present or absent in any one segment.


Fig. 2. Anatomy of vocal fold

The most frequent phonetic parameter to correlate with phonological voicing is vocal cord vibration. The vocal cords usually vibrate when the underlying plan is to produce a [+voice] sound, but usually do not when the underlying plan is to produce a [-voice] sound.

Many synthesis models assume constant vocal-cord vibration for voiced segments, but it is quite clear that the binary distinction of vocal-cord vibration vs. no vocal-cord vibration is not accurate. Vocal-cord vibration can begin abruptly (as when a glottal stop onset makes this possible; singers regularly do this), gradually (the usual case), or at some point during the phone, although the phone may be phonologically voiced. Similarly, for phonologically voiceless segments it is certainly not the case that on every occasion there is no vocal-cord vibration present at some point during the phone. We know of no model which sets out the conditions under which these variants occur.

Phonological characterizations of segments should not be considered as though they were phonetic, and sets of acoustical features should not be given one-to-one correlation with phonological features. More often than not the correlation is neither linear nor, apparently, consistent, though it may yet turn out to be consistent in some respects. Phonology and phonetics cannot be linked simply by using phonological terms within the phonetic domain, such as the common transfer of the term voicing between the two levels. Abstract voicing is very different from physical voicing, which is why we consistently use different terms for the two. The basis of the terminology is different for the two levels, and it is bad science to equate the two so directly.

A major problem in speech processing is to represent the shape and characteristics of the vocal tract. This task is normally done by using an acoustic tube model, based on the calculation of the area function. A mathematical model of the vocal fold has been obtained as part of a new approach to noise cancellation.

#### **2. The physics of sound production**

Speech is the unique signal generated by the human vocal apparatus. Air from the lungs is forced through the vocal tract, generating acoustic waves that are radiated from the lips as a pressure field. The physics of this process is well understood, giving us important insights into speech communication.

Fig. 3. Articulatory model

The rudiments of speech generation are given in the next two sections. Thorough treatments of this important subject can be found in [Flanagan] and [Rabiner and Schafer].

#### **2.1 Development of speech**

For the first few months of his life, a young child goes on hearing the words spoken by the persons around him. Suppose he has heard the word 'AMMA' spoken several times by his parents. He then goes on thinking about the production of that word with his audio-psychic area and tries to reproduce it with different movements of his lips, tongue, etc. These movements are effected by his audio-motor area, and after a few trials the child is able to reproduce the word. Speech is nothing but a modified expiratory act, produced while the expiratory air vibrates the vocal cords of the larynx and altered by the movements of different structures such as the tongue and lips.

#### **2.2 The human vocal apparatus**

Fig. 3 shows a representation of the midsagittal section of the human vocal tract [Coker]. In this model, the cross-sectional area of the oral cavity A(x), from the glottis, x = 0, to the lips, x = L, is determined by five parameters: *a*<sub>1</sub>, tongue body height; *a*<sub>2</sub>, anterior/posterior position of the tongue body; *a*<sub>3</sub>, tongue tip height; *a*<sub>4</sub>, mouth opening; and *a*<sub>5</sub>, pharyngeal opening. In addition, a sixth parameter, *a*<sub>6</sub>, is used to additively alter the nominal 17-cm vocal tract length. The articulatory vector **a** is (*a*<sub>1</sub>, *a*<sub>2</sub>, ..., *a*<sub>6</sub>).

Fig. 4. The acoustic tube model of the vocal tract and its area function

The vocal tract model has three components: an oral cavity, a glottal source, and an acoustic impedance at the lips. We shall consider them singly first and then in combination. As is commonly done, we assume that the behavior of the oral cavity is that of a lossless acoustic tube of slowly varying (in time and space) cross-sectional area, A(x), in which plane waves propagate in one dimension (see Fig. 4). [Sondhi] and [Portnoff] have shown that under these assumptions, the pressure, p(x, t), and volume velocity, u(x, t), satisfy

$$-\frac{\partial p}{\partial x} = \frac{\rho}{A(x, t)} \frac{\partial u}{\partial t} \tag{1}$$

Fig. 5. The discretized acoustic tube model of the vocal tract


and

$$-\frac{\partial u}{\partial x} = \frac{A(x, t)}{\rho c^2} \frac{\partial p}{\partial t} \tag{2}$$

which express Newton's law and conservation of mass, respectively. In these equations *ρ* is the equilibrium density of the air in the tube and c is the corresponding velocity of sound. Differentiating (eq:1) and (eq:2) with respect to space and time, respectively, and then eliminating the mixed partials, we get the well-known Webster equation [Webster] for pressure,

$$\frac{\partial^2 p}{\partial x^2} + \frac{1}{A(x, t)} \frac{\partial p}{\partial x} \frac{\partial A}{\partial x} = \frac{1}{c^2} \frac{\partial^2 p}{\partial t^2} \tag{3}$$
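For readers who want the intermediate steps, one way to obtain the Webster equation for pressure from the two first-order equations is sketched below. It assumes, as an approximation not spelled out in the text, that the area varies slowly enough in time for $\partial A/\partial t$ to be neglected.

$$
\begin{aligned}
\frac{\partial u}{\partial t} &= -\frac{A}{\rho}\frac{\partial p}{\partial x}
&& \text{(momentum equation solved for } \partial u/\partial t\text{)} \\
\frac{\partial^2 u}{\partial x\,\partial t} &= -\frac{1}{\rho}\left(\frac{\partial A}{\partial x}\frac{\partial p}{\partial x} + A\frac{\partial^2 p}{\partial x^2}\right)
&& \text{(differentiate with respect to } x\text{)} \\
\frac{\partial^2 u}{\partial t\,\partial x} &= -\frac{A}{\rho c^2}\frac{\partial^2 p}{\partial t^2}
&& \text{(continuity equation differentiated with respect to } t\text{)} \\
\frac{\partial^2 p}{\partial x^2} + \frac{1}{A}\frac{\partial A}{\partial x}\frac{\partial p}{\partial x} &= \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2}
&& \text{(equate the mixed partials and divide by } A/\rho\text{)}
\end{aligned}
$$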

The eigenvalues of (eq:3) are taken as formant frequencies. It is preferable to use the Webster equation (in volume velocity) to compute a sinusoidal steady-state transfer function for the acoustic tube including the effects of thermal, viscous, and wall losses.
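As a simple illustration of formants as resonances, consider the special case of a uniform lossless tube, for which the area term in the Webster equation vanishes. The sketch below assumes a tube closed at the glottis and open at the lips (boundary conditions chosen here for illustration, not taken from the chapter) and a speed of sound of 340 m/s; under those assumptions the resonances are the odd multiples of c/(4L).

```python
from typing import List


def uniform_tube_formants(length_m: float, speed_of_sound: float, count: int) -> List[float]:
    """Resonances (Hz) of a uniform lossless tube, closed at one end and open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m) for n in range(1, count + 1)]


# Nominal 17-cm vocal tract; 340 m/s is an assumed speed of sound.
formants = uniform_tube_formants(0.17, 340.0, 3)
print([round(f) for f in formants])
```

For the nominal 17-cm tract this gives resonances near 500, 1500 and 2500 Hz, the familiar formant pattern of a neutral vowel.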

To do so we let *p*(*x*, *t*) = *P*(*x*, *ω*)*e*<sup>*jωt*</sup> and *u*(*x*, *t*) = *U*(*x*, *ω*)*e*<sup>*jωt*</sup>, where *ω* is angular frequency. When p and u have this form, (eq:1) and (eq:2) become (cf. [Rabiner and Schafer])

$$-\frac{dP}{dx} = Z(x, \omega)\, U(x, \omega) \tag{4}$$

and

$$-\frac{dU}{dx} = Y(x, \omega)\, P(x, \omega) \tag{5}$$

respectively. In order to account for the losses we define *Z*(*x*, *ω*) and *Y*(*x*, *ω*) to be the generalized acoustic impedance and admittance per unit length, respectively. Differentiating (eq:5) with respect to x and substituting for −*dP*/*dx* and *P* from (eq:4) and (eq:5), respectively, we obtain

$$\frac{d^2 U}{dx^2} = \frac{1}{Y(x, \omega)} \frac{dU}{dx} \frac{dY}{dx} + Y(x, \omega) Z(x, \omega) U(x, \omega) \tag{6}$$

This is recognized as the "*lossy*" Webster equation for volume velocity. The sinusoidal steady-state transfer function of the vocal tract can be computed by discretizing (eq:6) in space and obtaining approximate solutions to the resulting difference equation for a sequence of frequencies. Let us write *U*<sub>*i*</sub><sup>*k*</sup> to signify *U*(*i*Δ*x*, *k*Δ*ω*), where the spatial discretization assumes Δ*x* = *L*/*n* with i = 0 at the glottis and i = n at the lips, as is shown in Fig. 5. Similarly, we choose Δ*ω* = Ω/*N* and let 0 ≤ k ≤ N. We shall define *A*<sub>*i*</sub>, *Y*<sub>*i*</sub><sup>*k*</sup>, and *Z*<sub>*i*</sub><sup>*k*</sup> in an analogous manner. Approximating second derivatives by second central differences and first derivatives by first backward differences, the finite difference representation of (eq:6) is given by (eq:7)

$$U_{i+1}^{k} = U_{i}^{k} \left( 3 + (\Delta x)^{2} Z_{i}^{k} Y_{i}^{k} - \frac{Y_{i-1}^{k}}{Y_{i}^{k}} \right) + U_{i-1}^{k} \left( \frac{Y_{i-1}^{k}}{Y_{i}^{k}} - 2 \right) \tag{7}$$
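The recursion (7) marches the volume velocity from the glottis to the lips at one frequency. The sketch below is a minimal illustration, not the authors' implementation: it applies the recursion to a uniform lossless tube, where Z = jωρ/A and Y = jωA/(ρc²), and starts from two values taken from the analytic cosine solution. All numerical values and names are assumptions chosen for the check.

```python
import math
from typing import List, Sequence


def march_volume_velocity(u0: complex, u1: complex,
                          z: Sequence[complex], y: Sequence[complex],
                          dx: float) -> List[complex]:
    """Apply recursion (7) from the glottis (i = 0) to the lips (i = n).

    z[i] and y[i] are the impedance and admittance per unit length at
    x = i * dx for one frequency; u0 and u1 are the two starting values.
    """
    n = len(y) - 1
    u = [complex(u0), complex(u1)]
    for i in range(1, n):
        ratio = y[i - 1] / y[i]
        u_next = u[i] * (3 + dx**2 * z[i] * y[i] - ratio) + u[i - 1] * (ratio - 2)
        u.append(u_next)
    return u


# Assumed illustrative values for a uniform lossless tube (SI units).
RHO = 1.2          # air density, kg/m^3
C = 340.0          # speed of sound, m/s
AREA = 5e-4        # constant cross-sectional area, m^2
LENGTH = 0.17      # tube length, m
N_SECTIONS = 170   # number of spatial steps

dx = LENGTH / N_SECTIONS
omega = 2.0 * math.pi * 250.0
z = [1j * omega * RHO / AREA] * (N_SECTIONS + 1)
y = [1j * omega * AREA / (RHO * C**2)] * (N_SECTIONS + 1)

# Start from the analytic solution U(x) = cos(omega * x / c).
u = march_volume_velocity(1.0, math.cos(omega / C * dx), z, y, dx)
transfer = u[-1] / u[0]
print(abs(transfer), math.cos(omega / C * LENGTH))
```

For this uniform tube the ratio of successive admittances is 1 and the recursion reduces to the standard second-difference form of the wave equation, so the marched value at the lips should track cos(ωL/c) closely.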

Given suitable values for *U*<sub>0</sub><sup>*k*</sup> and *U*<sub>1</sub><sup>*k*</sup> for 0 ≤ *k* ≤ *N*, we can obtain the desired transfer functions from (eq:7). We must find appropriate expressions for Y and Z to account for the losses. Losses arise from thermal effects and viscosity, and primarily from wall vibrations. A detailed treatment of the wall losses is found in [Portnoff] and is summarized by [Rabiner and Schafer]. Portnoff assumes that the walls are displaced by *ξ*(*x*, *t*) in a direction normal to the flow due to the pressure at x only. The vocal tract walls are modeled by a damped spring-mass system for which the relationship between pressure and displacement is

$$p(x, t) = M \frac{\partial^2 \xi}{\partial t^2} + b \frac{\partial \xi}{\partial t} + k(x) \xi(x, t) \tag{8}$$

where M, b, and k(x) are the unit length wall mass, damping coefficient, and spring constant, respectively. The displacement of the walls is assumed to perturb the area function about a neutral position according to

$$A(x, t) = A(x) + S(x) \xi(x, t) \tag{9}$$

k(x) ≡ 0 and the ratio of circumference to area is constant. In fact, Portnoff used k(x) = 0 and this assumption is reasonable. The third term in the right-hand side of (eq:16) is of the same form as (eq:11) and (eq:12) (under the assumption that the ratio of S to A is constant) by noting that

$$(j\omega)^{\frac{1}{2}} = (1 + j) \left( \frac{\omega}{2} \right)^{\frac{1}{2}} \tag{17}$$

#### **2.3 Boundary conditions**

With a description of the vocal tract in hand, we can turn our attention to the boundary conditions. Following [Flanagan], the glottal excitation has been assumed to be a constant volume source with an asymmetric triangular waveform of amplitude V. [Dunn et al.] have analyzed such a source in detail. What is relevant is that the spectral envelope decreases as the square of frequency. We have therefore taken the glottal source *U*<sub>*g*</sub>(*ω*) to be

$$U_g(\omega) = \frac{V}{\omega^2} \tag{18}$$

For the boundary condition at the mouth, the well-known relationship between sinusoidal steady-state pressure and volume velocity ([Portnoff] and [Rabiner and Schafer]) is used:

$$P(L, \omega) = Z_r(\omega) U(L, \omega) \tag{19}$$

Here the radiation impedance *Z*<sub>*r*</sub> is taken as that of a piston in an infinite plane baffle, the behavior of which is well approximated by

$$Z_r(\omega) = \frac{j\omega L_r}{1 + \frac{j\omega L_r}{R}} \tag{20}$$

Values of the constants which are appropriate for the vocal tract model are given by [Flanagan] as

$$R = \frac{128}{9\pi^2} \tag{21}$$

and

$$L_r = \frac{8 \left[ A(L)/\pi \right]^{\frac{1}{2}}}{3\pi c} \tag{22}$$

It is convenient to solve (eq:6) with its boundary conditions (eq:19) and (eq:20) by solving a related initial-value problem for the transfer function

$$H(\omega) = U(L, \omega)/U(0, \omega) \tag{23}$$

$$-\frac{dU}{dx}\bigg|_{x=L} = \frac{A(L)}{\rho c^2} (j\omega) P(L, \omega) \tag{24}$$

From which the frequency domain difference equation is

$$-\frac{U_n^k - U_{n-1}^k}{\Delta x} = jk\Delta\omega \frac{A_n}{\rho c^2} P_n^k \tag{25}$$

where A(x) and S(x) are the neutral area and circumference, respectively. By substituting (eq:1) into (eq:2), ignoring higher-order terms, and transforming into the frequency domain, Portnoff observes that the effect of vibrating walls is to add a term to the acoustic admittance in (eq:5), where

$$Y\_w(x,\omega) = j\omega S(x) \left( \frac{[k(x) - \omega^2 M] - j\omega b}{[k(x) - \omega^2 M]^2 + \omega^2 b^2} \right) \tag{10}$$

The other losses that we wish to consider are those arising from viscous friction and thermal conduction. The former can be accounted for by adding a real quantity *Zv* to the acoustic impedance in (eq:4),

$$Z\_v(x,\omega) = \frac{S(x)}{A^2(x)} \left(\frac{\omega\rho\mu}{2}\right)^{\frac{1}{2}} \tag{11}$$

Here *μ* is the viscosity of air. The thermal losses have an effect which is described by adding a real quantity *YT* to the acoustic admittance in (eq:5), where

$$Y\_T(x,\omega) = \frac{S(x)(\eta - 1)}{\rho c^2} \left(\frac{\lambda \omega}{2 C\_p \rho}\right)^{\frac{1}{2}} \tag{12}$$

Here *λ* is the coefficient of heat conduction, *η* is the adiabatic constant, and *Cp* is the heat capacity. All the constants are, of course, for air at the conditions of temperature, pressure, and humidity found in the vocal tract. In view of (eq:1), (eq:2), (eq:10), (eq:11) and (eq:12) it is possible to set

$$Z(x,\omega) = \frac{j\omega\rho}{A(x)} + Z\_v(x,\omega) \tag{13}$$

and

$$Y(\mathbf{x},\omega) = \frac{j\omega A(\mathbf{x})}{\rho c^2} + Y\_w(\mathbf{x},\omega) + Y\_T(\mathbf{x},\omega) \tag{14}$$

There are two disadvantages to this approach. First, (eq:13) and (eq:14) are computationally expensive to evaluate. Second, (eq:10) requires values for some physical constants (saliva, phlegm, tonsils, etc.) of the tissue forming the vocal tract walls. Estimates of these constants are available in [Rabiner, L.R. and Schafer, R.W.] and [Webster]. A computationally simpler empirical model of the losses, which agrees with the measurements, has been proposed by [Sondhi], in which

$$Z(\mathbf{x},\omega) = \frac{j\omega\rho}{A(\mathbf{x})} \tag{15}$$

and

$$Y(x,\omega) = \frac{A(x)}{\rho c^2} \left[ j\omega + \frac{\omega\_0^2}{\alpha + j\omega} + (\beta j\omega)^{\frac{1}{2}} \right] \tag{16}$$

Sondhi [10] has chosen values for the constants, *ω*<sub>0</sub> = 406*π*, *α* = 130*π*, *β* = 4, which he then shows give good agreement with measured formant bandwidths. Moreover, the form of the model agrees with the results of Portnoff, which becomes clear when we observe that *Yw*(*x*, *ω*) in (eq:10) will have the same form as the second term on the right-hand side of (eq:16) if *k*(*x*) ≡ 0 and the ratio of circumference to area is constant. In fact, Portnoff used *k*(*x*) = 0 and this assumption is reasonable. The third term on the right-hand side of (eq:16) is of the same form as (eq:11) and (eq:12) (under the assumption that the ratio of S to A is constant), as can be seen by noting that

$$(j\omega)^{\frac{1}{2}} = (1+j)\left(\frac{\omega}{2}\right)^{\frac{1}{2}}\tag{17}$$
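The empirical loss model of (eq:15) and (eq:16) is straightforward to evaluate numerically. The following is a minimal sketch, not the authors' implementation. It assumes CGS values for the air density and the speed of sound (the text does not give them), an illustrative area of 5 cm², and it applies the square root to the third term only, which is the reading under which the comparison with (eq:11) and (eq:12) through (eq:17) holds. The function names are hypothetical.

```python
import numpy as np

# Constants quoted in the text for Sondhi's empirical loss model.
OMEGA0 = 406 * np.pi
ALPHA = 130 * np.pi
BETA = 4.0

# Assumed air properties in CGS units (not given in the text).
RHO = 1.14e-3   # g/cm^3
C = 3.5e4       # cm/s


def series_impedance(omega, area):
    """Z(x, w) per unit length, eq. (15)."""
    return 1j * omega * RHO / area


def shunt_admittance(omega, area):
    """Y(x, w) per unit length, eq. (16); square root on the third term only."""
    jw = 1j * omega
    return area / (RHO * C**2) * (jw + OMEGA0**2 / (ALPHA + jw) + np.sqrt(BETA * jw))


omega = 2 * np.pi * np.array([100.0, 500.0, 1000.0, 3000.0])
Y = shunt_admittance(omega, area=5.0)

# Numerical check of identity (17) using the principal square root.
identity_ok = np.allclose(np.sqrt(1j * omega), (1 + 1j) * np.sqrt(omega / 2))
print(identity_ok, Y.real)
```

The real part of `Y` is the loss (conductance) contributed by the second and third terms; with both removed, `Y` reduces to the lossless admittance *jωA*/(*ρc*²).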

#### **2.3 Boundary conditions**


With a description of the vocal tract in hand, we can turn our attention to the boundary conditions. Following [Flanagan], the glottal excitation has been assumed to be a constant volume source with an asymmetric triangular waveform of amplitude V. [Dunn et al.] have analyzed such a source in detail. What is relevant is that the spectral envelope decreases as the square of frequency. We have therefore taken the glottal source *Ug*(*ω*) to be

$$U\_g(\omega) = \frac{V}{\omega^2} \tag{18}$$
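As a quick worked check of what (eq:18) implies, doubling the frequency reduces the source amplitude by a factor of four, so the spectral envelope falls by about 12 dB per octave:

$$\frac{U\_g(2\omega)}{U\_g(\omega)} = \frac{V/(2\omega)^2}{V/\omega^2} = \frac{1}{4}, \qquad 20\log\_{10}\frac{1}{4} \approx -12.04\ \text{dB}$$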

For the boundary condition at the mouth, the well-known relationship between sinusoidal steady-state pressure and volume velocity [Portnoff], [Rabiner and Schafer] is used:

$$P(L,\omega) = Z\_r(\omega)\, U(L,\omega) \tag{19}$$

Here the radiation impedance *Zr* is taken as that of a piston in an infinite plane baffle, the behavior of which is well approximated by

$$Z\_r(\omega) = \frac{j\omega L\_r}{1 + \dfrac{j\omega L\_r}{R}} \tag{20}$$

Values of the constants which are appropriate for the vocal tract model are given by [Flanagan] as

$$R = \frac{128}{9\pi^2} \tag{21}$$

and

$$L\_r = 8\left[A(L)/\pi\right]^{\frac{1}{2}}/3\pi c \tag{22}$$

It is convenient to solve (eq:6) with its boundary conditions (eq:19) and (eq:20) by solving a related initial-value problem for the transfer function

$$H(\omega) = U(L,\omega) / U(0,\omega) \tag{23}$$

with the condition at *x* = *L*

$$-\left.\frac{dU}{dx}\right|\_{x=L} = \frac{A(L)}{\rho c^2} (j\omega) P(L,\omega) \tag{24}$$

from which the frequency-domain difference equation is

$$-\frac{U\_n^k - U\_{n-1}^k}{\Delta x} = jk\Delta\omega \frac{A\_n}{\rho c^2} P\_n^k \tag{25}$$
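The initial-value formulation can be illustrated by marching a finite-difference form of the tube equations from the lips back to the glottis. The sketch below is illustrative only and rests on several assumptions that are not in the text: a lossless uniform tube, a companion difference equation for the pressure built from the series impedance *jωρ*/*A*, CGS values for *ρ* and *c*, and the radiation impedance of (eq:20) to (eq:22) being normalized by *ρc*/*A*(*L*). Here *Zr* is written as a resistance *R* in parallel with an inductance *Lr*, that is *jωLr*/(1 + *jωLr*/*R*). The function names are hypothetical.

```python
import numpy as np

RHO = 1.14e-3   # air density, g/cm^3 (assumed)
C = 3.5e4       # speed of sound, cm/s (assumed)


def radiation_impedance(omega, lip_area):
    """Normalized Z_r of eqs. (20)-(22): piston in an infinite baffle."""
    R = 128.0 / (9.0 * np.pi**2)
    L_r = 8.0 * np.sqrt(lip_area / np.pi) / (3.0 * np.pi * C)
    return 1j * omega * L_r / (1.0 + 1j * omega * L_r / R)


def transfer_function(omega, areas, dx):
    """H(w) = U(L, w) / U(0, w), marching from x = L back to x = 0."""
    omega = np.asarray(omega, dtype=float)
    # Initial values at the lips: unit volume velocity, pressure from eq. (19).
    # Assumption: Z_r is normalized by rho*c/A(L), so it is rescaled here.
    U = np.ones_like(omega, dtype=complex)
    P = (RHO * C / areas[-1]) * radiation_impedance(omega, areas[-1]) * U
    for A_n in areas[::-1]:
        Z = 1j * omega * RHO / A_n            # series impedance per unit length
        Y = 1j * omega * A_n / (RHO * C**2)   # lossless shunt admittance per unit length
        # Both updates use the values at section n, as in eq. (25).
        P, U = P + dx * Z * U, U + dx * Y * P
    return 1.0 / U


# Uniform tube, 17.5 cm long with a 5 cm^2 cross-section (illustrative values).
length, n_sections = 17.5, 2000
areas = np.full(n_sections, 5.0)
freqs = np.arange(100.0, 1000.0, 1.0)
H = transfer_function(2 * np.pi * freqs, areas, length / n_sections)
first_formant = freqs[np.argmax(np.abs(H))]
print(f"first resonance near {first_formant:.0f} Hz")
```

For these values the first resonance falls somewhat below the quarter-wavelength frequency *c*/(4*L*) = 500 Hz, because the radiation inductance acts like a small extension of the tube. Replacing `Y` with the lossy admittance of (eq:16) would broaden the peaks.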



Finally, noting (eq:21), the vocal tract output is obtained from (eq:26):

$$\begin{split} \xi(x,t) &= e^{\frac{\left(-b+\sqrt{b^2-4k(x)M}\right)t}{2M}} + e^{-\frac{\left(b+\sqrt{b^2-4k(x)M}\right)t}{2M}} - \frac{1}{\sqrt{b^2-4k(x)M}} \\ &\quad - \left(\int p(x,t)\, e^{\frac{\left(b+\sqrt{b^2-4k(x)M}\right)t}{2M}}\, dt\right) e^{-\frac{\left(b+\sqrt{b^2-4k(x)M}\right)t}{2M}} \\ &\quad + \left(\int p(x,t)\, e^{-\frac{\left(-b+\sqrt{b^2-4k(x)M}\right)t}{2M}}\, dt\right) e^{\frac{\left(-b+\sqrt{b^2-4k(x)M}\right)t}{2M}} \left(e^{-\frac{bt}{M}}\right) \end{split} \tag{26}$$

Here *p* is the pressure, *k*(*x*) the damping coefficient, *M* the mass of speech, and *ξ*(*x*, *t*) the resultant value [NRR].

In the frequency domain, combining (eq:18), (eq:19) and (eq:23), the output pressure at the frequency samples *ω* = *k*Δ*ω* is

$$\overline{P}\_k = \overline{P}(k\Delta\omega) = H(k\Delta\omega)\, U\_g(k\Delta\omega)\, Z\_r(k\Delta\omega) \tag{27}$$

#### **3. Non-stationary speech signal**

The speech signal is the solution to equation (eq:3). Since the function A(x,t) varies continuously in time, the solution p(t) is a non-stationary random function of time. Fortunately, A(x,t) is slowly time-varying with respect to p(t). That is,

$$\left|\frac{\partial A}{\partial t}\right| \ll \left|\frac{\partial p}{\partial t}\right| \tag{28}$$

Equation (eq:28) may be taken to mean that p(t) is quasi-stationary or piecewise stationary. As such, p(t) can be considered to be a sequence of intervals within each one of which p(t) is stationary. It is true that there are rapid articulatory gestures that violate (eq:28), but in general the quasi-stationary assumption is useful.

#### **4. Fluid dynamical effects**

Equation (eq:3) predicts the formation of planar acoustic waves as a result of air flowing into the vocal tract according to the boundary condition of (xyz). However, the Webster equation ignores any effects that the convective air flow may have on the function p(t).

If, instead of (eq:1) and (eq:2), we consider two-dimensional wave propagation, conservation of mass can be written as

$$\frac{\partial u\_x}{\partial x} + \frac{\partial u\_y}{\partial y} = -M^2 \frac{\partial p}{\partial t} \tag{29}$$

where M is the Mach number.

We can also include the viscous and convective effects by observing

$$\frac{\partial u\_x}{\partial t} = -\frac{\partial p}{\partial x} - \frac{\partial (u\_x u\_y)}{\partial x} + \frac{\partial}{\partial x} \left[ \frac{1}{N\_R} \left(\frac{\partial u\_x}{\partial x} + \frac{\partial u\_y}{\partial x}\right) - \overline{\mu\_x \mu\_y} \right] \tag{30}$$

$$\frac{\partial u\_y}{\partial t} = -\frac{\partial p}{\partial y} - \frac{\partial (u\_x u\_y)}{\partial y} + \frac{\partial}{\partial y} \left[ \frac{1}{N\_R} (\frac{\partial u\_x}{\partial y} + \frac{\partial u\_y}{\partial y}) - \overline{\mu\_x \mu\_y} \right] \tag{31}$$

In (eq:30) and (eq:31) the first term on the right-hand side is recognized as Newton's law expressed in (eq:1) and (eq:2). The second term represents the convective flow. The third term accounts for viscous shear and drag at Reynolds number, NR, and the last term represents turbulence.

Equations (eq:29), (eq:30) and (eq:31) are known as the normalized, two-dimensional, Reynolds-averaged Navier-Stokes equations for slightly compressible flow. These equations can be solved numerically for p(t). The solutions are slightly different from those obtained from (eq:3) due to the formation of vortices and the transfer of energy between the convective and wave-propagation components of the fluid flow. Typical solutions for the articulatory configuration of Fig. 2 are shown in eqns. (eq:5) and (eq:6). There is reason to believe that (eq:29), (eq:30) and (eq:31) provide a more faithful model of the acoustics of the vocal apparatus than the Webster equation does [11].

#### **5. Noise cancellation**


The conclusion to be drawn from the previous two sections is that information is encoded in the speech signal in its short-duration amplitude spectrum [Rabiner, L.R. and Schafer, R.W.]. This implies that by estimating the power spectrum of the speech signal as a function of time, we can identify the corresponding sequence of sounds. Because the speech signal x(t) is non-stationary it has a time-varying spectrum that can be obtained from the time-varying Fourier transform, *Xn*(*ω*). Note that x(t) is the voltage analog of the sound pressure wave, p(t), obtained by solving (eq:3).
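The idea of estimating the power spectrum as a function of time can be sketched with a short-time Fourier transform. The example below is a minimal illustration under stated assumptions: a synthetic two-tone signal stands in for speech, and the sampling rate, frame length, hop size and Hann window are choices made here, not values from the text. The function name is hypothetical.

```python
import numpy as np


def short_time_power_spectrum(x, frame_len, hop):
    """Return |X_n(w)|^2 for successive Hann-windowed frames of x."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


# Synthetic non-stationary signal: 0.5 s at 500 Hz, then 0.5 s at 2 kHz.
fs = 8000
t = np.arange(fs // 2) / fs
x = np.concatenate([np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t)])

frame_len, hop = 256, 128             # 32 ms frames with 50 % overlap
S = short_time_power_spectrum(x, frame_len, hop)
bin_hz = fs / frame_len               # spacing of the frequency bins
peak_hz = S.argmax(axis=1) * bin_hz   # dominant frequency in each frame
print(peak_hz[0], peak_hz[-1])
```

Each row of `S` is the power spectrum of one quasi-stationary frame, so the change of the dominant frequency between the first and last frames reflects the time-varying spectrum of the signal.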

#### **5.1 Algorithm**


1. Read *b*, *t*, *M*, *k*(*x*)
2. *d* = *b*<sup>2</sup> − 4 ∗ *k*(*x*) ∗ *M*
3. *z* = √*d*
4. *u* = −*b* + *z*
5. *v* = *b* + *z*
6. Get the value of *f* → function
7. Read the lower and upper limits
8. Read *n* → number of iterations
9. *h* = (*upper* − *lower*)/*n*
10. *S* = *f*(*a*), where *a* is the lower limit
11. for i = 1 : 2 : (n − 1) (odd)
12. x = a + h ∗ i
13. S = S + 4 ∗ f(x)
14. end
15. for i = 2 : 2 : (n − 2) (even)
16. x = a + h ∗ i
17. S = S + 2 ∗ f(x)
18. end
19. S = S + f(b), where b here denotes the upper limit
20. A = h ∗ S / 3
21. Repeat steps 6 to 20
22. B = h ∗ S / 3
23. Substitute the obtained values: *ξ*(*x*, *t*) = exp(0.5 ∗ u ∗ t/M) + exp(−0.5 ∗ v ∗ t/M) + (1/z) ∗ [A ∗ exp(t ∗ z/M) − B ∗ exp(−0.5 ∗ v ∗ t/M)]
24. Plot the sequence in the polar graph

Fig. 6. Speech signals and their spectrum obtained by solving the Navier-Stokes equations
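The numbered steps of the algorithm amount to two composite Simpson integrations followed by the evaluation of *ξ*(*x*, *t*). The sketch below restates them in Python rather than MATLAB and is not the authors' code. It assumes that the step size is (*upper* − *lower*)/*n* with *n* even, that the integrands for A and B are the pressure multiplied by the exponentials appearing under the integrals of (eq:26), and that the pressure is a simple cosine; the parameter values and function names are illustrative.

```python
import math


def simpson(f, lower, upper, n):
    """Composite Simpson's rule (steps 6 to 20); n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (upper - lower) / n
    s = f(lower) + f(upper)
    s += 4 * sum(f(lower + h * i) for i in range(1, n, 2))   # odd nodes
    s += 2 * sum(f(lower + h * i) for i in range(2, n, 2))   # even interior nodes
    return h * s / 3


def xi(t, b, M, k, A, B):
    """Step 23 as listed; requires b**2 > 4*k*M so that z is real."""
    z = math.sqrt(b * b - 4 * k * M)    # steps 2 and 3
    u, v = -b + z, b + z                # steps 4 and 5
    return (math.exp(0.5 * u * t / M) + math.exp(-0.5 * v * t / M)
            + (A * math.exp(t * z / M) - B * math.exp(-0.5 * v * t / M)) / z)


b, M, k = 3.0, 1.0, 2.0                 # illustrative values giving d = 1
z = math.sqrt(b * b - 4 * k * M)
u, v = -b + z, b + z


def pressure(s):
    """Hypothetical pressure p(x, t) at a fixed position x."""
    return math.cos(s)


t = 0.5
# Assumed integrands, taken from the exponentials under the integrals in eq. (26).
A = simpson(lambda s: pressure(s) * math.exp(-0.5 * u * s / M), 0.0, t, 100)
B = simpson(lambda s: pressure(s) * math.exp(0.5 * v * s / M), 0.0, t, 100)
value = xi(t, b, M, k, A, B)
print(value)
```

Simpson's rule is exact for polynomials up to degree three, which gives a simple way to check the integrator before applying it to the pressure integrals.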

The result is obtained by implementing equation (eq:26) in MATLAB. Various values of the pressure, damping coefficient and mass are considered in the implementation of the noise cancellation. The graphs are plotted for basic, mid and high damping efficiency (Fig. 7, Fig. 8 and Fig. 9, respectively). Fig. 10 shows the original signal with noise and the signal without noise obtained through equation (eq:26). From this it can be concluded that the equation can be implemented for active noise cancellation.

Fig. 7. The acoustic tube model of the vocal tract with basic damping efficiency

Fig. 8. The acoustic tube model of the vocal tract with Mid damping efficiency

Fig. 9. The acoustic tube model of the vocal tract with high damping efficiency

Fig. 10. (a) Voice + noise (b) Voice obtained through modeling

#### **6. Conclusion**

Producing an opposing signal (anti-noise) with the same amplitude as the unwanted noise but with the opposite phase yields a significant reduction in the noise level. ANC tries to eliminate sound components by adding the exact opposite sound. The level of attenuation is highly dependent on the accuracy with which the system produces the amplitude and the phase of the reductive signal (anti-noise).

The mathematical model of the vocal fold recognizes only the voice; it never creates a signal opposite to the noise. It feeds only the vocal output and not the noise, since it uses the shape and characteristics of speech.

The parameters of the clean speech sample considered for testing the algorithms were: duration 2 seconds, PCM 22.050 kHz, 8-bit mono, recorded under laboratory conditions. This speech signal is used as a benchmark for speech processing. Various noises were generated and added to the original speech signal. The SNR of the signal corrupted with the noise was 8 dB. A linear combination of the generated noise and the original signal is used as the primary input. The output SNRs of the denoised speech signal are calculated.

| | Female | Male |
|---|---|---|
| Fundamental frequency *F*<sub>0</sub> (Hz) | 207 | 119 |
| Glottal peak flow | 0.14 | 0.23 |
| Closed quotient | 0.26 | 0.39 |

Table 1. Properties of the glottal wave (normal phonation)

| Configuration | Lip thickness | Lip length |
|---|---|---|
| Basic | 0.25 | 7 |
| Long lip | 0.25 | 9 |
| Short lip | 0.25 | 5 |
| Higher opening | 0.25 | 7 |
| Tapered | 0.25 (bottom), 0.125 (free tip) | 7 |
| Thin lip | 0.125 | 7 |

Table 2. Properties of the mouth

| Background noise | Parametric background quality |
|---|---|
| High | Hissing - Fizzing |
| Mid | Rushing - Roaring |
| Low | Rumbling - Rolling |
| Buzz | Humming - Buzzing |
| Flutter | Bubbling - Percolating |
| Static | Crackling - Staticky |

Table 3. Parametric estimation of noise

| Noise | MMVF | LMS | RLS | AFA | NLMS |
|---|---|---|---|---|---|
| Acoustic | 28.341 | 23.988 | 18.729 | 22.669 | 20.146 |
| Short | 23.105 | 19.769 | 20.161 | 18.083 | 20.019 |
| White | 30.142 | 20.581 | 25.105 | 28.565 | 26.206 |
| Echo and fading | 26.231 | 21.499 | 19.281 | 20.042 | 26.165 |

Table 4. Result obtained

MMVF - Mathematical modeling of vocal fold; LMS - Least mean square; RLS - Recursive least square; AFA - Adaptive filter algorithm (adaptive RLS); NLMS - Normalized LMS
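The SNR bookkeeping described in this section (noise added to a clean sample at 8 dB, then the output SNR measured after denoising) can be sketched as follows. This is an illustrative sketch only: a 200 Hz sine stands in for the speech sample, the noise is white Gaussian, and the helper names `snr_db` and `mix_at_snr` are hypothetical. Only the 2 s duration, the 22.050 kHz rate and the 8 dB input SNR come from the text.

```python
import numpy as np


def snr_db(clean, estimate):
    """SNR of `estimate` relative to `clean`, in dB."""
    residual = estimate - clean
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(residual**2))


def mix_at_snr(clean, noise, target_db):
    """Scale `noise` so that clean + noise has the requested SNR."""
    gain = np.sqrt(np.sum(clean**2) / (np.sum(noise**2) * 10.0 ** (target_db / 10.0)))
    return clean + gain * noise


rng = np.random.default_rng(0)
fs = 22050                                 # sampling rate quoted in the text
t = np.arange(2 * fs) / fs                 # 2 s, as for the test sample
clean = np.sin(2 * np.pi * 200.0 * t)      # stand-in for the clean speech sample
noise = rng.standard_normal(clean.size)    # white noise
primary = mix_at_snr(clean, noise, 8.0)    # primary input at 8 dB SNR
input_snr = snr_db(clean, primary)
print(f"input SNR: {input_snr:.2f} dB")
```

Applying `snr_db(clean, denoised)` to the output of any of the compared methods would give a figure comparable in kind to the entries of Table 4, assuming those entries are output SNRs.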


#### **7. References**

[1] Raajan, N. R., Venkatramani, Y. & Sivaramakrishnan, T. R. *A novel approach to noise cancellation for communication devices*, Instrumentation Science and Technology, Taylor and Francis Group, Volume 37, Issue 6, pp. 720-729, 2009.

[2] Fitch, H. L. *Reclaiming temporal information after dynamic time warping*, J. Acoust. Soc. Amer. 1983, 74 (Suppl. 1), 816.

[3] Coker, C. H. *A model of articulatory dynamics and control*, Proc. IEEE, pp. 452-460, 1989.

[4] Portnoff, M. R. *A quasi-one-dimensional digital simulation for the time varying vocal tract*, Masters thesis, MIT, 1973.

[5] Rabiner, L. R. & Schafer, R. W. *Digital Processing of Speech Signals*, Prentice Hall: Englewood Cliffs, NJ, 1978.

[6] Sondhi, M. M. *Model for wave propagation in a lossy vocal tract*, pp. 1070-1075, J. Acoust. Soc. Amer. 1974, pp. 55-67.

[7] Webster, A. G. *Acoustical impedance and the theory of horns*, Proc. Nat. Acad. Sci. 1919, pp. 275-282.

### **Multi-Resolution Spectral Analysis of Vowels in Tunisian Context**

Nefissa Annabi-Elkadri, Atef Hamouda and Khaled Bsaies *URPAH Research Unit, Computer Science Department, Faculty of Sciences of Tunis, Tunis El Manar University, Tunis, Tunisia*

#### **1. Introduction**

The classic speech spectrogram shows log-magnitude amplitude (dB) versus time and frequency. The sound pressure level in dB is approximately proportional to the volume perceived by the ear. The classic speech sonagram offers a single integration time, which is the length of the window. It implements a uniform bandpass filter bank: the spectral samples are regularly spaced and correspond to equal bandwidths. The choice of the window length determines the time-frequency resolution for all frequencies of the sonagram. The narrower the window, the better the time resolution and the worse the frequency resolution. This implies that the display resolution of formants, voicing and frictions at low frequencies is poorer than the resolution of the bursts at high frequencies, and vice versa. It is therefore necessary to choose the window appropriately for the signal.
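The trade-off between time and frequency resolution can be made concrete with a few lines of arithmetic. In the sketch below the sampling rate of 16 kHz and the three window lengths are assumptions chosen for illustration, and `resolution` is a hypothetical helper.

```python
def resolution(fs, n_samples):
    """Return (window duration in ms, DFT bin spacing in Hz) for a window of n_samples."""
    duration_ms = 1000.0 * n_samples / fs
    bin_spacing_hz = fs / n_samples
    return duration_ms, bin_spacing_hz


fs = 16000  # sampling rate in Hz (assumed for illustration)
for n_samples in (64, 256, 1024):
    duration_ms, bin_spacing_hz = resolution(fs, n_samples)
    print(f"N={n_samples:5d}  window={duration_ms:6.1f} ms  bin spacing={bin_spacing_hz:6.1f} Hz")
```

The product of the window duration (in seconds) and the bin spacing (in Hz) is always one, so shortening the window to sharpen the time resolution coarsens the frequency grid by the same factor.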

Mallat (1989, p. 674) remarks: *"it is difficult to analyze the information content of an image directly from the gray-level intensity of the image pixels... Generally, the structures we want to recognize have very different sizes. Hence, it is not possible to define a priori an optimal resolution for analyzing images."* To improve the standard spectral output, we can calculate a multi-resolution (MR) spectrum. In the original papers, the MR analysis is based on discrete wavelet transforms (Grossmann & Morlet, 1984; Mallat, 1989; 2000; 2008). Since then it has been applied to several domains: image analysis (Mallat, 1989), time-frequency analysis (Cnockaert, 2008), speech enhancement (Fu & Wan, 2003; Manikandan, 2006), and automatic signal segmentation by searching for stationary areas in the scalogram (Leman & Marque, 1998).

The MR spectrum, a compromise that provides both a higher frequency resolution and a higher temporal resolution, is not a new method. In phonetic analysis, Annabi-Elkadri & Hamouda (2010; 2011) present a study of two common vowels, [a] and [E], in the Tunisian dialect and the French language. The vowels are pronounced in a Tunisian context. The analysis of the results shows that, due to the influence of the French language on the Tunisian dialect, the vowels [a] and [E] are, in some contexts, pronounced similarly. Annabi-Elkadri & Hamouda (2011, in press) apply the MRS in an automatic method for Silence/Sonorant/Non-Sonorant detection that uses the ANOVA method. The results are compared with classical classification methods such as Standard Deviation and Mean, and those obtained with ANOVA were better. The method for automatic Silence/Sonorant/Non-Sonorant detection based on the MRS provides better results than classical spectral analysis. Cheung & Lim (1991) presented a method for

After French colonization, the French government wanted to spread the French language in the country. The French instituted a bilingual education system with the Franco-Arab schools. Programs of bilingual schools were modeled primarily on the model of French primary education for children of European origin (French, Italian and Maltese), which were added courses in colloquial Arabic. As for the Tunisian children, they received their education in classical Arabic in order to study the Quoran. Only a small Tunisian elite received a truly bilingual education, in order to co-administer the country. Tunisian Muslim mass continued to speak only Arabic or one of its many varieties. The report of the Tunisian Minister of Affairs, Jean-Jules Jusserand, pursuing the logic of Jules Ferry. In a "Note on Education in Tunisia", dated February 1882, Jusserand exposing his ideas: "*We have not at this time we better way to assimilate the Arabs of Tunisia, to the extent that is possible, that they learn our language, it is the opinion of all who know them best: we can not rely on religion to make this comparison, they do not convert to Christianity, but as they learn our language, a host of European ideas will prove to be bound to them, as experience has sufficiently demonstrated. In the reorganization of Tunisia, a large part must*

Multi-Resolution Spectral Analysis of Vowels in Tunisian Context 53

After independence, education of the French language such as Arabic was required for all Tunisian children in primary school. This explains why French has become the second

There are different varieties of TD depending on the region, such as dialect of Tunis, Sahel, Sfax, etc. Its morphology, syntax, pronunciation and vocabulary are quite different from the Arabic (Marcais, 1950). There are several differences in pronunciation between Standard Arabic and TD. Short vowels are frequently omitted, especially where they would occur as the final element of an open syllable. While Standard Arabic can have only one consonant at the beginning of a syllable, after which a vowel must follow, TD commonly has two consonants in the onset. For example Standard Arabic "book" is /*kita*?*b*/, while in TD, it is /*kta*?*b*/. The nucleus in TD may contain a short or long vowel, and at the end of the syllable, in the coda, it may have up to three consonants, but in standard Arabic, we cannot have more than two consonants at the end of the syllable. Word-internal syllables are generally heavy in that they either have a long vowel in the nucleus or consonant in the coda. Non-final syllables composed of just a consonant and a short vowel (i.e. light syllables) are very rare in TD, and are generally loaned from standard Arabic: short vowels in this position have generally been lost, resulting in the many initial CC clusters. For example /?*awa*?*b*/ "reply" is a loan from Standard Arabic, but the same word has the natural development /?*wa*?*b*/, which is the usual

In TD's non-pharyngealised context, there is a strong fronting and closing of /*a*?/, which, especially among younger speakers in Tunis can reach as far as /*e*?/, and to a lesser extent of

This is an example of Tunisian Arabic sentence (SAMPA and X-SAMPA symbols): '/*ddZ bA* : *k Ukil* ?\*adailkas mta*?\*u milEna bil lE sykse*/'. This sentence is a mixture of three languages; '/*ddZ bA* : *k*/' in English, '/*Ukil* ?\*ada ilkas mta*?\*u milEna bil*/' in Tunisian Arabic, which means '*as usual the show is interesting*' and

We introduce a study of six common vowels [a], [E], [i], [e], [o] and [u] in TD and French. Vowels are pronounced in Tunisian context. Our study is realized in time-frequency domain.

language in Tunisia. It is spoken by the majority of the population.

*be made to education*".

word for "letter" (Gibson, 1998).

finally '/*lE sykse*/' in French, which means '*success*'.

/*a*/.

combining the wideband spectrogram and the narrowband spectrogram by evaluating the geometric mean of their corresponding pixel values. The combined spectrogram appears to preserve the visual features associated with high resolution in both frequency and time. Lee & Ching (1999) described an approach of using multi-resolution analysis (MRA) for clean connected speech and noisy phone conversation speech. Experiments show that the use of MRA cepstra results reduces significantly error insertion when compared with MFCCs. For music signals, Cancela et al. (2009) presents two algorithms: efficient constant-Q transform and multi-resolution FFT. They are reviewed and compared with a new proposal based on the IIR filtering of the FFT. The proposed method shows to be a good compromise between design flexibility and reduced computational effort. Additionally, it was used as a part of an effective melody extraction algorithm. In this context, Dressler was interested in the description of spectral analysis to extract melodies based on spectrograms multi-resolution (Dressler, 2006). The approach aims to extract the sinusoidal components of the audio signal. A calculation of the spectra of different resolutions of frequencies is done in order to detect sinusoids stable in different frames of the FFT. The evaluated results showed that the multi-resolution analysis improves the extraction of the sinusoidal.

The aim of this study is to extend the research of Annabi-Elkadri & Hamouda (2010; 2011). We present and test the concept of multi-resolution spectral analysis (MRS) of vowels in Tunisian words and in French words under the Tunisian context. Our method is composed of two parts. The first part applies the MRS method to the signal; the MRS is calculated by combining several FFTs of different lengths (Annabi-Elkadri & Hamouda, 2010; 2011). The second part is formant detection by applying multi-resolution LPC (Annabi-Elkadri & Hamouda, 2010). We present an improvement of our multi-resolution spectral analysis method, MR FFT. As an application, we used our system VASP on a Tunisian Dialect corpus pronounced by Tunisian speakers.

Standard Arabic is composed of 34 phonemes (Muhammad, 1990). It has three vowels, [a], [i] and [u], each with a long and a short form. Arabic phonemes include two distinctive classes, pharyngeal and emphatic, which are characteristic of Semitic languages (Elshafei, 1991; Muhammad, 1990). Arabic has two kinds of syllables: open syllables, (CV) and (CVV), and closed syllables, (CVC), (CVVC) and (CVCC), where V is a vowel and C is a consonant. The syllables (CVVC) and (CVCC) occur only at the end of the sentence (Muhammad, 1990).

In section 2, we present a brief history of the Tunisian Dialect and its relationship with Arabic and French. In section 3, we present our method for calculating the multi-resolution FFT. In section 4, we present the materials and methods, composed of our corpus and our system, the Visual Assistance of Speech Processing Software (VASP). We present our experimental results in section 5 and discuss them in section 6. Our conclusion is presented in section 7.

#### **2. History of Tunisian dialect and its relationship with Arabic and French**

The official language in Tunisia is Arabic, but the popular language is the Tunisian Dialect (TD). It is a mix of Arabic with many other languages: French, Italian, English, Turkish, German, Berber and Spanish. This mixture is related to the history of Tunisia, since it was invaded and colonized by many civilizations, such as the Romans, the Vandals, the Byzantines, the Arab Muslims and the French.


After French colonization, the French government wanted to spread the French language in the country. The French instituted a bilingual education system with the Franco-Arab schools. The programs of the bilingual schools were modeled primarily on French primary education for children of European origin (French, Italian and Maltese), to which courses in colloquial Arabic were added. As for the Tunisian children, they received their education in classical Arabic in order to study the Quran. Only a small Tunisian elite received a truly bilingual education, in order to co-administer the country. The Tunisian Muslim masses continued to speak only Arabic or one of its many varieties. The report of the Tunisian Minister of Affairs, Jean-Jules Jusserand, pursued the logic of Jules Ferry. In a "Note on Education in Tunisia", dated February 1882, Jusserand set out his ideas: "*We have at this time no better way to assimilate the Arabs of Tunisia, to the extent that is possible, than that they learn our language; it is the opinion of all who know them best: we can not rely on religion to make this comparison, they do not convert to Christianity, but as they learn our language, a host of European ideas will prove to be bound to them, as experience has sufficiently demonstrated. In the reorganization of Tunisia, a large part must be made to education*".

After independence, instruction in French, as well as in Arabic, was required for all Tunisian children in primary school. This explains why French has become the second language in Tunisia. It is spoken by the majority of the population.

There are different varieties of TD depending on the region, such as dialect of Tunis, Sahel, Sfax, etc. Its morphology, syntax, pronunciation and vocabulary are quite different from the Arabic (Marcais, 1950). There are several differences in pronunciation between Standard Arabic and TD. Short vowels are frequently omitted, especially where they would occur as the final element of an open syllable. While Standard Arabic can have only one consonant at the beginning of a syllable, after which a vowel must follow, TD commonly has two consonants in the onset. For example Standard Arabic "book" is /*kita*?*b*/, while in TD, it is /*kta*?*b*/. The nucleus in TD may contain a short or long vowel, and at the end of the syllable, in the coda, it may have up to three consonants, but in standard Arabic, we cannot have more than two consonants at the end of the syllable. Word-internal syllables are generally heavy in that they either have a long vowel in the nucleus or consonant in the coda. Non-final syllables composed of just a consonant and a short vowel (i.e. light syllables) are very rare in TD, and are generally loaned from standard Arabic: short vowels in this position have generally been lost, resulting in the many initial CC clusters. For example /?*awa*?*b*/ "reply" is a loan from Standard Arabic, but the same word has the natural development /?*wa*?*b*/, which is the usual word for "letter" (Gibson, 1998).

In TD's non-pharyngealised context, there is a strong fronting and closing of /*a*?/, which, especially among younger speakers in Tunis can reach as far as /*e*?/, and to a lesser extent of /*a*/.

This is an example of Tunisian Arabic sentence (SAMPA and X-SAMPA symbols): '/*ddZ bA* : *k Ukil* ?\*adailkas mta*?\*u milEna bil lE sykse*/'. This sentence is a mixture of three languages; '/*ddZ bA* : *k*/' in English, '/*Ukil* ?\*ada ilkas mta*?\*u milEna bil*/' in Tunisian Arabic, which means '*as usual the show is interesting*' and finally '/*lE sykse*/' in French, which means '*success*'.

We introduce a study of six common vowels, [a], [E], [i], [e], [o] and [u], in TD and French. Vowels are pronounced in Tunisian context. Our study is carried out in the time-frequency domain.


#### **3. Multi-resolution FFT**

It is difficult to choose the ideal window with the ideal characteristics. The size of the ideal window (Hancq & Leich, 2000) is equal to twice the pitch period of the signal. A wider window shows the harmonics in the spectrum, whereas a shorter window approximates the spectral envelope very roughly. Choosing the window therefore amounts to estimating the energy dispersion with the least error. When we calculated the windowed FFT, we assumed that the energy was concentrated at the center of the frame (Haton et al., 2006, p. 41). We denote this center *Cp*. Our problem now is the estimation of *Cp*.

#### **3.1 The center estimation in the case of the Discrete Fourier Transform (DFT)**

We would like to calculate the spectrum of the speech signal *s*. We denote by *L* the length of *s*. The first step is to divide *s* into frames. The size of each frame was between 10 ms and 20 ms (Calliope, 1989; Ladefoged, 1996) to meet the stationarity condition. We chose the Hamming window and fixed its size to 512 points and the overlap to 50%. Figure 1 shows the principle of the center estimation.

Fig. 1. Signal sampling and windowing for center estimation *Cp*. The window length *N* = 512 points and overlap = 50%.

For each frame *p*, the center *Cp* was estimated:

$$\begin{cases} \mathsf{C}\_{1} = \ \mathsf{x}\_{256} & \text{for} \quad p = 1\\ \mathsf{C}\_{2} = \mathsf{x}\_{2 \ast 256} & \text{for} \quad p = 2\\ \vdots\\ \mathsf{C}\_{p} = \mathsf{x}\_{256p} & \text{in} \quad \text{general} \quad \text{case} \end{cases}$$

The center of frame *p* is therefore:

$$\mathsf{C}\_{p} = \mathsf{x}\_{256p}, \qquad p = 1, \ldots, \left[\frac{L-1}{256} - 1\right]$$

where [ ] denotes the integer part.

Each signal *s* was divided into frames. Frame number *p* was composed of *N* = 512 points:

$$\begin{cases} s\_0(p) = \ x\_{256(p-1)} \\ s\_1(p) = \ x\_{256(p-1)+1} \\ \vdots \\ s\_{511}(p) = \ x\_{256(p-1)+511} \end{cases}$$

In the general case, for component number *l* of frame *p*:

$$s\_l(p) = x\_{256(p-1)+l}$$
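The indexing above can be checked with a minimal sketch. It assumes a 0-based sample array `x` and 1-based frame numbers *p*, as in the equations; `frame` and `center_index` are hypothetical helper names.

```python
def frame(x, p, n=512):
    """Frame number p (1-based) with 50% overlap: s_l(p) = x[(n/2)(p-1) + l]."""
    hop = n // 2                      # 50% overlap means a hop of n/2 samples
    start = hop * (p - 1)
    return x[start:start + n]


def center_index(p, n=512):
    """Index of the center C_p = x[(n/2) * p] of frame number p."""
    return (n // 2) * p


x = list(range(2048))                 # toy signal: each sample equals its own index
s2 = frame(x, 2)
print(s2[0], s2[-1], center_index(2))
```

With this toy signal, the first and last samples of a frame directly reveal which part of *x* the frame covers, and the sample in the middle of the frame coincides with the center given by the closed form.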

The windowed FFT of frame number *p* was calculated as:

$$S\_k(p) = \sum\_{l=0}^{511} s\_l(p) e^{-\frac{2j\pi kl}{512}} w(s\_l(p) - s\_{256}(p)) \tag{1}$$

In the general case:


$$S\_k(p) = \sum\_{l=0}^{N-1} s\_l(p) e^{-\frac{2j\pi kl}{N}} w(s\_l(p) - s\_{\frac{N}{2}}(p)) \tag{2}$$

We denote by

$$C\_p = s\_{\frac{N}{2}}(p), \qquad p = 1, \ldots, \left[\frac{2(L-1)}{N} - 1\right]$$

the center of frame number *p*, where [ ] denotes the integer part, *sl*(*p*) is component number *l* of *s* in frame *p*, *Sk*(*p*) is component number *k* of *S* in frame *p*, *L* is the length of the signal *s*, and *N* is the length of the window *w*.
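As a rough illustration of Eq. (2), the sketch below evaluates the windowed transform of a single frame. It rests on assumptions: the window term is read as a Hamming window centred on the frame, `hamming` and `windowed_dft` are hypothetical names, and the direct sum is used for readability instead of an FFT.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * l / (n - 1)) for l in range(n)]


def windowed_dft(frame):
    """S_k = sum_l s_l * w_l * exp(-2j*pi*k*l/N) for k = 0..N-1 (direct sum)."""
    n = len(frame)
    w = hamming(n)
    return [
        sum(frame[l] * w[l] * cmath.exp(-2j * math.pi * k * l / n) for l in range(n))
        for k in range(n)
    ]


N = 64                                             # short frame keeps the direct sum cheap
tone = [math.cos(2 * math.pi * 5 * l / N) for l in range(N)]   # 5 cycles per frame
magnitudes = [abs(v) for v in windowed_dft(tone)]
peak = max(range(N // 2), key=lambda k: magnitudes[k])
print(peak)
```

The test tone completes a whole number of cycles in the frame, so the largest magnitude falls on the bin that matches its frequency, while the window spreads some energy onto the neighbouring bins.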

#### **3.2 The center estimation in the case of the MR FFT**

To improve the standard spectrum, we calculated the MR FFT by combining several FFTs of different lengths. The temporal accuracy is higher in the high-frequency region, and the frequency resolution is higher in the low-frequency region.

We calculated the windowed FFT of the signal *NB* times. The number of steps *NB* was equal to the number of frequency bands, fixed a priori. For each step number *i* (*i* ≤ *NB*), the signal *s* was divided into frames *si*(*pi*) and windowed with the window *w*. We denote by *Ni* the length of the frames and of *w* for step *i*, and by *Ci*,*pi* the center of *w*.

The spectrum *Si*,*k*(*pi*) for each step *i* was:

$$S\_{i,k}(p\_i) = \sum\_{l=0}^{N\_i - 1} s\_{i,l}(p\_i) e^{-\frac{2j\pi kl}{N}} w(s\_{i,l}(p\_i) - C\_{i,p\_i}) \tag{3}$$

with:

$$C\_{i,p\_i} = s\_{i,\frac{N\_i}{2}}(p\_i)$$

the center of the frame *pi* when the overlap is *Ni*/2.

In MRS, an overlap of *Ni*/2 cannot satisfy the principle of continuity of the MRS across the different frequency bands. A low overlap causes a discontinuity in the MRS and thus gives a bad estimation of the energy dispersal. Our problem therefore consisted in choosing the overlap. It was necessary that the frames overlap by more than 50% of the frame length. We chose an overlap equal to 75% (fig. 2).

For the frame *pi* = 1 of the step number *i*, we have *Ni* components:

$$\begin{cases} s\_0(1) = x\_0 \\ s\_1(1) = x\_1 \\ \vdots \\ s\_l(1) = x\_l \\ \vdots \\ s\_{N\_i - 1}(1) = x\_{N\_i - 1} \end{cases}$$


Fig. 2. Signal sampling and windowing for center estimation *Ci*,*pi* (overlap = 75%).

For the frame *pi* = 2 of the step number *i*, we have *Ni* components:

$$\begin{cases} s\_0(2) = x\_{N\_i} \\ s\_1(2) = x\_{N\_i + 1} \\ \vdots \\ s\_l(2) = x\_{N\_i + l} \\ \vdots \\ s\_{N\_i - 1}(2) = x\_{N\_i + N\_i - 1} \end{cases}$$

In the general case, for the frame *pi* of the step number *i*, we have *Ni* components:

$$\begin{cases} s\_0(p\_i) = x\_{(p\_i-1)N\_i} \\ s\_1(p\_i) = x\_{(p\_i-1)N\_i + 1} \\ \vdots \\ s\_l(p\_i) = x\_{(p\_i-1)N\_i + l} \\ \vdots \\ s\_{N\_i-1}(p\_i) = x\_{(p\_i-1)N\_i + N\_i - 1} = x\_{p\_i N\_i - 1} \end{cases}$$

The center *Ci*,*pi* of *pi* = 1 was:

$$\mathbf{C}\_{i,1} = \frac{N\_i}{2}.$$

The center *Ci*,*pi* of *pi* = 2 was:

$$\begin{cases} \mathbf{C}\_{i,2} = \frac{1}{2} (\frac{1}{4} + \frac{5}{4}) N\_i \\ \quad = \frac{3}{4} N\_i \end{cases}$$

In general case, the center *Ci*,*pi* of *pi* was:

$$\begin{cases} \mathbf{C}\_{i,p\_i} = & \mathbf{C}\_{i,p\_i-1} + \frac{N\_i}{4} \\ &= \mathbf{C}\_{i,1} + (p\_i - 1)\frac{N\_i}{4} \\ &= & \mathbf{x}\_{\frac{N\_i(p\_i+1)}{4}} \end{cases}$$

with:

$$\frac{N\_i(p\_i+1)}{4} \le L \quad \text{and} \quad p\_i \le \frac{4L}{N\_i} - 1$$
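The closed form for the center can be checked numerically. The sketch below is an illustration under an explicit assumption: consecutive frames are taken to start *Ni*/4 samples apart, which is the hop implied by the 75% overlap and by the recurrence on the centers. The names `frame_bounds`, `center` and `max_frame` are hypothetical.

```python
def frame_bounds(p, n):
    """Start and end (exclusive) of frame p (1-based), assuming a hop of n/4."""
    start = (p - 1) * n // 4
    return start, start + n


def center(p, n):
    """Center index from the closed form N_i * (p_i + 1) / 4."""
    return n * (p + 1) // 4


def max_frame(signal_length, n):
    """Largest frame number allowed by p_i <= 4L/N_i - 1."""
    return 4 * signal_length // n - 1


N_I = 512                                 # example window length for one step i
for p in (1, 2, 3):
    start, end = frame_bounds(p, N_I)
    print(p, start, end, center(p, N_I))
```

Under this assumption, the closed form gives the midpoint of every frame, and each center lies *Ni*/4 samples after the previous one, in agreement with the recurrence.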


The spectrum *Si*,*k*(*pi*) of each step *i* was:

$$S\_{i,k}(p\_i) = \sum\_{l=0}^{N\_i - 1} s\_{i,l}(p\_i) e^{-\frac{2j\pi kl}{N}} w(s\_{i,l}(p\_i) - C\_{i,p\_i}) \tag{4}$$

with:

$$C\_{i,p\_i} = x\_{\frac{N\_i(p\_i+1)}{4}}$$

the center of the frame *pi*, and the overlap equal to 75%.

So, the multi-resolution spectrum (MRS) was:

$$S\_k(p) = S\_{i,k}(p\_i) \quad \text{if} \quad k\_i \le k \le k\_{i+1} \tag{5}$$

with:

$$0 \le k \le N\_0 + N\_1 + \ldots + N\_P \quad \text{and} \quad 1 \le p \le P.$$
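The band-wise assembly of Eq. (5) can be sketched for a single analysis instant. This is not the VASP implementation; it is a minimal sketch under stated assumptions: the window durations (23, 20, 15, 11 ms) and the band limits are those quoted in the caption of Fig. 5, all bands are evaluated on one shared frequency grid of `n_fft` = 2048 points (an assumption made so that the bins of the different steps line up), and `mr_spectrum` is a hypothetical name.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * l / (n - 1)) for l in range(n)]


def mr_spectrum(x, fs, center, win_ms, band_limits_hz, n_fft):
    """Magnitude MRS at one analysis instant, as a dict {bin k: magnitude}.

    Band i (band_limits_hz[i] <= f < band_limits_hz[i + 1]) is analysed with a
    Hamming window of win_ms[i] milliseconds centred on sample `center`.
    All bands share the frequency grid k * fs / n_fft (an assumption).
    """
    spectrum = {}
    for i, duration in enumerate(win_ms):
        n_i = int(round(duration * 1e-3 * fs))         # window length N_i in samples
        start = center - n_i // 2
        w = hamming(n_i)
        frame = [x[start + l] * w[l] for l in range(n_i)]
        k_lo = math.ceil(band_limits_hz[i] * n_fft / fs)
        k_hi = math.ceil(band_limits_hz[i + 1] * n_fft / fs)
        for k in range(k_lo, k_hi):                    # keep only the bins of band i
            spectrum[k] = abs(sum(
                frame[l] * cmath.exp(-2j * math.pi * k * l / n_fft)
                for l in range(n_i)
            ))
    return spectrum


FS = 44100
x = [math.sin(2 * math.pi * 1000 * t / FS) for t in range(2048)]   # 1 kHz test tone
mrs = mr_spectrum(x, FS, center=1024,
                  win_ms=(23, 20, 15, 11),
                  band_limits_hz=(0, 2000, 4000, 7000, 10000),
                  n_fft=2048)
peak_bin = max(mrs, key=mrs.get)
print(peak_bin, peak_bin * FS / 2048)
```

Each band keeps only its own bins, so the longest window contributes the low frequencies and the shortest window the high frequencies. Repeating the computation at successive centers, spaced by the 75% overlap, would give one column of an MR sonagram per instant.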

The size of the ideal window (Hancq & Leich, 2000) is equal to twice the pitch period of the signal. A wider window shows the harmonics in the spectrum, whereas a shorter window approximates the spectral envelope very roughly.

To improve the standard spectrum, we calculate a multi-resolution spectrum (MRS) with two methods: by combining several FFTs of different lengths, and by combining several windows for each FFT (Annabi-Elkadri & Hamouda, 2010).

The diagrams in figure 3 illustrate the difference between the standard FFT and the MRS. For a standard FFT, the window size is the same for every frequency band, whereas the MRS window size depends on the frequency band.

Fig. 3. Standard FFT (on the left) and MR FFT (on the right)

Figure 4 shows the classical sonagram (Hamming window, 11 ms, with an overlap equal to 1/3). The sentence pronounced is: "Le soir approchait, le soir du dernier jour de l'année". Figure 5 shows the multi-resolution sonagram of the same sentence. It offers several integration times, which are combinations of several FFTs of different lengths depending on the frequency band.

Fig. 4. Classical sonagram (Hamming, 11 ms, overlap 1/3) of this sentence: "Le soir approchait, le soir du dernier jour de l'année"

Fig. 5. MR sonagram; Hamming (23, 20, 15, 11 ms), overlap 75%, band limits in Hz [0, 2000, 4000, 7000, 10000], of this sentence: "Le soir approchait, le soir du dernier jour de l'année"

#### **4. Materials and methods**

#### **4.1 Corpus**

Our corpus is composed of TD pronounced by Tunisian speakers. The sampling frequency is equal to 44.1 kHz, and the wav format was adopted in mono-stereo. We avoided all types of noise filter that would degrade the quality of the signal and thus cause information loss. We recorded real-time spontaneous lyrics and discussions of four speakers. We removed noise and background sounds such as laughing, music, etc. It was difficult to build a spontaneous corpus because, in real time, it is impossible to obtain all phonemes and syllables. Another difficulty was the variability of discussion themes and pronounced sounds. For these reasons, we decided to complete our corpus with a second one. We prepared a text in Tunisian dialect containing all the sounds to study. Every phoneme and syllable appeared 15 times. We asked four speakers, two men and two women, to read the text aloud under the same conditions as the recordings of the first corpus. All speakers are between 25 and 32 years old.

The speakers did not know the text beforehand. The sampling frequency is again equal to 44.1 kHz, and the wav format was adopted in mono-stereo. We avoided marked accents and all types of noise filter that would degrade the quality of the signal and thus cause information loss. Our corpus was transcribed into sentences, words and phonemes.

#### **4.2 VASP software: Visual assistance of speech processing software**

For our study, we created a first prototype, the System for Visual Assistance of Speech Processing (VASP). It offers many functions for speech visualization and analysis. We developed our system as a Matlab GUI. In the following, we present some of the functionalities offered by our system.

VASP reads sound files in wav format. It represents a wav file in the time domain as a waveform and in the time-frequency domain by spectral representations: classical spectrogram in narrow band and wide band (see figure 6), spectrograms calculated with linear prediction and cepstral coefficients, gammatone, discrete cosine transform (DCT), Wigner-Ville transformation, multi-resolution LPC representation (MR LPC), multi-resolution spectral representation (MR FFT) and multi-resolution spectrogram.

Fig. 6. Classical Spectrogram Interface.

From the waveform, we can choose, in real time, the frame for which we want to represent a spectrum (see figure 7). Parameters are manipulated from a menu; we can select the type of window (Hamming, Hanning, triangular, rectangular, Kaiser, Bartlett, Gaussian and Blackman-Harris), the window length (64, 128, 256, 512, 1024 and 2048 samples) and the LPC factor. From all visual representations, the coordinates of any pixel can be read. For example, we can select a point from a spectrogram and read its coordinates directly (time, frequency and intensity).

VASP offers the possibility to choose a part of a signal in order to calculate and visualize it in any of the time-frequency representations.

Our system can automatically detect Silence/Speech from a waveform. From the spectrogram, the system can detect acoustic cues such as formants, and classify them automatically into two classes: sonorant or non-sonorant.

Fig. 4. Classical sonagram (Hamming, 11 ms, overlap 1/3) of this sentence: "Le soir

approchait, le soir du dernier jour de l'année"

l'année"

Speakers don't know the text. The sampling frequency is equal to 44,1 KHz, the wav format was adopted in mono-stereo. We avoided the remarkable accents and all types of noise filter that would degrade the quality of the signal and thus, causes information lost. Our corpus was transcribed in sentences, words and phonemes.

#### **4.2 VASP software: Visual assistance of speech processing software**

For our study, we created a first prototype, the System for Visual Assistance of Speech Processing (VASP). It offers many functions for speech visualization and analysis. We developed the system with the Matlab GUI tools. In the following, we present some of the functionalities offered by our system.

VASP reads sound files in wav format. It represents a wav file in the time domain by its waveform and in the time-frequency domain by several spectral representations: classical narrow-band and wide-band spectrograms (see figure 6), spectrograms calculated with linear prediction and cepstral coefficients, gammatone, discrete cosine transform (DCT), Wigner-Ville transformation, multi-resolution LPC representation (MR LPC), multi-resolution spectral representation (MR FFT) and multi-resolution spectrogram.
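The multi-resolution idea can be sketched for a single analysis instant as follows. This is only one plausible reading, not the VASP implementation: each frequency band is analysed with its own Hamming window, using the band limits and window durations quoted in the Fig. 5 caption. The function names, the direct DFT and the 20 kHz test sampling rate are assumptions made for illustration.

```python
import cmath
import math

# Band limits (Hz) and Hamming window durations (ms) quoted in the Fig. 5 caption.
BAND_LIMITS_HZ = [0, 2000, 4000, 7000, 10000]
WINDOW_MS = [23, 20, 15, 11]


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def mr_spectrum(signal, center, fs):
    """Multi-resolution magnitude spectrum around sample index `center`.

    Each band is analysed with its own window length, so low bands get
    finer frequency resolution and high bands finer time resolution.
    Returns (frequency_hz, magnitude) pairs in ascending frequency order.
    """
    points = []
    bands = zip(BAND_LIMITS_HZ[:-1], BAND_LIMITS_HZ[1:], WINDOW_MS)
    for lo, hi, ms in bands:
        n = round(ms * fs / 1000)          # window length in samples
        start = center - n // 2
        if start < 0 or start + n > len(signal):
            raise ValueError("analysis window does not fit inside the signal")
        w = hamming(n)
        x = [signal[start + i] * w[i] for i in range(n)]
        # Keep only DFT bins k whose frequency k*fs/n lies in [lo, hi).
        k_lo = math.ceil(lo * n / fs)
        k_hi = min(math.ceil(hi * n / fs), n // 2 + 1)
        for k in range(k_lo, k_hi):
            # Direct DFT of one bin; an FFT would be used in practice.
            acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            points.append((k * fs / n, abs(acc)))
    return points


# Usage: a 3 kHz tone sampled at 20 kHz (chosen so that 10 kHz is the Nyquist limit).
fs = 20000
tone = [math.sin(2 * math.pi * 3000 * t / fs) for t in range(1000)]
spectrum = mr_spectrum(tone, center=500, fs=fs)
peak_hz = max(spectrum, key=lambda p: p[1])[0]
```

Repeating this computation at successive instants, with the overlap given in the caption, would produce one column of a multi-resolution spectrogram per instant.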

Fig. 6. Classical Spectrogram Interface.

From the waveform, we can choose, in real time, the frame for which we want to display a spectrum (see figure 7). Parameters are set from a menu: we can select the window type (Hamming, Hanning, triangular, rectangular, Kaiser, Bartlett, Gaussian and Blackman-Harris), the window length (64, 128, 256, 512, 1024 and 2048 samples) and the LPC factor.
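As an illustration of what selecting a frame, a window type and a window length amounts to, the sketch below computes the magnitude spectrum of one Hamming-windowed frame. It is a minimal example and not VASP code: the function names are hypothetical, and a direct DFT is used for clarity where a real tool would call an FFT routine.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_spectrum(signal, start, length, fs):
    """Magnitude spectrum of signal[start:start+length] after Hamming windowing.

    Returns (frequencies in Hz, magnitudes) for bins 0..length//2.
    """
    frame = signal[start:start + length]
    if len(frame) != length:
        raise ValueError("frame runs past the end of the signal")
    w = hamming(length)
    x = [s * wi for s, wi in zip(frame, w)]
    freqs, mags = [], []
    for k in range(length // 2 + 1):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / length)
                  for n in range(length))
        freqs.append(k * fs / length)
        mags.append(abs(acc))
    return freqs, mags


# Usage: a 1 kHz tone sampled at 8 kHz, analysed with a 64-sample frame.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
freqs, mags = frame_spectrum(tone, start=0, length=64, fs=fs)
peak_hz = freqs[mags.index(max(mags))]
```

A longer window narrows the spectral peak (better frequency resolution) at the cost of averaging over a longer stretch of signal, which is the trade-off the window-length menu exposes.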

From all visual representations, coordinates of any pixel can be read. For example, we can select a point from a spectrogram and read its coordinates directly (time, frequency and intensity).

VASP also lets the user select a part of a signal and compute and visualize it in any of the time-frequency representations.

Our system can automatically detect silence and speech from a waveform. From the spectrogram, the system can detect acoustic cues such as formants and classify them automatically into two classes: sonorant or non-sonorant.
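The text does not state how silence is separated from speech. As a stand-in, the sketch below uses a common short-time energy threshold; the frame length, the threshold ratio and the function name are assumptions chosen for illustration, not the detector used in VASP.

```python
import math


def speech_frames(signal, frame_len, threshold_ratio=0.1):
    """Label each non-overlapping frame True (speech) or False (silence).

    A frame counts as speech when its mean energy exceeds
    threshold_ratio times the largest frame energy in the signal.
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    if not energies:
        return []
    peak = max(energies)
    if peak == 0:
        return [False] * len(energies)
    return [e > threshold_ratio * peak for e in energies]


# Usage: 0.1 s of silence, 0.1 s of a 200 Hz tone, 0.1 s of silence at 8 kHz.
fs = 8000
silence = [0.0] * 800
burst = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
labels = speech_frames(silence + burst + silence, frame_len=400)
```

A relative threshold like this one is sensitive to background noise; real recordings usually need an adaptive noise-floor estimate.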

Fig. 7. Waveform and spectrum of the selected signal.

Our system can analyse visual representations with two methods: image analysis with edge detection, and sound signal analysis. Edge detection is calculated with the gradient method or the median filter method (fig. 8). The second method is based on detecting energy in a time-frequency representation.

Fig. 8. Edge detection calculated with the gradient method.
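A gradient-based edge detector on a time-frequency image can be sketched as below. This is a generic central-difference version written for illustration; the threshold value and the function name are assumptions, and the exact operator used in VASP is not specified in the text.

```python
import math


def gradient_edges(image, threshold):
    """Mark pixels whose gradient magnitude exceeds `threshold`.

    `image` is a list of rows (e.g. frequency bins) of equal length
    (e.g. time frames). Border pixels are left unmarked.
    """
    rows, cols = len(image), len(image[0])
    edges = [[False] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = (image[i][j + 1] - image[i][j - 1]) / 2.0   # along time
            gy = (image[i + 1][j] - image[i - 1][j]) / 2.0   # along frequency
            edges[i][j] = math.hypot(gx, gy) > threshold
    return edges


# Usage: a toy "spectrogram" with one horizontal high-energy band (rows 3-4).
toy = [[1.0 if i in (3, 4) else 0.0 for _ in range(5)] for i in range(8)]
edges = gradient_edges(toy, threshold=0.25)
edge_rows = [i for i in range(8) if edges[i][2]]
```

On a real spectrogram the marked rows trace the upper and lower boundaries of high-energy bands such as formants.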

#### **5. Experimental results**

Formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal rather than just measured. The spectral shape of the vocal tract excitation strongly influences the observed spectral envelope. However, not all vocal tract resonances cause peaks in the observed spectral envelope.

To extract formant frequencies from the signal, we resampled it to 8 kHz. We use a linear prediction method for our analysis. Linear prediction models the signal as if it were generated by a signal of minimum energy being passed through a purely recursive IIR filter. Multi-resolution LPC (MR LPC) is calculated as the LPC of the average of the convolution of several windows with the signal.
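The linear prediction step can be illustrated with the classical autocorrelation method and the Levinson-Durbin recursion, taking formant candidates as peaks of the resulting all-pole envelope. This is a sketch under stated assumptions rather than the authors' code: the prediction order, the peak-picking grid and the synthetic two-resonance test signal are all chosen for the example.

```python
import cmath
import math


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns a[0..order] with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_envelope_peaks(a, fs, n_points=512):
    """Frequencies (Hz) of local maxima of 1/|A(e^jw)| between 0 and fs/2."""
    env = []
    for i in range(n_points):
        w = math.pi * i / (n_points - 1)
        big_a = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        env.append(1.0 / abs(big_a))
    return [i * (fs / 2) / (n_points - 1)
            for i in range(1, n_points - 1)
            if env[i] > env[i - 1] and env[i] >= env[i + 1]]


# Usage: impulse response of an all-pole filter with resonances at 500 and 1500 Hz.
fs = 8000
r_pole = 0.97
true_a = [1.0]
for f in (500.0, 1500.0):
    theta = 2 * math.pi * f / fs
    section = [1.0, -2 * r_pole * math.cos(theta), r_pole * r_pole]
    true_a = [sum(true_a[i] * section[k - i]
                  for i in range(len(true_a)) if 0 <= k - i < 3)
              for k in range(len(true_a) + 2)]
x = []
for n in range(400):
    past = sum(true_a[k] * x[n - k] for k in range(1, 5) if n - k >= 0)
    x.append((1.0 if n == 0 else 0.0) - past)
a = levinson(autocorr(x, 4), 4)
formants = lpc_envelope_peaks(a, fs)
```

For real speech resampled to 8 kHz, a higher order (around 10 is a common choice) and a windowed frame would be used, and the resulting peaks still need to be checked against plausible formant ranges.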

To our knowledge, there are no normative studies of Standard Arabic vowels like those of Peterson and Barney (Peterson & Barney, 1952) for American English, and those of Fant et al. (Fant, 1969) for Swedish.

We applied VASP to TD and to the French language. We measured the first two formants F1 and F2 of vowels in Tunisian and French words. We compared our experimental results with those of Calliope (Calliope, 1989). Their corpus consists of vowel repetitions in the [p\_R] context, pronounced by 10 men and 9 women.

The analysis of the obtained results shows that, due to the influence of the French language on TD, the vowels [a], [E], [i], [e], [o] and [u] are, in some contexts, pronounced similarly.

#### **5.1 Tunisian dialect and French spectral analysis of [a] in Tunisian context**

We measured the first two formants F1 and F2 of the vowel [a] in Tunisian and French words in Tunisian context. Figures 9(a) and 9(b) show scatter plots of the formants for [a] in Tunisian words and in French words, respectively. The two scatter plots match closely for F1 (400-700 Hz) and for F2 (500-3000 Hz).

(a) Tunisian words

(b) French words

Fig. 9. Variation of the first two formants of [a].

| | Calliope F1 | Calliope F2 | French F1 | French F2 | Tunisian Dialect F1 | Tunisian Dialect F2 |
|---|---|---|---|---|---|---|
| Med | 684 | 1256 | 512 | 2137 | 515 | 1758 |
| SD | 47 | 32 | 74 | 600 | 158 | 383 |

Table 1. Values of median (Med) and Standard Deviation (SD) of the first two formants (F1, F2) of [a] for Tunisian Dialect and French language

#### **5.2 Tunisian dialect and French spectral analysis of [E] in Tunisian context**

We measured the first two formants F1 and F2 of the vowel [E] in Tunisian and French words in Tunisian context. Figures 10(a) and 10(b) show scatter plots of the formants for [E] in Tunisian words and in French words, respectively. The two scatter plots match closely for F1 (300-550 Hz) and for F2 (1500-3000 Hz).

(a) Tunisian words

(b) French words

Fig. 10. Variation of the first two formants of [E].

#### **5.3 Tunisian dialect and French spectral analysis of [i] in Tunisian context**

We measured the first two formants F1 and F2 of the vowel [i] in Tunisian and French words in Tunisian context. Figures 11(a) and 11(b) show scatter plots of the formants for [i] in Tunisian words and in French words, respectively. The two scatter plots match closely for F1 (250-400 Hz) and for F2 (1800-2500 Hz).

#### **5.4 Tunisian dialect spectral analysis of [o] and [e] in Tunisian context**

We measured the first two formants F1 and F2 of the vowels [o] and [e] in Tunisian words. Figures 12(a) and 12(b) show scatter plots of the formants for [o] and [e] in Tunisian words.

#### **5.5 Tunisian dialect and French spectral analysis of [u] in Tunisian context**

We measured the first two formants F1 and F2 of the vowel [u] in Tunisian and French words in Tunisian context. Figures 13(a) and 13(b) show scatter plots of the formants for [u] in Tunisian words and in French words, respectively. The two scatter plots match closely for F1 (300-440 Hz) and for F2 (1500-3000 Hz).

(a) Tunisian words

(b) French words

Fig. 11. Variation of the first two formants of [i].

#### **5.6 Results of Tunisian dialect formants variation**

Figure 14 shows a scatter plot of the mean formants for the vowels [a], [e], [E], [o], [i] and [u] in Tunisian words.

#### **6. Discussion**

We compared our experimental results with those of Calliope (1989) in Tables 1, 2, 3, 4 and 5. Their corpus consisted of two repetitions of [a], [E], [i], [o], [e] and [u] in the [p\_R] context, pronounced by 10 men and 9 women.


| | Calliope F1 | Calliope F2 | French F1 | French F2 | Tunisian Dialect F1 | Tunisian Dialect F2 |
|---|---|---|---|---|---|---|
| Med | 530 | 1718 | 436 | 2374 | 464 | 2017 |
| SD | 49 | 132 | 63 | 354 | 90 | 360 |

Table 2. Values of median (Med) and Standard Deviation (SD) of the first two formants (F1, F2) of [E] for Tunisian Dialect and French language

| | Calliope F1 | Calliope F2 | French F1 | French F2 | Tunisian Dialect F1 | Tunisian Dialect F2 |
|---|---|---|---|---|---|---|
| Med | 308 | 2064 | 307.69 | 2175.12 | 384.3 | 2181.32 |
| SD | 34 | 134 | 415.37 | 241.61 | 61.57 | 249.6 |

Table 3. Values of median (Med) and Standard Deviation (SD) of the first two formants (F1, F2) of [i] for Tunisian Dialect and French language

| | [o] Calliope F1 | [o] Calliope F2 | [e] Calliope F1 | [e] Calliope F2 | [o] Tunisian Dialect F1 | [o] Tunisian Dialect F2 | [e] Tunisian Dialect F1 | [e] Tunisian Dialect F2 |
|---|---|---|---|---|---|---|---|---|
| Med | 383 | 793 | 381 | 1417 | 381.4 | 2184.32 | 380.23 | 2188.11 |
| SD | 22 | 63 | 44 | 106 | 37.88 | 258.24 | 39.85 | 153.44 |

Table 4. Values of median (Med) and Standard Deviation (SD) of the first two formants (F1, F2) of [e] and [o] for Tunisian Dialect and French language

(a) Variation of the first two formants of [o]

(b) Variation of the first two formants of [e]

Fig. 12. Variation of the first two formants of [o] and [e] for Tunisian words.

(a) Tunisian words

(b) French words

Fig. 13. Variation of the first two formants of [u].

Fig. 14. Variation of the first two formants of Tunisian Dialect vowels.

In French, [i] is a front vowel. The difference between the first two formants is large: F1 is around 200 Hz and F2 is greater than 2000 Hz. For the Tunisian dialect, we note that [i] retains the characteristics of a front vowel. The average value of F1 was 443 Hz, higher than that of the French language. It can be considered a lax variant of the vowel [i], similar to a close-mid front vowel: the tongue was positioned halfway between a close vowel and a mid vowel, but it was less constricted.

| | Calliope F1 | Calliope F2 | French F1 | French F2 | Tunisian Dialect F1 | Tunisian Dialect F2 |
|---|---|---|---|---|---|---|
| Med | 315 | 764 | 575.52 | 2355.61 | 381.58 | 2183.57 |
| SD | 43 | 59 | 199.38 | 172.25 | 48.05 | 392.42 |

Table 5. Values of median (Med) and Standard Deviation (SD) of the first two formants (F1, F2) of [u] for Tunisian Dialect and French language

We can say the same thing about [u] for the Tunisian dialect. The average value of F1, equal to 406 Hz, was greater than that of the French language (200 Hz), and the average value of F2, equal to 2195 Hz, was higher than that of the French language (1200 Hz).

The vowel [a] was open front. The average value of F1 was 443 Hz and the average value of F2 was 2090 Hz for the Tunisian dialect. F1 was lower than the average first French formant (700 Hz), and F2 was higher than the average second French formant (1500 Hz). Therefore, we can say that the tongue position was narrower and as far forward as possible in the mouth for the Tunisian dialect.

In contrast to all the other vowels studied, the [E] of the Tunisian dialect was the same as the [E] of the French language. We note an average of 436 Hz for F1 and 2120 Hz for F2.

For the vowels [a] and [E], the F1 median was nearer to Calliope's for TD than for the French language. For TD and the French language, the F2 medians are far from Calliope's but near each other. For the vowel [E], the F1 and F2 standard deviations were high and far from Calliope's for both TD and the French language. This may be explained by the fact that the tongue position was narrower and as far forward as possible in the mouth for the Tunisian dialect. The high standard deviation was related to the small size of our corpus.

We observed that Tunisian speakers pronounce vowels in the same way in both the French language and TD.

#### **7. Conclusion**

The analysis of the obtained results shows that, due to the influence of the French language on the Tunisian dialect, the vowels are, in some contexts, pronounced similarly. It would be interesting to extend the study to other vowels and to a larger corpus, and to compare it with studies of corpora of other languages such as Standard Arabic, Berber, Italian, English and Spanish.

#### **8. References**

Annabi-Elkadri, N. & Hamouda, A. (2010). Spectral analysis of vowels /a/ and /E/ in tunisian context, *2010 International Conference on Audio, Language and Image Processing*, number CFP1050D-ART in *978-1-4244-5858-5*, IEEE/IET indexed in both EI and ISTP (in Press).

Annabi-Elkadri, N. & Hamouda, A. (2011). Analyse spectrale des voyelles /a/ et /E/ dans le contexte tunisien, *Actes des IXe Rencontres des Jeunes Chercheurs en Parole RJCP*, Université Stendhal, Grenoble, pp. 1–4.

Annabi-Elkadri, N. & Hamouda, A. (2011 (in press)). Automatic Silence/ Sonorant/ Non-Sonorant Detection based on Multi-resolution Spectral Analysis and ANOVA Method, *International Workshop on Future Communication and Networking*, IEEE, Hong Kong.

Calliope (1989). *La parole et son traitement automatique*, collection technique et scientifique des télécommunications, MASSON et CENT-ENST, Paris, ISBN :2-225-81516-X, ISSN : 0221-2579.

Cancela, P., Rocamora, M. & Lopez, E. (2009). An efficient multi-resolution spectral transform for music analysis, *10th International Society for Music Information Retrieval Conference (ISMIR 2009)*, pp. 309–314.

Cheung, S. & Lim, J. (1991). Combined multi-resolution (wideband/narrowband) spectrogram, *International Conference on Acoustics, Speech, and Signal Processing, ICASSP-91*, IEEE, pp. 457–460.

Cnockaert, L. (2008). *Analysis of vocal tremor and application to parkinsonian speakers / Analyse du tremblement vocal et application à des locuteurs parkinsoniens*, PhD thesis, F512 - Faculté des sciences appliquées - Electronique.

Dressler, K. (2006). Sinusoidal extraction using an efficient implementation of a multi-resolution FFT, *Proceeding of the 9th International Conference on Digital Audio Effects (DAFx-06)*, pp. 247–252.

Elshafei, M. (1991). Toward an arabic text-to-speech system, *The Arabian J. Science and Engineering*, Vol. 4B, pp. 565–583.

Fant, G. (1969). Stops in cv syllables, *STL-QPSR*, Vol. 4, pp. 1–25.

Fu, Q. & Wan, E. A. (2003). A novel speech enhancement system based on wavelet denoising, *Center of Spoken Language Understanding, OGI School of Science and Engineering at OHSU*.

Gibson, M. (1998). *Dialect Contact in Tunisian Arabic: sociolinguistic and structural aspects*, PhD thesis, University of Reading.

Grossmann, A. & Morlet, J. (1984). Decomposition of Hardy functions into square integrable wavelets of constant shape, *SIAM Journal on Mathematical Analysis* 15(4): 723–736.

Hancq, R. B. H. B. T. D. J. & Leich, H. (2000). *Traitement de la parole*, Presses Polytechniques et Universitaires Romandes. ISBN 2-88074-388-5.

Haton, J. & al. (2006). *Reconnaissance automatique de la parole*, DUNOD.

Ladefoged, P. (1996). *Elements of Acoustic Phonetics*, The University of Chicago Press.

Lee, C. C. Y. W. T. & Ching, P. (1999). Two-dimensional multi-resolution analysis of speech signals and its application to speech recognition, *International Conference on Acoustics, Speech, and Signal Processing, ICASSP99*, Vol. 1, IEEE, pp. 405–408.

Leman, H. & Marque, C. (1998). Un algorithme rapide d'extraction d'arêtes dans le scalogramme et son utilisation dans la recherche de zones stationnaires / a fast ridge extraction algorithm from the scalogram, applied to search of stationary areas, *Traitement du Signal* 15(6): 577–581.

Mallat, S. (1989). A theory for multiresolution signal decomposition : the wavelet representation, *IEEE Transaction on Pattern Analysis and Machine Intelligence* 11: 674–693.

Mallat, S. (2000). *Une Exploration des Signaux en Ondelettes*, Editions de l'Ecole Polytechnique, Ellipses diffusion.

Mallat, S. (2008). *A wavelet Tour of Signal Processing*, 3rd edn, Academic Press.

Manikandan, S. (2006). Speech enhancement based on wavelet denoising, *Academic Open Internet Journal* 17(1311–4360).

Marcais, W. (1950). *Les parlers arabes, Initiation à la Tunisie*, Ed. Adrien Maisonneuve, Paris.


Muhammad, A. (1990). *Alaswaat Alaghawaiyah, (in Arabic)*, Daar Alfalah, Jordan.

Peterson, G. & Barney, H. (1952). Control methods used in a study of the vowels, *Journal of the Acoustical Society of America, JASA* 24: 175–184.

### **Voice Conversion**

Jani Nurminen<sup>1</sup>, Hanna Silén<sup>2</sup>, Victor Popa<sup>2</sup>, Elina Helander<sup>2</sup> and Moncef Gabbouj<sup>2</sup>

<sup>1</sup>*Accenture*
<sup>2</sup>*Tampere University of Technology*
*Finland*

#### **1. Introduction**

*Voice conversion* (VC) is an area of speech processing that deals with the conversion of the perceived speaker identity. In other words, the speech signal uttered by a first speaker, the *source* speaker, is modified to sound as if it were spoken by a second speaker, referred to as the *target* speaker. The most obvious use case for voice conversion is *text-to-speech* (TTS) synthesis, where VC techniques can be used for creating new and personalized voices in a cost-efficient manner. Other potential applications include security-related usage (e.g. hiding the identity of the speaker), vocal pathology, voice restoration, as well as games and other entertainment applications. Yet other possible applications could be speech-to-speech translation and dubbing of television programs.

Despite the increased research attention that the topic has attracted, voice conversion has remained a challenging area. One of the challenges is that the perception of the quality and the success of the identity conversion are largely subjective. Furthermore, there is no unique correct conversion result: when a speaker utters a given sentence multiple times, each repetition is different. For these reasons, time-consuming listening tests must be used in the development and evaluation of voice conversion systems. The use of listening tests can be complemented with objective quality measures that approximate the subjective rating, such as the one proposed in (Möller, 2000).

Before diving deeper into different aspects of voice conversion, it is essential to understand the factors that determine the perceived speaker identity. Speech conveys a variety of information that can be categorized, for example, into linguistic and nonlinguistic information. Linguistic information has not traditionally been considered in existing VC systems but is of high interest, for example, in the field of speech recognition. Even though some hints of speaker identity exist on the linguistic level, nonlinguistic information is more clearly linked to speaker individuality. The nonlinguistic factors affecting speaker individuality can be divided into sociological and physiological dimensions, both of which affect the acoustic speech signal. Sociological factors, such as the social class, the region of birth or residence, and the age of the speaker, mostly affect the speaking style, which is acoustically realized predominantly in prosodic features such as pitch contour, duration of words, and rhythm. The physical attributes of the speaker (e.g. the anatomy of the vocal tract), on the other hand, strongly affect the spectral content and determine the individual voice quality. Perceptually, the most important acoustic features characterizing speaker individuality include the third and the fourth formant, the fundamental frequency and the closing phase of the glottal wave, but the specific parameter importance varies from speaker to speaker and from listener to listener (Lavner et al., 2001).

The vast majority of the existing voice conversion systems deal with the conversion of spectral features, and that will also be the main focus of this chapter. However, prosodic features, such as *F*<sup>0</sup> movements and speaking rhythm, also contain important cues of identity: in (Helander & Nurminen, 2007b) it was shown that pure prosody alone can be used, to an extent, to recognize speakers that are familiar to us. Nevertheless, it is usually assumed that relatively good results can be obtained through simple statistical mean and variance scaling of *F*<sup>0</sup>, sometimes together with average speaking rate modification. More advanced prosody conversion techniques have also been proposed, for example in (Chapell & Hansen, 1998; Gillet & King, 2003; Helander & Nurminen, 2007a).
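The mean and variance scaling of *F*<sup>0</sup> mentioned above can be sketched in a few lines. The following is a minimal illustration rather than the method of any particular system: it assumes that the scaling is done in the logarithmic domain, that unvoiced frames are marked with an *F*<sup>0</sup> value of zero, and that the function names `log_f0_stats` and `convert_f0` are hypothetical.

```python
import math
import statistics


def log_f0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    return statistics.fmean(logs), statistics.pstdev(logs)


def convert_f0(f0_source, src_stats, tgt_stats):
    """Shift and scale log-F0 so that its mean and variance match the target.

    Assumption: a value of 0 marks an unvoiced frame and is passed through.
    """
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    converted = []
    for f in f0_source:
        if f <= 0:
            converted.append(0.0)  # unvoiced frame stays unvoiced
        else:
            z = (math.log(f) - mu_s) / sd_s  # normalize with source statistics
            converted.append(math.exp(mu_t + sd_t * z))  # apply target statistics
    return converted


# Toy F0 tracks in Hz (made-up numbers, for illustration only).
source_f0 = [100.0, 110.0, 0.0, 120.0, 130.0]
target_f0 = [200.0, 220.0, 240.0, 260.0]
converted_f0 = convert_f0(source_f0, log_f0_stats(source_f0), log_f0_stats(target_f0))
```

In practice the statistics would be estimated from the training utterances of each speaker, and the same two pairs of numbers would then be reused for every converted utterance.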

A typical voice conversion system is depicted in Figure 1. To convert the source features into target features, a training phase is required. During training, a conversion model is generated to capture the relationship between the source and target speech features, after which the system is able to transform new, previously unseen utterances of the source speaker. Consequently, training data from both the source and the target speaker is usually required. Training sets are typically rather small. Depending on the targeted use case, the data used for the training can be either parallel, i.e. the speakers have uttered the same sentences, or non-parallel. The former case is sometimes referred to as text-dependent and the latter as text-independent voice conversion. The most extreme case of text-independent voice conversion is cross-lingual conversion, where the source and the target speakers speak different languages that may have different phoneme sets.

In practice, the performance of a voice conversion system is rather dependent on the particular speaker pair. In the most common problem formulation illustrated in Figure 1, it is assumed that we only have data from one source and one target speaker. However, there are voice conversion approaches that can utilize speech from more than two speakers. In *hidden Markov model* (HMM) based speech synthesis, an average voice model trained from multi-speaker data can be adapted using speech data from the target speaker as shown in Figure 2. Furthermore, the use of eigenvoices (Toda et al., 2007a) is another example of an approach utilizing speech from many speakers. In the eigenvoice method, originally developed for speaker adaptation (Kuhn et al., 2000), the parameters of any speaker are formed as a linear combination of eigenvoices. Yet another unconventional approach is to build a model of only the target speaker characteristics without having the source speaker data available in the training phase (Desai et al., 2010).

Numerous different VC approaches have been proposed in the literature. One way to categorize the VC techniques is to divide them into methods used for stand-alone voice conversion and the adaptation techniques used in HMM-based speech synthesis. The former methods are discussed in Section 2 while Section 3 focuses on the latter. Speech parameterization and modification issues that are relevant for both scenarios are introduced in the next subsection. Finally, at the end of the chapter, we will provide a short discussion on the remaining challenges and possible future directions in voice conversion research.


Fig. 1. Block diagram illustrating stand-alone voice conversion. The training phase generates conversion models based on training data that in the most common scenario includes speech from both source and target speakers. In the conversion phase, the trained models can be used for converting unseen utterances of source speech.

#### **1.1 Speech parameterization and modification**

Most of the voice conversion approaches use segmental feature extraction to find a set of representative features that are then converted from source to target speakers. In principle, the features to be transformed in voice conversion can be any parameters describing the speaker-dependent factors of speech. The parameterization of the speech and the flexibility of the analysis/synthesis framework have a fundamental effect on the quality of converted speech. Hence, the parameterization should allow easy modification of the perceptually important characteristics of speech and provide high-quality waveform resynthesis.

The most popular speech representations are based on the source-filter model. In the source-filter model, the glottal airflow is represented as an excitation signal that can be thought to take the form of a pulse train for the voiced sounds and the form of a noise signal for the unvoiced sounds. A voiced excitation is characterized by a fundamental frequency or pitch that is determined by the oscillation frequency of the vocal folds. The vocal tract is seen as a resonator cavity that shapes the excitation signal in frequency, and can be understood as a filter having its resonances at formant frequencies. The use of formants as VC features would in theory be a highly attractive alternative that has been studied in (Narendranath et al., 1995; Rentzos et al., 2004) but the inherent difficulties in reliable estimation and modification of formants have prevented wider adoption, and the representations obtained by simple mathematical methods have remained the preferred solution.

Fig. 2. Block diagram of speaker adaptation in HMM-based TTS. In the training phase, HSMMs are generated using speech data from multiple speakers. Then, model adaptation is applied to obtain HSMMs for a given target speaker. The adapted HSMMs can be used in TTS synthesis for producing speech with the target voice.

The use of linear prediction, and in particular the line spectral frequency (LSF) representation has been highly popular in VC research (Arslan, 1999; Erro et al., 2010a; Nurminen et al., 2006; Tao et al., 2010; Turk & Arslan, 2006), due to its favorable interpolation properties and the close relationship to the formant structure. In addition to the linear prediction based methods, cepstrum-based parameterization has been widely used, for example in the form of Mel-frequency cepstrum coefficients (MFCCs) (Stylianou et al., 1998).

Standard linear prediction coefficients give information on the formants (peaks) but not the valleys (spectral zeros) in the spectrum whereas cepstral processing treats both peaks and valleys equally. The generalized Mel-cepstral analysis method (Tokuda et al., 1994) provides a unification that offers flexibility to balance between them. The procedure is controlled by two parameters, *α* and *γ*, where *γ* balances between the cepstral and linear prediction representations and *α* describes the frequency resolution of the spectrum. Mel-cepstral coefficients (MCCs) (*γ* = 0, *α* = 0.42 for 16 kHz speech) are a widely used representation in both VC and HMM-based speech synthesis (Desai et al., 2010; Helander et al., 2010a; Toda et al., 2007b; Tokuda et al., 2002).
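To give a feeling for how *α* controls the frequency resolution, the sketch below evaluates the phase response of a first-order all-pass function, which is the frequency warping commonly associated with Mel-cepstral analysis. The chapter does not spell out this formula, so its use here is an assumption, and the function name `allpass_warp` is hypothetical.

```python
import math


def allpass_warp(omega, alpha):
    """Warped angular frequency (radians) for an input frequency omega in [0, pi].

    Assumption: the warping is the phase response of the first-order all-pass
    function (z^-1 - alpha) / (1 - alpha * z^-1).
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Show how alpha = 0.42 stretches the low-frequency region at 16 kHz sampling.
sampling_rate = 16000
for hz in (250, 1000, 4000, 8000):
    omega = 2.0 * math.pi * hz / sampling_rate
    warped_hz = allpass_warp(omega, 0.42) / (2.0 * math.pi) * sampling_rate
    print(f"{hz:5d} Hz -> {warped_hz:7.1f} Hz on the warped axis")
```

With *α* = 0 the axis is left unchanged, while positive values allocate more resolution to low frequencies, which is the behavior that makes the representation resemble the Mel scale.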

The modification techniques based on the source-filter model use different ways to estimate and convert the excitation and vocal tract filter parameters. In mixed mode excitation (Fujimura, 1968), the level of devoicing is included typically as bandwise mean aperiodicity (BAP) of some frequency sub-bands, and the excitation signal is reconstructed as a weighted sum of voiced and unvoiced signals. An attractive alternative is to use the sinusoidal model developed by McAulay and Quatieri (McAulay & Quatieri, 1986) in which the speech or the excitation is represented as a sum of time-varying sinusoids whose amplitude, frequency and phase parameters are estimated from the short-time Fourier transform using a peak-picking algorithm. This framework lends itself to time and pitch scale modifications producing high-quality results. A variant of this approach has been successfully used in (Nurminen et al., 2006).

The STRAIGHT vocoder (Kawahara et al., 1999) is a widely used analysis/synthesis framework for both stand-alone voice conversion and HMM-based speech synthesis. It decomposes speech into a spectral envelope without periodic interferences, *F*<sup>0</sup>, and relative voice aperiodicity. The STRAIGHT-based speech parameters are further encoded, typically into MCCs or LSFs, logarithmic *F*<sup>0</sup>, and bandwise mean aperiodicities. Alternative speech parameterization schemes include the harmonic plus stochastic model (Erro et al., 2010a), glottal modeling using inverse filtering (Raitio et al., 2010), and frequency-domain two-band voicing modeling (Kim et al., 2006; Silén et al., 2009). It is also possible to operate directly on spectral domain samples (Sündermann & Ney, 2003).

Table 1 provides a summary of typical features used in voice conversion. It should be noted that any given voice conversion system utilizes only a subset of the features listed in the table. Some voice conversion systems may also operate on some other features, not listed in Table 1.

#### **2. Stand-alone voice conversion**

The first step in the training of a stand-alone voice conversion system is data alignment. To be able to model the differences between the source and target speakers, the relationship needs to be captured using similar data from both speakers. While it is intuitively clear that proper alignment is needed for building high-quality models, the study presented in (Helander et al., 2008) demonstrated that simple frame-level alignment using *dynamic time warping* (DTW) offers sufficient accuracy when the training data is parallel. More detailed discussion, especially covering more difficult use cases, is considered to be outside the scope of this chapter but it should be noted that relevant studies have been published in the literature: for example, text-independent voice conversion is discussed in (Tao et al., 2010) and cross-lingual conversion in (Sündermann et al., 2006). In the strict sense, the alignment step may also be omitted through model adaptation techniques which can, for instance, adapt an already trained conversion model.
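As an illustration of frame-level alignment for parallel data, the following is a minimal sketch of dynamic time warping. It assumes that each utterance is given as a list of feature vectors (for example MFCC frames), uses the Euclidean distance as the local cost, and allows only the three basic step directions. Practical systems may add path constraints or other local distances, and the name `dtw_align` is hypothetical.

```python
def dtw_align(source, target):
    """Return a list of (source_index, target_index) pairs on the optimal path."""

    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] is the cheapest accumulated cost of aligning the first i
    # source frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both sequences
                                 cost[i - 1][j],      # repeat the target frame
                                 cost[i][j - 1])      # repeat the source frame

    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return path


# One-dimensional toy "features": the target holds the middle sound longer.
pairs = dtw_align([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

The resulting index pairs can be used to build the aligned source and target vectors on which the conversion model is trained.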


| Feature | Notes |
|---|---|
| LSFs | Offer stability, good interpolation properties, and close relationship to formants. Model spectral peaks. |
| MFCCs | Model both spectral peaks and valleys. Reliable for measuring acoustic distances and thus useful especially for alignment. |
| MCCs | Perhaps the most widely used features for representing spectra both in stand-alone conversion and in HMM based synthesis. Benefits e.g. in alignment very similar to those of MFCCs. |
| Formants | Formant bandwidths, locations and intensities would be highly useful features in VC but reliable estimation is extremely challenging. |
| Spectral samples | Spectral domain samples can also be used as VC features. Typically used in warping based conversion. |
| *F*<sup>0</sup> | *F*<sup>0</sup> or log *F*<sup>0</sup> are typically mean-shifted and scaled to the values of the target speaker. |
| Voicing | At least binary voicing or aperiodicity information is typically used. More refined voicing information may also be employed. |
| Excitation spectra | Sometimes details of the excitation spectra need to be modeled as well, for example when using sinusoidal modeling. |

Table 1. Examples of speech features commonly used in voice conversion.

#### **2.1 Basic approaches**

The most popular voice conversion approach in the literature has been *Gaussian mixture model* (GMM) based conversion (Kain & Macon, 1998; Stylianou et al., 1998). The data is modeled using a GMM and converted by a function that is a weighted sum of local regression functions. A GMM can be trained to model the density of source features only (Stylianou et al., 1998) or the joint density of both source and target features (Kain & Macon, 1998). Here we review the approach based on a joint density GMM (Kain & Macon, 1998).

First, let us assume that we have aligned source and target vectors **z** = [**x**<sup>*T*</sup>, **y**<sup>*T*</sup>]<sup>*T*</sup> that can be used to train a conversion model. Here, **x** and **y** correspond to the source and target feature vectors, respectively. In the training, the aligned data **z** is used to estimate the GMM parameters (α, μ, **Σ**) of the joint distribution *p*(**x**, **y**) (Kain & Macon, 1998). This is accomplished iteratively through the well-known Expectation Maximization (EM) algorithm (Dempster et al., 1977).

The conditional probability of the converted vector **y** given the input vector **x** and the *m*th Gaussian component is a Gaussian distribution characterized by the mean **E**<sub>*m*</sub><sup>(*y*)</sup> and the covariance **D**<sub>*m*</sub><sup>(*y*)</sup>:

$$\begin{aligned} \mathbf{E}\_{m}^{(y)} &= \mu\_{m}^{(y)} + \Sigma\_{m}^{(yx)} \left(\Sigma\_{m}^{(xx)}\right)^{-1} \left(\mathbf{x} - \mu\_{m}^{(x)}\right) \\ \mathbf{D}\_{m}^{(y)} &= \Sigma\_{m}^{(yy)} - \Sigma\_{m}^{(yx)} \left(\Sigma\_{m}^{(xx)}\right)^{-1} \Sigma\_{m}^{(xy)} \end{aligned} \tag{1}$$

and the minimum mean square error (MMSE) solution for the converted target **ŷ** is:

$$\hat{\mathbf{y}} = \sum\_{m=1}^{M} \omega\_{m} \mathbf{E}\_{m}^{(y)} = \sum\_{m=1}^{M} \omega\_{m} \left[ \mu\_{m}^{(y)} + \Sigma\_{m}^{(yx)} \left( \Sigma\_{m}^{(xx)} \right)^{-1} \left( \mathbf{x} - \mu\_{m}^{(x)} \right) \right]. \tag{2}$$

Here *ω*<sub>*m*</sub> denotes the posterior probability of the observation **x** for the *m*th Gaussian component:

$$
\omega\_m = \frac{\alpha\_m \mathcal{N}\left(\mathbf{x}; \mu\_m^{(x)}, \Sigma\_m^{(xx)}\right)}{\sum\_{j=1}^M \alpha\_j \mathcal{N}\left(\mathbf{x}; \mu\_j^{(x)}, \Sigma\_j^{(xx)}\right)},\tag{3}
$$

and the mean μ<sub>*m*</sub> and covariance **Σ**<sub>*m*</sub> of the *m*th Gaussian distribution are defined as:

$$\mu\_{m} = \begin{bmatrix} \mu\_{m}^{(x)} \\ \mu\_{m}^{(y)} \end{bmatrix}, \quad \Sigma\_{m} = \begin{bmatrix} \Sigma\_{m}^{(xx)} & \Sigma\_{m}^{(xy)} \\ \Sigma\_{m}^{(yx)} & \Sigma\_{m}^{(yy)} \end{bmatrix}. \tag{4}$$

The use of GMMs in voice conversion has been extremely popular. In the next subsection, we will discuss some shortcomings of this method and possible solutions for overcoming the main weaknesses.
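The conversion function of Equations (1)-(3) can be sketched as follows. To keep the example free of matrix libraries, it assumes that the blocks **Σ**<sup>(*xx*)</sup> and **Σ**<sup>(*yx*)</sup> are diagonal and that source and target vectors have the same dimension; this is a simplification made here, not a requirement of the method. The dictionary keys and function names are hypothetical, and the model parameters are assumed to have been trained beforehand with EM.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (variances in var)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
               for xi, mi, v in zip(x, mean, var))


def gmm_convert(x, components):
    """MMSE conversion of one source vector x, following Eqs. (1)-(3).

    Each component is a dict with the (hypothetical) keys
      'alpha'            mixture weight,
      'mu_x', 'mu_y'     source and target mean vectors,
      'var_xx', 'cov_yx' diagonals of Sigma^(xx) and Sigma^(yx).
    """
    # Posterior weights omega_m of Eq. (3), computed in the log domain and
    # shifted by the maximum for numerical stability.
    log_w = [math.log(c['alpha']) + log_gauss_diag(x, c['mu_x'], c['var_xx'])
             for c in components]
    peak = max(log_w)
    w = [math.exp(lw - peak) for lw in log_w]
    total = sum(w)
    omega = [wi / total for wi in w]

    # Weighted sum of the component-wise regressions E_m^(y), Eqs. (1)-(2).
    y_hat = [0.0] * len(x)
    for om, c in zip(omega, components):
        for d in range(len(x)):
            slope = c['cov_yx'][d] / c['var_xx'][d]
            y_hat[d] += om * (c['mu_y'][d] + slope * (x[d] - c['mu_x'][d]))
    return y_hat
```

Because every frame is converted as a posterior-weighted average of linear regressions, the mapping is continuous in **x**, in contrast to the hard decisions made in codebook mapping.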

Another basic voice conversion technique is codebook mapping (Abe et al., 1988). The simplest way to realize codebook based mapping would be to train a codebook of combined feature vectors **z**. Then, during conversion, the source side of the vectors could be used for finding the closest codebook entry, and the target side of the selected entry could be used as the converted vector. The classical paper on codebook based conversion (Abe et al., 1988) proposes a slightly different approach that can utilize existing vector quantizers. There the training phase involves generating histograms of the vector correspondences between the quantized and aligned source and target vectors. These histograms are then used as weighting functions for generating a linear combination based mapping codebook. Regardless of the details of the implementation, codebook based mapping offers a very simple and straightforward approach that can capture the speaker identity quite well, but the result suffers from frame-to-frame discontinuities and poor prediction capability on new data. Some enhancements to the basic codebook based methods are presented in Section 2.3.
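The simplest variant described above, in which the source half of each joint codebook entry is used for the search and the target half is returned, can be sketched as below. The codebook is assumed to be given as a list of (source vector, target vector) pairs, for instance obtained by clustering the aligned vectors **z**; the training step itself is not shown, and the name `convert_with_codebook` is hypothetical. This is not the histogram-based scheme of (Abe et al., 1988).

```python
def convert_with_codebook(x, codebook):
    """Return the target half of the entry whose source half is closest to x."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best_entry = min(codebook, key=lambda entry: sq_dist(x, entry[0]))
    return best_entry[1]


# Toy codebook with two entries and one-dimensional features.
toy_codebook = [([0.0], [10.0]), ([1.0], [20.0])]
frame_a = convert_with_codebook([0.4], toy_codebook)
frame_b = convert_with_codebook([0.6], toy_codebook)
```

The two toy inputs are close to each other but fall on different sides of the decision boundary and therefore receive different codebook entries, which illustrates where the frame-to-frame discontinuities come from.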

Finally, we consider frequency warping as the third basic approach to voice conversion. In this method, a warping function is established between the source and target spectra. In the simplest case, the warping function can be formed based on spectra representing a single voiced frame (Shuang et al., 2006). Then, during the actual conversion, the frequency warping function is directly applied to the spectral envelope. The frequency warping methods can at best obtain very high speech quality but have limitations regarding the success of identity conversion, due to problems in preserving the shape of modified spectral peaks and controlling the bandwidths of close formants. Proper control of the formant amplitudes is also challenging. Furthermore, the use of only a single warping function can be considered a weakness. To overcome this, proposals have been made to utilize several warping functions (Erro et al., 2010b) but the above-mentioned fundamental problems remain largely unsolved.

#### **2.2 Problems and improvements in GMM-based conversion**

GMM-based voice conversion has been a dominant technique in VC despite its problems. In this section, we review some of the problems and solutions proposed to overcome them.

The control of model complexity is a crucial issue when learning a model from data. There is a trade-off between two objectives: model fidelity and the generalization capability of the model for unseen data. This trade-off problem, also referred to as the bias-variance dilemma (Geman et al., 1992), is common to all model fitting tasks. In essence, simple models are subject to oversmoothing, whereas the use of complex models may result in overfitting and thus in poor prediction ability on new data. In addition to oversmoothing and overfitting, a major problem in conventional GMM-based conversion, as well as in many codebook based algorithms, is the time-independent mapping of features that ignores the inherent temporal correlation of speech features.

#### **2.2.1 Overfitting**

In GMM-based VC, overfitting can be caused by two factors: first, the GMM may be overfitted to the training set as demonstrated in Figure 3. Second, when a mapping function is estimated, it may also become overfitted.

In particular, a GMM with full covariance matrices is difficult to estimate and is subject to overfitting (Mesbashi et al., 2007). With unconstrained (full) covariance matrices, the number of free parameters grows quadratically with the input dimensionality. Considering for example 24-dimensional source and target feature vectors and a joint-density GMM model with 16 mixture components and full covariance matrices, 18816 variance and covariance terms, i.e. ((2×24)×(2×24)/2 + 24)×16, are to be estimated. One solution is to use diagonal covariance matrices **Σ***xx*, **Σ***xy*, **Σ***yx*, **Σ***yy* with an increased number of components. In the joint-density GMM, this results in converting each feature dimension separately. In reality, however, the *p*th spectral descriptor of the source may not be directly related to the *p*th spectral descriptor of the target, making this approach inaccurate.
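The parameter count quoted above is easy to reproduce. The short sketch below counts the unique entries of the symmetric joint covariance matrix per component; the function name and the `diagonal_blocks` option are our own, and the diagonal-block count (three diagonals per component, since the *yx* block mirrors the *xy* block) is our own bookkeeping rather than a figure from the text.

```python
def covariance_terms(dim_source, dim_target, n_components, diagonal_blocks=False):
    """Count the free variance/covariance terms of a joint-density GMM."""
    d = dim_source + dim_target
    if diagonal_blocks:
        if dim_source != dim_target:
            raise ValueError("diagonal cross-covariance blocks need equal dimensions")
        # Diagonals of the xx, yy and xy blocks; the yx block mirrors xy.
        per_component = 3 * dim_source
    else:
        # Unique entries of a symmetric d x d matrix: d*d/2 + d/2.
        per_component = d * (d + 1) // 2
    return per_component * n_components

print(covariance_terms(24, 24, 16))                        # 18816
print(covariance_terms(24, 24, 16, diagonal_blocks=True))  # 1152
```

The quadratic term dominates quickly: doubling the feature dimensionality roughly quadruples the number of terms that have to be estimated from the same amount of training data.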

Overfitting of the mapping function can be avoided by applying partial least squares (PLS) for regression estimation (Helander et al., 2010a); a source GMM (usually with diagonal covariance matrices) is trained and a mapping function is then estimated using partial least squares regression between source features weighted by posterior probability for each Gaussian and the original target features.

#### **2.2.2 Oversmoothing**

Oversmoothing occurs in both the frequency and the time domain. In the frequency domain, this results in losing fine details of the spectrum and in broadening of the formants. In speech coding, it is common to use post-filtering to emphasize the formants (Kondoz, 2004) and similarly post-filtering can also be used to improve the quality of the speech in voice conversion. It has also been found that combining the frequency warped source spectrum with the GMM-based converted spectrum reduces the effect of oversmoothing by retaining more spectral details (Toda et al., 2001).

Fig. 3. Example of overfitting. Increasing the number of Gaussians reduces the distortion for the training data but not necessarily for a separate test set because the model might be overfitted to the training set.

Fig. 4. Example of oversmoothing. Linear transformation of spectral features is not able to retain all the details and causes oversmoothing. The conversion result (black line) is achieved using linear multivariate regression to convert the source speaker's MCCs (dashed gray line) to match with the target speaker's MCCs (solid gray line).

In the time domain, the converted feature trajectory has much less variation than the original target feature trajectory. This phenomenon is illustrated in Figure 4. According to Chen et al. (2003), oversmoothing occurs because the term $\boldsymbol{\Sigma}\_m^{(yx)} \left(\boldsymbol{\Sigma}\_m^{(xx)}\right)^{-1}$ (Equation 1) becomes close to zero and thus the converted target becomes only a weighted sum of means of GMM components as

$$\hat{\mathbf{y}} = \sum\_{m=1}^{M} \omega\_m \mu\_m^{(y)}.\tag{5}$$


To avoid the problem, the source GMM can be built from a larger data set and only the means are adapted using maximum a posteriori estimation (Chen et al., 2003). Thus, the converted target becomes:

$$\hat{\mathbf{y}} = \mathbf{x} + \sum\_{m=1}^{M} \omega\_m \left( \mu\_m^{(y)} - \mu\_m^{(x)} \right). \tag{6}$$
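To make the difference between Equations 5 and 6 concrete, the following sketch evaluates both for a single frame. It assumes diagonal source covariances when computing the posterior weights of Equation 3, which is a simplification on our part; all function and argument names are illustrative.

```python
import numpy as np

def posteriors(x, alphas, mu_x, var_x):
    """Posterior weights of Eq. (3), assuming diagonal source covariances."""
    # Log-density of x under every component, computed in the log domain.
    log_pdf = -0.5 * (np.sum(np.log(2.0 * np.pi * var_x), axis=1)
                      + np.sum((x - mu_x) ** 2 / var_x, axis=1))
    log_w = np.log(alphas) + log_pdf
    log_w -= log_w.max()  # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def convert_eq5(x, alphas, mu_x, var_x, mu_y):
    """Eq. (5): weighted sum of the target means only."""
    return posteriors(x, alphas, mu_x, var_x) @ mu_y

def convert_eq6(x, alphas, mu_x, var_x, mu_y):
    """Eq. (6): source frame shifted by the weighted mean difference."""
    return x + posteriors(x, alphas, mu_x, var_x) @ (mu_y - mu_x)
```

With Equation 5, all frames dominated by the same component collapse onto nearly the same output vector, whereas Equation 6 carries the frame-level detail of **x** through to the converted vector.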

Global variance can be used to compensate for the reduced variance of the converted speech feature sequence with feature trajectory estimation (Toda et al., 2007b). Alternatively, the global variance can be accounted for already in the estimation of the conversion function; this degrades the objective performance but improves the subjective quality (Benisty & Malah, 2011).

#### **2.2.3 Time-independent mapping**

The conventional GMM-based method converts each frame regardless of other frames and thus ignores the temporal correlation between consecutive frames. This can lead to discontinuities in feature trajectories and thus degrade perceptual speech quality. It has been shown that there is usually only a single mixture component that dominates in each frame in GMM-based VC approaches (Helander et al., 2010a). This shifts the conventional GMM-based approach from soft acoustic classification towards hard classification, making it susceptible to discontinuities in the same way as codebook based methods.

A solution to the time-independency problem of GMM-based conversion was proposed in (Toda et al., 2007b) through the introduction of maximum likelihood (ML) estimation of the spectral parameter trajectory. Static source and target feature vectors are extended with first-order deltas, i.e. $\mathbf{z} = [\mathbf{x}^T, \Delta\mathbf{x}^T, \mathbf{y}^T, \Delta\mathbf{y}^T]^T$, and a joint-density GMM is estimated. In synthesis, both the converted mean and covariance matrices (Equation 1) are used to generate the target trajectory. The trajectory estimation is similar to HMM-based speech synthesis described in Section 3.1.1. A recent approach (Helander et al., 2010b) bears some similarity to (Toda et al., 2007b) by using the relationship between the static and dynamic features to obtain the optimal speech sequence but does not use the transformed mean and (co)variance from the GMM-based conversion. To obtain a smooth feature trajectory, the converted features can be low-pass filtered after conducting the GMM-based transformation (Chen et al., 2003) or the GMM posterior probabilities can be smoothed before making the conversion (Helander et al., 2010a). Instead of frame-wise transformation of the source spectral features, in (Nguyen & Akagi, 2008) each phoneme was modeled to consist of event targets and these event targets were used as conversion features.
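As a small illustration of the feature extension used in trajectory-based conversion, the sketch below appends first-order deltas to aligned source and target sequences and stacks them into joint vectors. The exact delta window is not specified in the text; the symmetric difference 0.5·(x[t+1] − x[t−1]) with repeated edge frames is an assumption on our part.

```python
import numpy as np

def add_deltas(feats):
    """Append first-order deltas to a (T, D) feature sequence."""
    # Repeat the first and last frames so every frame has two neighbours.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    return np.hstack([feats, delta])

def joint_vectors(x, y):
    """Stack aligned source and target sequences as [x, dx, y, dy] per frame."""
    return np.hstack([add_deltas(x), add_deltas(y)])
```

A joint-density GMM trained on these vectors models static and dynamic features together, which is what allows the trajectory estimation step to trade off frame-wise accuracy against smoothness.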

#### **2.3 Advanced codebook-based methods**

The basic codebook mapping (Abe et al., 1988) introduced in Section 2.1 is affected by several important limitations. A fundamental problem of codebook mapping is the discrete representation of the acoustic spaces as a limited set of spectral envelopes. Another severe problem is caused by the frame-based operation which ignores the relationships between neighboring frames or any information related to the temporal evolution of the parameters. These problems produce spectral discontinuities and lead to a degraded quality of the converted speech. In terms of spectral mapping, though, the codebook has the attractive property of preserving the details that appear in the training data.

The above issues have been addressed in a number of articles and several methods have been proposed to improve the spectral continuity of the codebook mapping. A selection of methods will be presented in this section, including weighted linear combination of codewords (Arslan, 1999), hierarchical codebook mapping (Wang et al., 2005), local linear transformation (Popa et al., 2012) and trellis structured vector quantization (Eslami et al., 2011). It is worth mentioning that these algorithms have their own limitations.

#### **2.3.1 Weighted linear combination of codewords**

Weighted linear combination of codewords (Arslan, 1999) addresses the problem of discrete representation of the acoustic space by utilizing a weighted sum of codewords in order to cover well the acoustic space of the target speaker. Phoneme centroids are computed for both the source and the target speaker, forming two codebooks of spectral vectors with one-to-one correspondence.

In order to convert a source vector, a set of weights is determined depending on a similarity measure between the source vector and the set of centroids in the source codebook. The conversion is realized by using the weights to linearly combine the corresponding centroids in the target codebook. While improving the continuity with respect to the basic codebook approach, this method causes severe oversmoothing by summing over a wide range of different spectral envelopes.
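The conversion step can be sketched in a few lines. The text does not fix the similarity measure, so the exponential weighting of Euclidean distances and the sharpness parameter `gamma` used here are assumptions for illustration only.

```python
import numpy as np

def weighted_codeword_conversion(x, src_centroids, tgt_centroids, gamma=1.0):
    """Convert x as a weighted sum of target centroids.

    The weights come from distances to the paired source centroids.
    """
    dist = np.linalg.norm(src_centroids - x, axis=1)
    # Subtracting the minimum distance avoids underflow for large gamma.
    w = np.exp(-gamma * (dist - dist.min()))
    w /= w.sum()
    return w @ tgt_centroids, w
```

A small `gamma` spreads the weights over many centroids and averages away spectral detail, which is the oversmoothing noted above; a very large `gamma` approaches the basic nearest-codeword mapping and its discontinuities.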

#### **2.3.2 Hierarchical codebook mapping**

Hierarchical codebook mapping (Wang et al., 2005) aims to improve the precision of the spectral conversion by estimating and adding a residual term to the typical codeword mapping. In addition to the mapping codebook between the source vectors **x** and the target vectors **y**, a new codebook is trained from the same source vectors **x** and the corresponding conversion residuals **ε** = **y** − **ŷ**. The residuals represent the differences between a real target vector **y** aligned to **x** and the conversion of **x** through the first codebook, **ŷ**. In conversion, both codebooks are used: the first for predicting a target codeword **ŷ** and the second for finding the corresponding residual **ε**. The final result of the conversion is obtained by summing the outputs of the two codebooks, i.e. **ŷ**′ = **ŷ** + **ε**. Although hierarchical codebook mapping improves the precision to some extent compared to the basic codebook based conversion, this approach essentially only produces a finer representation of the acoustic space and is otherwise likely to inherit the fundamental problems of the basic codebook mapping.
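A minimal sketch of the two-stage idea is given below. It assumes that both codebooks are indexed by nearest-neighbor search on the source side and that each residual entry is the mean residual of the training frames falling into its cell; these details, and all names, are our own simplifications rather than the exact procedure of Wang et al. (2005).

```python
import numpy as np

def nearest(codewords, x):
    """Index of the codeword closest to x in Euclidean distance."""
    return int(np.linalg.norm(codewords - x, axis=1).argmin())

def train_residual_codebook(x_train, y_train, src_cb, tgt_cb, res_src_cb):
    """Average the first-stage residual y - y_hat within each cell of res_src_cb."""
    residuals = np.zeros((len(res_src_cb), y_train.shape[1]))
    counts = np.zeros(len(res_src_cb))
    for x, y in zip(x_train, y_train):
        y_hat = tgt_cb[nearest(src_cb, x)]  # first-stage conversion
        k = nearest(res_src_cb, x)          # cell of the residual codebook
        residuals[k] += y - y_hat
        counts[k] += 1
    nonempty = counts > 0
    residuals[nonempty] /= counts[nonempty, None]
    return residuals

def convert(x, src_cb, tgt_cb, res_src_cb, residuals):
    """Sum the outputs of the mapping codebook and the residual codebook."""
    y_hat = tgt_cb[nearest(src_cb, x)]
    eps = residuals[nearest(res_src_cb, x)]
    return y_hat + eps
```

The second codebook only refines the quantization grid, so the output is still chosen from a finite set of vectors and is still computed independently for every frame.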

#### **2.3.3 Local linear transformation**

Methods based on linear transformations such as GMM typically compute a number of linear transformations corresponding to different acoustic classes and use a linear combination of these transformations to convert a given spectral vector. As discussed in Section 2.2.2, this effectively causes the problem of oversmoothing characterized by smoothed spectra and parameter tracks. The local linear transformation approach reduces the oversmoothing by operating with neighboring acoustic vectors that share similar properties (Popa et al., 2012). Linear regression models are estimated from neighborhoods of source-target codeword pairs with similar acoustic properties. Each spectral vector is converted with an individual linear transformation determined in the least squares sense from a subset of nearby codewords.

In order to convert a source spectral vector **x**, the first step is to select a set of nearest codewords in the source speaker's codebook. Assuming a one-to-one mapping between the codebooks of the source and target speakers, we can estimate in the second step, in the least squares sense, a linear transformation $\boldsymbol{\beta}\_0$ between the selected source and target codewords. The result of the linear transformation $\mathbf{y}\_0^T = \mathbf{x}^T \boldsymbol{\beta}\_0$ is used next to refine the selection of source-target codeword pairs by replacing the old set with the joint codewords nearest to $[\mathbf{x}^T, \mathbf{y}\_0^T]^T$. A new linear transformation $\boldsymbol{\beta}\_1$ is estimated from the newly selected neighborhood, leading to an updated conversion result $\mathbf{y}\_1^T = \mathbf{x}^T \boldsymbol{\beta}\_1$. The iteration of the last neighborhood selection and the linear transformation estimation steps was found to be pseudo-convergent. It was also found beneficial to estimate band diagonal matrices $\boldsymbol{\beta}\_i$ instead of full ones. An entire sequence of spectral vectors is converted by repeating the above procedure for each vector.
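The iteration described above can be sketched as follows. This is a simplified illustration under our own assumptions: full (not band diagonal) transformation matrices, Euclidean neighborhoods, a fixed number of refinement passes, and hypothetical parameter names `n_neighbors` and `n_refinements`.

```python
import numpy as np

def local_linear_convert(x, src_cb, tgt_cb, n_neighbors=8, n_refinements=2):
    """Convert one source vector with a locally estimated linear transformation."""
    joint_cb = np.hstack([src_cb, tgt_cb])
    # Step 1: nearest codewords in the source codebook.
    idx = np.argsort(np.linalg.norm(src_cb - x, axis=1))[:n_neighbors]
    y = None
    for _ in range(n_refinements + 1):
        # Step 2: least-squares fit of beta so that src @ beta approximates tgt.
        beta, *_ = np.linalg.lstsq(src_cb[idx], tgt_cb[idx], rcond=None)
        y = x @ beta
        # Step 3: re-select the neighbourhood among the joint codewords.
        joint = np.concatenate([x, y])
        idx = np.argsort(np.linalg.norm(joint_cb - joint, axis=1))[:n_neighbors]
    return y
```

Each call solves several small least-squares problems per frame, which illustrates why the conversion-time cost of this method is higher than that of a plain codebook lookup.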

The main idea of this method is in line with Wang et al. (2004), who proposed a phoneme-tied weighting scheme which splits the codebook into groups by phoneme types. At the same time, the discontinuities typical of the basic codebook mapping are alleviated due to the overlapping of neighborhoods from consecutive frames. The conversion-time computation is somewhat intensive and can be regarded as a drawback.

#### **2.3.4 Trellis structured vector quantization**

Trellis structured vector quantization (Eslami et al., 2011) tackles the problem of discontinuities common for many codebook-based conversion approaches. The method operates with blocks of consecutive frames to obtain dynamic information and uses a trellis structure and dynamic programming to optimize a codeword path based on this dynamic information.

Parallel training speech quantized in the form of codeword sequences is aligned and source-target codeword pairs are formed. Preceding codewords in the source and target sequences are combined with each pair, forming blocks of consecutive codewords which reflect the speech dynamics. The conversion of a source speech sequence requires the construction of an equally long trellis structure whose lines correspond to the codewords of the target codebook. The nodes in the trellis structure are assigned an initial cost and a maximum number of so-called survivor paths, or valid preceding target codewords. The initial cost is based on the similarities between the consecutive frames from the input sequence and memorized blocks of consecutive source codewords, while the survivor paths are selected based on memorized blocks of consecutive target codewords. The survivor paths are also associated with a transition cost based on Euclidean distance. Dynamic programming is used to find the optimal path in the trellis structure, resulting in a converted sequence of target codewords.
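The dynamic programming step can be illustrated with a generic minimum-cost path search over a trellis. The sketch below is deliberately simplified and is not the full algorithm of Eslami et al. (2011): the initial node costs are taken as given, every target codeword is allowed as a predecessor (no survivor-path pruning), and the weighting factor `trans_weight` is our own addition.

```python
import numpy as np

def best_codeword_path(node_cost, tgt_cb, trans_weight=1.0):
    """Minimum-cost path through a (T, K) trellis of target codewords."""
    T, K = node_cost.shape
    # Transition cost: Euclidean distance between target codewords.
    diff = tgt_cb[:, None, :] - tgt_cb[None, :, :]
    trans = trans_weight * np.linalg.norm(diff, axis=2)
    total = node_cost[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = total[:, None] + trans  # cand[i, j]: reach codeword j from i
        back[t] = cand.argmin(axis=0)
        total = cand.min(axis=0) + node_cost[t]
    # Trace the best path backwards from the cheapest final node.
    path = [int(total.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Setting `trans_weight` to zero reduces the search to independent frame-wise decisions, while a positive weight penalizes jumps between distant codewords in consecutive frames.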

12 Will-be-set-by-IN-TECH

effectively causes the problem of oversmoothing characterized by smoothed spectra and parameter tracks. The local linear transformation approach reduces the oversmoothing by operating with neighboring acoustic vectors that share similar properties (Popa et al., 2012). Linear regression models are estimated from neighborhoods of source-target codeword pairs with similar acoustic properties. Each spectral vector is converted with an individual linear transformation determined in the least squares sense from a subset of nearby codewords.

In order to convert a source spectral vector **x**, the first step is to select a set of nearest codewords in the source speaker's codebook. Assuming a one-to-one mapping between the codebooks of the source and target speakers, we can estimate in the second step, in the least squares sense, a linear transformation β<sup>0</sup> between the selected source and target codewords.

source-target codeword pairs by replacing the old set with the joint codewords nearest to

The result of the linear transformation **y**<sub>0</sub><sup>*T*</sup> = **x**<sup>*T*</sup>β<sub>0</sub> is used next to refine the selection of the neighborhood, now based on the joint vector [**x**<sup>*T*</sup>, **y**<sub>0</sub><sup>*T*</sup>]<sup>*T*</sup>. A new linear transformation β<sub>1</sub> is estimated from the newly selected neighborhood, leading to an updated conversion result **y**<sub>1</sub><sup>*T*</sup> = **x**<sup>*T*</sup>β<sub>1</sub>. The iteration of the last neighborhood selection and the linear transformation estimation steps was found to be pseudo-convergent. It was also found beneficial to estimate band diagonal matrices β<sub>*i*</sub> instead of full ones. An entire sequence of spectral vectors is converted by repeating the above procedure for each vector.

The main idea of this method is in line with (Wang et al., 2004), which proposed a phoneme-tied weighting scheme that splits the codebook into groups by phoneme type. At the same time, the discontinuities typical of the basic codebook mapping are alleviated by the overlapping of neighborhoods from consecutive frames. The conversion-time computation is somewhat intensive and can be regarded as a drawback.

#### **2.3.4 Trellis structured vector quantization**

Trellis structured vector quantization (Eslami et al., 2011) tackles the problem of discontinuities common to many codebook-based conversion approaches. The method operates with blocks of consecutive frames to obtain dynamic information and uses a trellis structure and dynamic programming to optimize a codeword path based on this dynamic information.

Parallel training speech quantized in the form of codeword sequences is aligned and source-target codeword pairs are formed. Preceding codewords in the source and target sequences are combined with each pair, forming blocks of consecutive codewords which reflect the speech dynamics. The conversion of a source speech sequence requires the construction of an equally long trellis structure whose lines correspond to the codewords of the target codebook. The nodes in the trellis structure are assigned an initial cost and a maximum number of so-called survivor paths, or valid preceding target codewords. The initial cost is based on the similarities between the consecutive frames from the input sequence and memorized blocks of consecutive source codewords, while the survivor paths are selected based on memorized blocks of consecutive target codewords. The survivor paths are also associated with a transition cost based on Euclidean distance. Dynamic programming is used to find the optimal path in the trellis structure, resulting in a converted sequence of target codewords.
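The dynamic programming step can be illustrated with a generic minimum-cost path search over a trellis. The following is only a minimal sketch and not the implementation of (Eslami et al., 2011): the function name `best_codeword_path`, the toy `codebook` and the `node_costs` values are hypothetical, the initial costs are taken as given, and the survivor-path pruning is omitted so that every target codeword is allowed as a predecessor.

```python
import math

def best_codeword_path(node_costs, codebook):
    """Return the minimum-cost sequence of target codeword indices.

    node_costs[t][k] is the (assumed, precomputed) initial cost of choosing
    target codeword k at frame t; the transition cost between consecutive
    codewords is their Euclidean distance.
    """
    num_frames, num_codewords = len(node_costs), len(codebook)
    cost = list(node_costs[0])   # best accumulated cost ending in each codeword
    backpointers = []
    for t in range(1, num_frames):
        new_cost = [math.inf] * num_codewords
        pointer = [0] * num_codewords
        for k in range(num_codewords):
            for j in range(num_codewords):   # no survivor-path pruning here
                candidate = cost[j] + math.dist(codebook[j], codebook[k])
                if candidate < new_cost[k]:
                    new_cost[k], pointer[k] = candidate, j
            new_cost[k] += node_costs[t][k]
        cost = new_cost
        backpointers.append(pointer)
    # Trace the optimal path back from the cheapest final node.
    k = min(range(num_codewords), key=cost.__getitem__)
    path = [k]
    for pointer in reversed(backpointers):
        k = pointer[k]
        path.append(k)
    path.reverse()
    return path

# Toy example: a frame-by-frame choice would pick codeword 2 at the middle
# frame, but the transition cost makes the smoother path cheaper overall.
codebook = [(0.0,), (1.0,), (10.0,)]
node_costs = [[0, 2, 9], [9, 1, 0], [0, 2, 9]]
path = best_codeword_path(node_costs, codebook)
```

In this toy case the frame-wise minimum of the initial costs alone would select codewords 0, 2, 0, whereas the path search trades a slightly higher node cost for much smaller jumps between consecutive codewords.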

The method proposes a rigorous way to handle spectral continuity by utilizing dynamic information while retaining the good preservation of spectral details provided by the codebook framework. The approach was shown to clearly outperform the basic GMM and codebook-based techniques, which are known to suffer from oversmoothing and discontinuities, respectively.

#### **2.4 Bilinear models**

The bilinear approach reformulates the spectral envelope representation from e.g. line spectral frequencies to a two-factor parameterization corresponding to speaker identity and phonetic information. The spectral vector **y***sc*, uttered by speaker *s* and corresponding to the phonetic content class *c*, is represented as a product of a speaker-dependent matrix **A***<sup>s</sup>* and a phonetic content vector **b***<sup>c</sup>* using the asymmetric bilinear model (Popa et al., 2011):

$$\mathbf{y}^{\text{sc}} = \mathbf{A}^{\text{s}} \mathbf{b}^{\text{c}}.\tag{7}$$

If the training set contains an equal number of spectral vectors for each speaker and in each content class, a closed form procedure exists for fitting the asymmetric model using *singular value decomposition* (SVD) (Tenenbaum & Freeman, 2000).

As discussed in the introduction, the usual problem formulation of voice conversion can be extended by considering the case of generating speech with a target voice, using parallel speech data from multiple source speakers. The alignment of the training data (*S* source speakers and one target speaker) is a prerequisite step for model estimation and is usually handled using DTW. On the other hand, the alignment of the test data (*S* utterances of the source speakers) is also required if *S >* 1.

A so-called *complete* data set is formed by concatenating the aligned training and test data of the *S* source speakers. Considering each aligned *S*-tuple a separate class of phonetic content, an asymmetric bilinear model is fit to the *complete* data following the closed-form SVD procedure described in (Tenenbaum & Freeman, 2000). With the *complete* data arranged as a stacked matrix **Y**:

$$\mathbf{Y} = \begin{bmatrix} \mathbf{y}^{11} \dots \mathbf{y}^{1C} \\ \dots \dots \dots \dots \\ \mathbf{y}^{S1} \dots \mathbf{y}^{SC} \end{bmatrix}, \tag{8}$$

where *C* denotes the total number of aligned frames in the *complete* data, the equations of the asymmetric bilinear model can be rewritten as:

$$\mathbf{Y} = \mathbf{A}\mathbf{B},\tag{9}$$

where

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}^{1} \\ \vdots \\ \mathbf{A}^{S} \end{bmatrix} \quad \text{and} \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}^{1} \dots \mathbf{b}^{C} \end{bmatrix}.$$

The SVD of the *complete* data **Y** = **UZV***<sup>T</sup>* is used to determine the parameters **A** as the first *J* columns of **UZ** and **B** as the first *J* rows of **V***<sup>T</sup>*, where *J* is the model dimensionality chosen according to some precision criterion and where the diagonal elements of **Z** are considered to be in decreasing singular value order. This yields a matrix **A***<sup>s</sup>* for each source speaker *s* and a vector **b***<sup>c</sup>* for each phonetic content class *c* in the *complete* data (hence also producing the vectors **b***<sup>c</sup>* of the test utterance).

The model adaptation to the target voice *t* can be done in closed form using the phonetic content vectors **b***<sup>c</sup>* learned during training. Suppose the aligned training data from our target speaker *t* consists of *M* spectral vectors which, by convention, we consider to be in *M* different phonetic content classes **C***<sup>T</sup>* = {*c*1, *c*2,..., *cM*}. We can derive the speaker-dependent matrix **A***<sup>t</sup>* that minimizes the total squared error over the target training data

$$E^\* = \sum\_{c \in \mathbf{C}\_T} \left\| \mathbf{y}^{tc} - \mathbf{A}^t \mathbf{b}^c \right\|^2. \tag{10}$$

The missing spectral vectors in the target voice *t* and a phonetic content class *c* of the test sentence can then be synthesized from **y***<sup>tc</sup>* = **A***<sup>t</sup>***b***<sup>c</sup>*. This means we can estimate the target version of the test sentence by multiplying the target speaker matrix **A***<sup>t</sup>* with the phonetic content vectors corresponding to the test sentence.
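The fitting, adaptation and synthesis steps of the bilinear approach can be sketched with NumPy as follows. This is a minimal illustration on synthetic low-rank data, not the implementation of (Popa et al., 2011): the function names, the toy dimensions and the random data are assumptions, and the least-squares solver is simply one way of minimizing Equation 10.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, num_speakers, J):
    """Fit Y ~= A B (Eq. 9) with a rank-J truncated SVD.

    Y stacks the K-dimensional vectors of all speakers: shape (S*K, C).
    Returns per-speaker matrices of shape (S, K, J) and B of shape (J, C).
    """
    U, z, Vt = np.linalg.svd(Y, full_matrices=False)  # z is sorted descending
    A = U[:, :J] * z[:J]   # first J columns of UZ
    B = Vt[:J, :]          # first J rows of V^T
    K = Y.shape[0] // num_speakers
    return A.reshape(num_speakers, K, J), B

def adapt_target_matrix(Y_target, B_target):
    """Least-squares A^t minimizing Eq. 10 for target vectors Y_target (K, M)."""
    A_t_transposed, *_ = np.linalg.lstsq(B_target.T, Y_target.T, rcond=None)
    return A_t_transposed.T

# Toy data (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
S, K, J, C, M = 3, 4, 2, 10, 7           # first M classes: training, rest: test
A_true = rng.normal(size=(S + 1, K, J))  # the last matrix plays the target speaker
B_true = rng.normal(size=(J, C))
Y = np.vstack([A_true[s] @ B_true for s in range(S)])  # complete data, Eq. 8
Y_target_train = A_true[S] @ B_true[:, :M]

A_src, B = fit_asymmetric_bilinear(Y, S, J)
A_t = adapt_target_matrix(Y_target_train, B[:, :M])
Y_target_test = A_t @ B[:, M:]           # y^{tc} = A^t b^c for the test classes
```

Because the toy data are exactly of rank *J*, the synthesized target vectors match the ground truth up to numerical precision; with real spectral data the truncation to *J* dimensions discards detail.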

The performance of the bilinear approach was found to be close to that of a GMM-based conversion with an optimal number of Gaussian components, particularly for reduced training sets. The method benefits from efficient computational algorithms based on SVD. On the downside, the bilinear approach suffers from the oversmoothing problem, as do many other VC techniques (e.g. GMM-based conversion).

#### **2.5 Nonlinear methods**

Artificial neural networks offer a powerful tool for modeling complex (nonlinear) relationships between input and output. They have been applied to voice conversion, for example, in (Desai et al., 2010). The main disadvantage is the massive tuning required when selecting the best architecture for the network. Another alternative for modeling nonlinear relationships is kernel partial least squares regression (Helander et al., 2011): a kernel transformation is carried out on the source data as a preprocessing step and PLS regression is applied to the kernel-transformed data. In addition, the kernel-transformed source data of the current frame is augmented with the kernel-transformed source data of the previous and next frames before the regression calculation. This helps in improving the accuracy of the model and maintaining the temporal continuity, the lack of which is a major problem of many voice conversion algorithms. In (Song et al., 2011), support vector regression was used for nonlinear spectral conversion. Compared to neural networks, the tuning of support vector regression is less demanding.

#### **3. Voice transformation in text-to-speech synthesis**

Text-to-speech or speech synthesis refers to artificial conversion of text into speech. Currently the most widely studied TTS methods are corpus-based: they rely on the use of real recorded speech data. The quality of a text-to-speech system can be measured in terms of how well the synthesized speech can be understood, how natural-sounding it is, and how well the synthesis captures the speaker identity of the training speech data.

Statistical parametric speech synthesis, such as *hidden Markov model* (HMM) based speech synthesis (Tokuda et al., 2002; Yoshimura et al., 1999), provides a flexible framework for TTS with good capabilities for speaker or style adaptation. In this kind of synthesis, the recorded speech data is parameterized into a form that enables control of the perceptually important features of speech, such as the spectrum and the fundamental frequency. Statistical modeling is then used to create models for the speech features based on the labeled training data. The training procedure is quite similar to training in HMM-based speech recognition, but now all of the speech features needed for the analysis/synthesis framework are modeled. The parameters of synthetic speech are generated from the state output and duration statistics of the context-dependent HMMs corresponding to a given input label sequence. Waveform resynthesis is used for creating the actual synthetic speech signal.

In HMM-based speech synthesis, even a relatively small database can be used to produce understandable speech. Models can be easily adapted, and producing new voices or altering speech characteristics such as emotions is easy. The statistical models of an existing HMM voice, trained using data either from one speaker (a speaker-dependent system) or multiple speakers (a speaker-independent system), are adapted using a small amount of data from the target speaker. A typical approach employs linear regression to map the models for the target speaker. The mapping functions are typically different for different sets of models, allowing the individual conversion functions to be simple. This is in contrast to, for example, the GMM-based voice conversion discussed in Section 2.1, which attempts to provide a global conversion model consisting of several linear transforms. In stand-alone VC, it is common to rely on acoustic information only, but in TTS, the phonetic information is usually readily available and can be effectively utilized.

In the following, we discuss the transformation techniques applied in HMM-based speech synthesis. We first give an overview of the basic HMM modeling techniques required in both speaker-dependent and speaker-adaptive synthesis. After that, we discuss speaker-adaptive synthesis, where the average models are transformed using a smaller set of data from a specific target speaker. For most of the discussed ideas, implementations are publicly available in the Hidden Markov Model-Based Speech Synthesis System (HTS) (Tokuda et al., 2011). HTS is a widely used and extensive framework for HMM-based speech synthesis containing tools for HMM-based speech modeling and parameter generation as well as for speaker adaptation.

#### **3.1 Statistical modeling of speech features for synthesis**

Speech modeling using context-dependent HMMs, common to both speaker-dependent and speaker-adaptive synthesis, is described in the following. Many of the core techniques originate from HMM-based speech recognition, summarized in (Rabiner, 1989).

#### **3.1.1 HMM modeling of speech**

HMM-based speech synthesis provides a flexible framework for speech synthesis, where all speech features can be modeled simultaneously within the same multi-stream HMM. Spectral parameter modeling involves continuous-density HMMs with single multivariate Gaussian distributions and typically diagonal covariance matrices or mixtures of such Gaussian distributions. In *F*<sup>0</sup> modeling, multi-space probability distribution HMMs (MSD-HMM) with two types of distributions are used: continuous densities for voiced speech segments and a single symbol for unvoiced segments. A typical modeling scheme uses 5-state left-to-right modeling with no state skipping. In addition to the state output probability distributions the modeling also involves the estimation of state transition probabilities indicating the probability of staying in the state or transferring to the next one.
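The left-to-right topology without state skipping can be made concrete with a small sketch. The helper name `left_to_right_transitions` and the self-loop probabilities are hypothetical; the last column represents leaving the model, which is an assumption about how the exit is encoded.

```python
def left_to_right_transitions(self_loop_probs):
    """Build a left-to-right transition matrix with no state skipping.

    Row i holds the probability of staying in state i (column i) and of
    moving to the next state (column i + 1); the final column stands for
    leaving the model after the last emitting state.
    """
    num_states = len(self_loop_probs)
    matrix = [[0.0] * (num_states + 1) for _ in range(num_states)]
    for i, stay in enumerate(self_loop_probs):
        matrix[i][i] = stay
        matrix[i][i + 1] = 1.0 - stay
    return matrix

def expected_duration(stay):
    """Mean number of frames spent in a state with self-loop probability `stay`."""
    return 1.0 / (1.0 - stay)

# A 5-state model with illustrative self-loop probabilities.
transitions = left_to_right_transitions([0.6, 0.7, 0.8, 0.7, 0.6])
durations = [expected_duration(row[i]) for i, row in enumerate(transitions)]
```

The transition probabilities alone imply a geometric distribution of the state duration, which is one motivation for the explicit duration models discussed later in this section.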

The training phase aims at determining the model parameters of the HMMs based on the training data. These parameters include the means and covariances of the state output probability distributions and the probabilities of the state transitions. The parameter set *λ*<sup>∗</sup> that maximizes the likelihood of the training data **O** is:

$$\lambda^\* = \underset{\lambda}{\text{arg}\,\text{max}}\, P(\mathbf{O}|\lambda) = \underset{\lambda}{\text{arg}\,\text{max}}\,\sum\_{all\,\mathbf{q}}P(\mathbf{O},\mathbf{q}|\lambda). \tag{11}$$

Here **q** is a hidden variable denoting an HMM state sequence, each state having output probability distributions for each speech feature and transition probability. Due to the hidden variable there is no analytical solution for the problem. A local optimum can be found using Baum-Welch estimation.
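The sum over all state sequences in Equation 11 does not have to be enumerated explicitly: the forward recursion, which is also used inside Baum-Welch estimation, computes the same quantity efficiently. The sketch below compares the two on a toy model. The function names and all numbers are hypothetical, and the per-frame state likelihoods are assumed to be precomputed rather than evaluated from Gaussian densities.

```python
from itertools import product

def forward_likelihood(initial, transitions, frame_likelihoods):
    """P(O | lambda) via the forward recursion.

    frame_likelihoods[t][j] is the (assumed, precomputed) output
    probability of observation o_t in state j.
    """
    num_states = len(initial)
    alpha = [initial[j] * frame_likelihoods[0][j] for j in range(num_states)]
    for likelihoods in frame_likelihoods[1:]:
        alpha = [
            sum(alpha[i] * transitions[i][j] for i in range(num_states)) * likelihoods[j]
            for j in range(num_states)
        ]
    return sum(alpha)

def brute_force_likelihood(initial, transitions, frame_likelihoods):
    """The same quantity as an explicit sum of P(O, q | lambda) over all q."""
    num_frames, num_states = len(frame_likelihoods), len(initial)
    total = 0.0
    for q in product(range(num_states), repeat=num_frames):
        p = initial[q[0]] * frame_likelihoods[0][q[0]]
        for t in range(1, num_frames):
            p *= transitions[q[t - 1]][q[t]] * frame_likelihoods[t][q[t]]
        total += p
    return total

# Toy 3-state left-to-right model and four frames of illustrative likelihoods.
initial = [1.0, 0.0, 0.0]
transitions = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
frame_likelihoods = [[0.9, 0.2, 0.1], [0.5, 0.6, 0.1], [0.1, 0.7, 0.4], [0.1, 0.2, 0.8]]

fast = forward_likelihood(initial, transitions, frame_likelihoods)
slow = brute_force_likelihood(initial, transitions, frame_likelihoods)
```

The brute-force sum grows exponentially with the number of frames, whereas the forward recursion is linear in the number of frames and quadratic in the number of states.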

The use of acceptable state durations is essential for high-quality synthesis. Hence, in addition to the speech features such as spectral parameters and *F*0, a model for the speech rhythm is needed as well. It is modeled through the state duration distributions, employing either Gaussian (Yoshimura et al., 1998) or Gamma distributions, and in the synthesis phase these state duration distributions are used to determine how many frames are generated from each HMM state. In the conventional approach, the duration distributions are derived from the statistics of the last iteration of the HMM training. The duration densities are used in synthesis but they are not present in the standard HMM training. More accurate modeling can be achieved using hidden semi-Markov model (HSMM) based techniques (Zen et al., 2004), where the duration distributions are explicitly present already during the parameter re-estimation of the training phase.

In the synthesis phase, the trained HMMs are used to generate speech parameters for text unseen in the training data. Waveform resynthesis then turns these parameters into an acoustic speech waveform using e.g. vocoding. A sentence HMM is formed by concatenating the required context-dependent state models. The maximum likelihood estimate for the synthetic speech parameter sequence **O** = {**o**1, **o**2,..., **o***T*} is (Tokuda et al., 2000):

$$\mathbf{O}^\* = \mathop{\arg\max}\_{\mathbf{O}} P(\mathbf{O}|\lambda, T). \tag{12}$$

The solution of Equation 12 can be approximated by dividing the estimation into the separate search of the optimal state sequence **q**∗ and maximum likelihood observations **O**∗ given the state sequence:

$$\begin{aligned} \mathbf{q}^\* &= \operatorname\*{arg\,max}\_{\mathbf{q}} P(\mathbf{q} | \boldsymbol{\lambda}, T) \\ \mathbf{O}^\* &= \operatorname\*{arg\,max}\_{\mathbf{O}} P(\mathbf{O} | \mathbf{q}^\*, \boldsymbol{\lambda}, T) . \end{aligned} \tag{13}$$

To introduce continuity in synthesis, dynamic modeling is typically used. Without the delta-augmentation the parameter generation algorithm would only output a sequence of mean vectors corresponding to the state sequence **q**∗. The delta-augmented observation vectors **o***<sup>t</sup>* contain both static **c***<sup>t</sup>* and dynamic feature values Δ**c***t*:

$$\mathbf{o}\_t = \begin{bmatrix} \mathbf{c}\_t^T, \Delta \mathbf{c}\_t^T \end{bmatrix}^T, \tag{14}$$

where the dynamic feature vectors Δ**c***<sup>t</sup>* are defined as:

$$
\Delta \mathbf{c}\_{t} = \mathbf{c}\_{t} - \mathbf{c}\_{t-1}.\tag{15}
$$

This can be written in the matrix form as **O** = **WC**:

$$
\overbrace{\begin{bmatrix} \vdots \\ \mathbf{c}\_{t-1} \\ \Delta\mathbf{c}\_{t-1} \\ \mathbf{c}\_{t} \\ \Delta\mathbf{c}\_{t} \\ \mathbf{c}\_{t+1} \\ \Delta\mathbf{c}\_{t+1} \\ \vdots \end{bmatrix}}^{\mathbf{O}} = \overbrace{\begin{bmatrix} \cdots & \vdots & \vdots & \vdots & \vdots & \cdots \\ \cdots & \mathbf{0} & \mathbf{I} & \mathbf{0} & \mathbf{0} & \cdots \\ \cdots & -\mathbf{I} & \mathbf{I} & \mathbf{0} & \mathbf{0} & \cdots \\ \cdots & \mathbf{0} & \mathbf{0} & \mathbf{I} & \mathbf{0} & \cdots \\ \cdots & \mathbf{0} & -\mathbf{I} & \mathbf{I} & \mathbf{0} & \cdots \\ \cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I} & \cdots \\ \cdots & \mathbf{0} & \mathbf{0} & -\mathbf{I} & \mathbf{I} & \cdots \\ \cdots & \vdots & \vdots & \vdots & \vdots & \cdots \end{bmatrix}}^{\mathbf{W}} \overbrace{\begin{bmatrix} \vdots \\ \mathbf{c}\_{t-2} \\ \mathbf{c}\_{t-1} \\ \mathbf{c}\_{t} \\ \mathbf{c}\_{t+1} \\ \vdots \end{bmatrix}}^{\mathbf{C}} \tag{16}
$$

If the state output probability distributions are modeled as single Gaussian distributions, the ML solution for a feature sequence **C**∗ is:

$$\mathbf{C}^\* = \mathop{\arg\max}\_{\mathbf{C}} P(\mathbf{W}\mathbf{C}|\mathbf{q}^\*, \lambda, T) = \mathop{\arg\max}\_{\mathbf{C}} \mathcal{N}(\mathbf{W}\mathbf{C}; \boldsymbol{\mu}\_{\mathbf{q}^\*}, \boldsymbol{\Sigma}\_{\mathbf{q}^\*}), \tag{17}$$

where **μ**<sub>**q**∗</sub> and **Σ**<sub>**q**∗</sub> refer to the mean and covariance of the state output probabilities of the state sequence **q**∗. The solution of Equation 17 can be found in closed form.
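Since Equation 17 maximizes a Gaussian in **WC**, it is a weighted least-squares problem, and setting the gradient to zero gives **C**∗ = (**W**<sup>*T*</sup>**Σ**<sup>−1</sup>**W**)<sup>−1</sup>**W**<sup>*T*</sup>**Σ**<sup>−1</sup>**μ**. The NumPy sketch below builds **W** from Equation 15 and solves this system. It is an illustration under stated assumptions rather than the HTS implementation: the function names are hypothetical, the covariances are assumed diagonal, and the delta of the first frame is treated as zero because **c**<sub>0</sub> does not exist.

```python
import numpy as np

def build_window_matrix(num_frames, dim):
    """W of Eq. 16 for delta c_t = c_t - c_{t-1} (Eq. 15).

    Rows are ordered [c_1, delta c_1, c_2, delta c_2, ...]. The delta row of
    the first frame is left at zero (a boundary assumption of this sketch).
    """
    W = np.zeros((2 * num_frames * dim, num_frames * dim))
    I = np.eye(dim)
    for t in range(num_frames):
        row = 2 * t * dim
        W[row:row + dim, t * dim:(t + 1) * dim] = I                  # static part
        if t > 0:
            W[row + dim:row + 2 * dim, t * dim:(t + 1) * dim] = I     # +c_t
            W[row + dim:row + 2 * dim, (t - 1) * dim:t * dim] = -I    # -c_{t-1}
    return W

def generate_parameters(means, variances):
    """Closed-form solution of Eq. 17 for diagonal covariances.

    means and variances have shape (T, 2*D): per-frame static and delta
    statistics picked from the states of the sequence q*.
    """
    num_frames, dim = means.shape[0], means.shape[1] // 2
    W = build_window_matrix(num_frames, dim)
    mu = means.reshape(-1)
    precision = 1.0 / variances.reshape(-1)
    Wt_precision = W.T * precision                    # W^T Sigma^{-1}
    C = np.linalg.solve(Wt_precision @ W, Wt_precision @ mu)
    return C.reshape(num_frames, dim)

# Consistency check: static and delta means that agree with one trajectory.
means = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [3.0, 2.0]])
variances = np.ones_like(means)
trajectory = generate_parameters(means, variances)
```

When the static and delta means are mutually consistent, as in this toy case, the generated trajectory reproduces the static means exactly. In general, the delta statistics pull the trajectory away from the piecewise-constant state means, which is what introduces continuity.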

HMM-based speech synthesis suffers from the same problem as GMM-based voice conversion: the statistical modeling loses fine details and introduces oversmoothing in the generated speech parameter trajectories. Postfiltering of the generated spectral parameters can be utilized to improve the synthesis quality. Another widely used approach for restoring the natural variance of the speech parameters is to use global variance modeling (Toda & Tokuda, 2007) in speech parameter generation.

#### **3.1.2 Labeling with rich context features**

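The prosody of HMM-based speech synthesis is controlled by the context-dependent labeling. It tries to capture the language-dependent contextual variation in the speech unit waveforms. Separate models are trained for each phoneme in different contexts. In addition to phoneme identities, a large set of other phonetic and prosodic features related to, for instance, position, stress, accent, part of speech and number of different phonetic units are used to make a distinction between different context-dependent phonemes. No high-level linguistic knowledge is needed and instead, the characteristics of the speech in different contexts are automatically learned from the training data. In (Tokuda et al., 2011), the following features are included in context-dependent labeling of English data:

- Phoneme level: Phoneme identity of the current and two preceding/succeeding phonemes and position in a syllable.
- Syllable level: Number of phonemes/accent/stress of the current/preceding/succeeding syllable, position in a word/phrase, number of preceding/succeeding stressed/accented syllables in a phrase, distance from the previous/following stressed/accented syllable, and phoneme identity of the syllable vowel.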



The above-mentioned HMM adaptation approaches are discussed in more detail in the following. In addition to the MAP and linear regression derivatives originating from the speaker adaptation of HMM-based speech recognition, the adaptation approaches used in stand-alone voice conversion can be applied in HMM-based speech synthesis. In *speaker interpolation* of HMMs (Yoshimura et al., 1997), a set of HMMs from representative speakers is interpolated to form models matching the characteristics of the target speaker's voice. The interpolation of an HMM set can change the synthetic speech smoothly from the existing voice to the target voice by changing the interpolation ratio. In addition to speaker adaptation, interpolation can be used, for instance, in emotion or speaking-style conversion. The *eigenvoice* approach (Shichiri et al., 2002), also familiar from voice conversion (Kuhn et al., 2000), tackles the problem of how to determine the interpolation ratio by constructing a speaker-specific super-vector from all the state output mean vectors of each speaker-, emotion-, or style-dependent HMM set. The dimension of the super-vector is reduced by PCA and the new HMM set is reconstructed from the first eigenvoices (eigenvectors).

#### **3.2.1 Maximum a posteriori adaptation**

Maximum a posteriori adaptation of HMMs updates the parameters of each state output probability distribution according to the given adaptation data. If we have some knowledge of what the model parameters are likely to be before observing any data, even a limited amount of data from the target speaker can be enough to adapt the model parameters. In MAP adaptation of HMMs (Lee et al., 1991; Masuko et al., 1997), this prior information on the model parameters is taken into account when deriving the new output distributions.

The MAP estimate for the HMM parameters *λ* is defined as the mode of the posterior probability distribution *P*(*λ*|**O**) given the prior probability *P*(*λ*) and the data **O** = {**o**1, **o**2,..., **o***T*} (Lee et al., 1991):

$$\bar{\lambda} = \underset{\lambda}{\text{arg}\,\text{max}}\, P(\lambda|\mathbf{O}) = \underset{\lambda}{\text{arg}\,\text{max}}\, P(\mathbf{O}|\lambda)\,P(\lambda). \tag{18}$$

The speaker-independent models can be used as informative priors that are updated according to the adaptation data. In the MAP adaptation approach of (Masuko et al., 1997), the adaptation data are segmented by Viterbi alignment of HMMs, and the state means and covariances are updated using the data assigned to the state.

The use of prior information is useful when only a small amount of training data is available. However, every distribution is adapted individually, and for small or sparse adaptation data, MAP estimates may be unreliable and there might even be states for which no new set of parameters is trained. This makes the synthesis jump between the average voice and the target voice even within a sentence. Vector-field-smoothing (VFS) (Takahashi & Sagayama, 1995) can be used to alleviate the problem: it uses *K* nearest neighbor distributions to interpolate means and covariances for the distributions having no adaptation data available. A rather similar approach can also be used for smoothing the means and the covariances of the adapted distributions.

Word level: Part of speech of the current/preceding/succeeding word, number of syllables in the current/preceding/succeeding word, position in the phrase, number of preceding/succeeding content words, number of words from previous/next content word.

Phrase level: Number of syllables in the preceding/current/succeeding phrase, position in a major phrase, and ToBI endtone.

Utterance level: Number of syllables/words/phrases in the utterance.

Even though context-dependent labeling enables the separation of different contexts in the modeling, it also makes the training data very sparse. Collecting a training database that would include enough training data to estimate reliable models for all possible context-dependent labels of a language is practically impossible. To pool acoustically similar models and to provide a prediction mechanism for labels not seen in the training data, decision tree clustering using a set of binary questions and the *minimum description length* (MDL) criterion (Shinoda & Watanabe, 2000) is often employed. The construction of an MDL-based decision tree takes into account both the acoustic similarity of the state output probability distributions assigned to each node and the overall complexity of the resulting tree. In the synthesis phase, the input text is parsed to form a context-dependent label sequence, and the tree is traversed from the root to the leaves to find the cluster for each synthesis label.

#### **3.2 Changing voice characteristics in HMM-based speech synthesis**

Speaker adaptation provides an efficient way of creating new synthetic voices for HMM-based speech synthesis. Once an initial model is trained, either speaker-dependently (SD) or speaker-independently (SI), its parameters can be adapted for an unlimited number of new speakers, speaking styles, or emotions using only a small number of adaptation sentences. An extreme example is given in (Yamagishi et al., 2010), where thousands of new English, Finnish, Spanish, Japanese, and Mandarin synthesis voices were created by adapting the trained average voices using only a limited number of adaptation sentences from each target speaker.

In adaptive HMM-based speech synthesis, there is no need for parallel data. The adaptation updates the HMM model parameters, including the state output probability distributions and the duration densities, using data from the target speaker or speaking style. The first speaker adaptation approaches were developed for standard HMMs, but HSMMs with explicit duration modeling have been widely used in adaptation as well. The commonly used methods for speaker adaptation include *maximum a posteriori* (MAP) adaptation (Lee et al., 1991), *maximum likelihood linear regression* (MLLR) adaptation (Leggetter & Woodland, 1995), *structural maximum a posteriori linear regression* (SMAPLR) adaptation (Shiohan et al., 2002), and their variants. In MAP adaptation of HMMs, each Gaussian distribution is updated according to the new data and the prior probability of the model. MLLR and SMAPLR, on the other hand, use linear regression to convert the existing model parameters to match the characteristics of the adaptation data; to cope with data sparseness, models are typically clustered and a shared transformation is trained for the models of each cluster. While MAP-based adaptation can only update distributions that have observations in the adaptation data, MLLR and SMAPLR, which use a linear conversion to transform the existing parameters into new ones, can adapt any distribution. The adaptation performance of MLLR or SMAPLR can be further improved by using *speaker-adaptive training* (SAT) to prevent a single speaker's data from biasing the training of the average voice.

The above-mentioned HMM adaptation approaches are discussed in more detail in the following. In addition to the MAP and linear regression derivatives originating from the speaker adaptation of HMM-based speech recognition, the adaptation approaches used in stand-alone voice conversion can be applied in HMM-based speech synthesis. In *speaker interpolation* of HMMs (Yoshimura et al., 1997), a set of HMMs from representative speakers is interpolated to form models matching the characteristics of the target speaker's voice. By changing the interpolation ratio, the interpolation of an HMM set can change the synthetic speech smoothly from the existing voice to the target voice. In addition to speaker adaptation, interpolation can be used, for instance, in emotion or speaking-style conversion. The *eigenvoice* approach (Shichiri et al., 2002), also familiar from voice conversion (Kuhn et al., 2000), tackles the problem of how to determine the interpolation ratio by constructing a speaker-specific super-vector from all the state output mean vectors of each speaker-, emotion-, or style-dependent HMM set. The dimension of the super-vector is reduced by PCA, and the new HMM set is reconstructed from the first eigenvoices (eigenvectors).
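As a rough illustration of speaker interpolation, the sketch below linearly combines the mean vectors of a single state taken from several representative speakers. The function name `interpolate_means` and the toy numbers are hypothetical, and only the means are handled; covariances and duration models would need their own treatment, which this sketch does not attempt.

```python
def interpolate_means(speaker_means, ratios):
    """Linearly interpolate one state's mean vector across speakers.

    speaker_means: list of mean vectors, one per representative speaker
    ratios: interpolation ratios, assumed here to sum to 1
    """
    if len(speaker_means) != len(ratios):
        raise ValueError("need one ratio per speaker")
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios are assumed to sum to 1")
    dim = len(speaker_means[0])
    return [sum(a * mu[d] for a, mu in zip(ratios, speaker_means))
            for d in range(dim)]


# Toy example: move from voice A to voice B by changing the ratio.
mean_a = [1.0, 2.0]
mean_b = [3.0, 6.0]
for ratio in (0.0, 0.5, 1.0):
    print(ratio, interpolate_means([mean_a, mean_b], [1.0 - ratio, ratio]))
```

Sweeping the ratio from 0 to 1 moves the interpolated mean from the first voice to the second, which is the smooth transition described above.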

#### **3.2.1 Maximum a posteriori adaptation**

Maximum a posteriori adaptation of HMMs updates the parameters of each state output probability distribution according to the given adaptation data. If some knowledge of the likely values of the model parameters is available before any data are observed, even a limited amount of data from the target speaker can be enough to adapt the model parameters. In MAP adaptation of HMMs (Lee et al., 1991; Masuko et al., 1997), this prior information about the model parameters is taken into account when deriving the new output distributions.

The MAP estimate of the HMM parameters *λ* is defined as the mode of the posterior probability distribution *P*(*λ*|**O**), given the prior probability *P*(*λ*) and the data **O** = {**o**<sub>1</sub>, **o**<sub>2</sub>, ..., **o**<sub>*T*</sub>} (Lee et al., 1991):

$$\bar{\lambda} = \underset{\lambda}{\text{arg}\, \text{max}} \, P\left(\lambda | \mathbf{O}\right) = \underset{\lambda}{\text{arg}\, \text{max}} \, P\left(\mathbf{O} | \lambda\right) P\left(\lambda\right) \,. \tag{18}$$

The speaker-independent models can be used as informative priors that are updated according to the adaptation data. In the MAP adaptation approach of (Masuko et al., 1997), the adaptation data are segmented by Viterbi alignment of HMMs and state means and covariances are updated using the data assigned to the state.
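The mean update can be sketched as follows, assuming a conjugate Gaussian prior centered on the speaker-independent mean and a hard (Viterbi) assignment of frames to the state. The prior weight `tau` and the function name `map_update_mean` are assumptions made for illustration; the cited papers may use a different parameterization, and the covariance update is omitted.

```python
def map_update_mean(prior_mean, frames, tau=10.0):
    """MAP-style update of one state's mean vector.

    prior_mean: speaker-independent mean, used as the prior mean
    frames: adaptation frames assigned to this state by the alignment
    tau: hypothetical prior weight (> 0); larger values trust the prior more
    """
    n = len(frames)
    dim = len(prior_mean)
    frame_sum = [sum(frame[d] for frame in frames) for d in range(dim)]
    # Weighted combination of the prior mean and the observed data.
    return [(tau * prior_mean[d] + frame_sum[d]) / (tau + n)
            for d in range(dim)]


si_mean = [0.0, 0.0]
print(map_update_mean(si_mean, [[2.0, 4.0], [2.0, 4.0]], tau=2.0))
print(map_update_mean(si_mean, [], tau=2.0))  # no data: the prior mean is kept
```

With few frames the estimate stays close to the prior mean, and with many frames it approaches the sample mean of the adaptation data. A state with no assigned frames keeps the speaker-independent mean unchanged.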

Prior information is useful when only a small amount of training data is available. However, every distribution is adapted individually, and with a small amount of, or sparse, adaptation data, the MAP estimates may be unreliable; there might even be states for which no new set of parameters is trained. This makes the synthesis jump between the average voice and the target voice even within a sentence. Vector-field-smoothing (VFS) (Takahashi & Sagayama, 1995) can be used to alleviate the problem: it uses the *K* nearest neighbor distributions to interpolate means and covariances for the distributions for which no adaptation data are available. A rather similar approach can also be used for smoothing the means and the covariances of the adapted distributions.
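The neighbor-based idea can be sketched as below. This is a simplified stand-in rather than the published VFS algorithm: it averages the mean shifts of the *K* nearest adapted states with equal weights, whereas a real implementation would likely weight neighbors by distance and also treat covariances. All names are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smooth_unadapted(original_means, adapted_means, k=2):
    """Fill in means for states that received no adaptation data.

    original_means: dict state -> average-voice mean (all states)
    adapted_means: dict state -> adapted mean (adapted states only)
    """
    # Shift (transfer vector) observed for every adapted state.
    shifts = {s: [a - o for a, o in zip(adapted_means[s], original_means[s])]
              for s in adapted_means}
    result = {s: list(mu) for s, mu in adapted_means.items()}
    for state, mu in original_means.items():
        if state in adapted_means:
            continue
        if not shifts:  # nothing was adapted at all: keep the original mean
            result[state] = list(mu)
            continue
        nearest = sorted(
            shifts, key=lambda s: euclidean(mu, original_means[s]))[:k]
        avg_shift = [sum(shifts[s][d] for s in nearest) / len(nearest)
                     for d in range(len(mu))]
        result[state] = [m + v for m, v in zip(mu, avg_shift)]
    return result


original = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [10.0, 0.0]}
adapted = {"a": [1.0, 1.0], "c": [10.0, 2.0]}  # state "b" had no data
print(smooth_unadapted(original, adapted, k=1))
```

Here state `b` borrows the shift of its nearest adapted neighbor, so it no longer stays at the average voice while its neighbors move toward the target voice.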

#### **3.2.2 Maximum likelihood linear regression adaptation**

Adaptation using mapping of the existing HMM distribution parameters according to the adaptation data avoids the MAP adaptation problem of non-updated distributions. HMM adaptation using maximum likelihood linear regression (MLLR) to find such transformations (Leggetter & Woodland, 1995) was first applied in HMM-based speech synthesis in (Tamura et al., 1998). In MLLR adaptation, a linear mapping of the model distributions is found in a way that the likelihood of the adaptation data from the target speaker is maximized. Regression or decision tree-based clustering is used to tie similar models for the adaptation and the transformation is shared across the distributions of each cluster. Sharing the transformations across multiple distributions decreases the amount of data needed for the adaptation. Hence, MLLR-based adaptation often works better than MAP adaptation if only a small amount of data is used (Zen et al., 2009).

The model for the target voice is created by mapping the output probability distributions of an existing voice using a set of linear transforms. The *i*th multivariate Gaussian distribution of an MLLR-adapted voice is of the form:

$$b_{i}(\mathbf{o}_{t}) = \mathcal{N}\left(\mathbf{o}_{t}; \boldsymbol{\zeta}\boldsymbol{\mu}_{i} + \boldsymbol{\epsilon}, \boldsymbol{\Sigma}_{i}\right) = \mathcal{N}\left(\mathbf{o}_{t}; \mathbf{W}\boldsymbol{\xi}_{i}, \boldsymbol{\Sigma}_{i}\right), \tag{19}$$

where μ<sub>*i*</sub> and **Σ**<sub>*i*</sub> are the mean and covariance of the average voice distribution, ζ and ε are the mapping and the bias, and ξ<sub>*i*</sub> = [μ<sub>*i*</sub><sup>*T*</sup>, 1]<sup>*T*</sup>. The transformation **W** = [ζ, ε] is tied across the distributions of each cluster. The transformation **W**¯ is the one that maximizes the likelihood of the adaptation data **O** = {**o**<sub>1</sub>, **o**<sub>2</sub>, ..., **o**<sub>*T*</sub>}:

$$\bar{\mathbf{W}} = \underset{\mathbf{W}}{\arg\max} \, P\left(\mathbf{O} \,|\, \lambda, \mathbf{W}\right). \tag{20}$$

Baum-Welch estimation can be used to find **W**¯ .
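To make the shared transformation concrete, the sketch below estimates a scalar mapping and bias for one cluster. It is a deliberately reduced case and not the Baum-Welch procedure itself: features are one-dimensional, frames are assumed to be hard-aligned to states, and all states are assumed to share the same variance, under which maximizing the likelihood reduces to ordinary least squares. The function names are hypothetical.

```python
def estimate_scalar_mllr(state_means, aligned_frames):
    """Least-squares estimate of (zeta, eps) so that o ~ zeta * mu + eps.

    state_means: average-voice means of the states in one cluster
    aligned_frames: list of (state_index, observation) pairs
    """
    xs = [state_means[s] for s, _ in aligned_frames]
    ys = [o for _, o in aligned_frames]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("need frames from at least two distinct state means")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    zeta = sxy / sxx
    eps = mean_y - zeta * mean_x
    return zeta, eps


def adapt_means(state_means, zeta, eps):
    """Apply the shared transform to every state mean in the cluster."""
    return [zeta * mu + eps for mu in state_means]


means = [0.0, 1.0, 2.0, 3.0]
frames = [(0, 1.0), (1, 3.0), (2, 5.0)]  # state 3 has no adaptation data
zeta, eps = estimate_scalar_mllr(means, frames)
print(zeta, eps, adapt_means(means, zeta, eps))
```

Because the transformation is shared across the cluster, the fourth state is adapted even though no frames were assigned to it, which is the property that distinguishes this family of methods from MAP adaptation.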

In the standard MLLR adaptation, the model means are adapted but the covariances are taken from the existing model. The adaptation of the distribution variances is needed especially in *F*<sup>0</sup> adaptation. In the constrained MLLR (CMLLR), both the model means and the covariances are transformed using the same set of transformations estimated simultaneously. The adapted means and covariances are transformed from the average voice means and covariances of the existing models using the same transformation matrix ζ:

$$b_{i}(\mathbf{o}_{t}) = \mathcal{N}\left(\mathbf{o}_{t}; \boldsymbol{\zeta}\boldsymbol{\mu}_{i} + \boldsymbol{\epsilon}, \boldsymbol{\zeta}\boldsymbol{\Sigma}_{i}\boldsymbol{\zeta}^{T}\right). \tag{21}$$
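A small worked instance of Equation 21 is given below: the same matrix ζ is applied to both the mean and the covariance of one Gaussian. The helper names and the numerical values are made up for illustration, and the estimation of ζ and ε is not shown.

```python
def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cmllr_adapt(mean, cov, zeta, eps):
    """Return (zeta * mean + eps, zeta * cov * zeta^T)."""
    new_mean = [x + e for x, e in zip(mat_vec(zeta, mean), eps)]
    new_cov = mat_mul(mat_mul(zeta, cov), transpose(zeta))
    return new_mean, new_cov


zeta = [[2.0, 0.0], [0.0, 3.0]]
eps = [1.0, -1.0]
mean = [1.0, 1.0]
cov = [[1.0, 0.5], [0.5, 2.0]]
print(cmllr_adapt(mean, cov, zeta, eps))
```

In this toy case the adapted covariance remains symmetric, as expected from the form ζΣζ<sup>T</sup>.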

MLLR-based HMM adaptation of continuous-density spectral parameters can be extended to adapt the parameters of the MSD-HMMs used in *F*<sub>0</sub> modeling (Tamura et al., 2001a) and the parameters of the state duration distributions (Tamura et al., 2001b) as well. In HSMM modeling, the state duration distributions are present in the HMM training from the beginning. The transformed HSMM distributions also have the form of Equation 19 or Equation 21 (Yamagishi & Kobayashi, 2007); the difference is that the state duration distributions have scalar means and variances.

#### **3.2.3 Structural maximum a posteriori linear regression adaptation**

MLLR and CMLLR adaptation work well in the average voice construction since a lot of training data is available from multiple speakers. However, in the model adaptation, the amount of speech data from each target speaker is typically rather small; hence, the MAP criterion, being more robust than the ML criterion, might be more attractive. HMM adaptation by structural maximum a posteriori linear regression (SMAPLR) (Shiohan et al., 2002) combines the idea of linear mapping of the HMM distributions with structural MAP (SMAP), which exploits a tree structure to derive the prior distributions. The use of constrained SMAPLR (CSMAPLR) in adaptive HSMM-based speech synthesis was introduced in (Yamagishi et al., 2009), and it is widely used for the speaker adaptation task in speech synthesis.

Replacing the ML criterion in MLLR with the MAP criterion leads to a model that also takes into account some prior information about the transform **W**:

$$\bar{\mathbf{W}} = \underset{\mathbf{W}}{\arg\max} \, P\left(\mathbf{W}|\mathbf{O}, \lambda\right) = \underset{\mathbf{W}}{\arg\max} \, P\left(\mathbf{O}|\mathbf{W}, \lambda\right) \, P\left(\mathbf{W}\right). \tag{22}$$

In the best case, the use of the MAP criterion can help to avoid training unrealistic transformations that would not generalize well to unseen content. Furthermore, well-selected prior distributions can increase the conversion accuracy. In SMAPLR adaptation (Shiohan et al., 2002), a hierarchical tree structure is used to derive priors that better take into account the relation and similarity of different distributions. For the root node, a global transform is computed using all the adaptation data. The rest of the nodes recursively inherit their prior distributions from their parent nodes: the hyperparameters of the parent node posterior distribution *P*(**W**|**O**, *λ*) are propagated to the child nodes, where the distribution is approximated and used as the prior distribution *P*(**W**). In each node, the MAPLR transformation **W** is derived as in Equation 22 using the prior distribution and the adaptation data assigned to the node.
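The recursive propagation of priors down the tree can be illustrated with a heavily simplified sketch. It is not the SMAPLR algorithm as published: the transform is reduced to a single scalar bias, the prior is a Gaussian centered on the parent's estimate with a hypothetical weight `tau`, and the data of a node are represented by residuals (observation minus average-voice mean) pooled over its subtree. All names and the tree layout are assumptions.

```python
def pooled_residuals(node):
    """Collect the residuals of a node and of all its descendants."""
    data = list(node.get("residuals", []))
    for child in node.get("children", []):
        data.extend(pooled_residuals(child))
    return data


def smap_biases(node, parent_bias=0.0, tau=5.0):
    """MAP estimate of a scalar bias per node, using the parent as prior.

    tau (> 0) is a hypothetical prior weight. The root uses a zero bias
    (no change to the average voice) as its prior.
    """
    data = pooled_residuals(node)
    bias = (tau * parent_bias + sum(data)) / (tau + len(data))
    result = {node["name"]: bias}
    for child in node.get("children", []):
        result.update(smap_biases(child, parent_bias=bias, tau=tau))
    return result


tree = {
    "name": "root",
    "children": [
        {"name": "vowels", "residuals": [2.0] * 5},
        {"name": "consonants", "residuals": []},  # no adaptation data
    ],
}
print(smap_biases(tree))
```

A node without data falls back to its parent's estimate, while a node with data moves away from the parent only as far as the data justify. This mirrors the role of the tree-derived priors described above.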

#### **3.2.4 Speaker adaptive training**

The amount of training data from the target speaker is typically small, whereas the initial models are usually estimated from a large set of training data, preferably spoken by multiple speakers. This *speaker independent* (SI) training with multi-speaker training data, which results in an average voice HMM, usually provides a more robust basis for the mapping than *speaker-dependent* (SD) training using only single-speaker data (Yamagishi & Kobayashi, 2007). In addition, especially in *F*<sub>0</sub> modeling, larger datasets tend to provide more complete modeling, making average voice training even more attractive than speaker-dependent modeling (Yamagishi & Kobayashi, 2007).

The average voice used for adaptation should provide high-quality mapping to various target voices and should not have bias from single speakers' data. *Speaker adaptive training* (SAT) of HMMs introduced in (Anastasakos et al., 1996) and applied in HSMM-based speech synthesis in (Yamagishi, 2006; Yamagishi & Kobayashi, 2005), addresses the problem by estimating the average voice parameters simultaneously with the linear-regression-based transformation reducing the influence of speaker differences. While the SI training aims at finding the best set of model parameters, SAT searches for both the speaker adaptation parameters and the average voice parameters that provide the maximum likelihood result in the transformation.

In SAT, the set of HSMM model parameters *λ*<sub>SAT</sub> and the adaptation parameters Λ<sub>SAT</sub> are optimized jointly for all *F* speakers using the maximum likelihood criterion (Yamagishi & Kobayashi, 2005):

Chen, Y., Chu, M., Chang, E., Liu, J. & Liu, R. (2003). Voice conversion with smoothed GMM

Voice Conversion 91

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete

Desai, S., Black, A., Yegnanarayana, B. & Prahallad, K. (2010). Spectral mapping using

Erro, D., Moreno, A. & Bonafonte, A. (2010a). INCA algorithm for training voice

Erro, D., Moreno, A. & Bonafonte, A. (2010b). Voice conversion based on weighted frequency

Eslami, M., Sheikhzadeh, H. & Sayadiyan, A. (2011). Quality improvement of voice

Fujimura, O. (1968). An approximation to voice aperiodicity, *IEEE Trans. Audio Electroacoust.*

Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural networks and the bias/variance

Gillet, B. & King, S. (2003). Transforming F0 contours, *Proc. of Eurospeech*, Geneve, pp. 101–104. Helander, E. & Nurminen, J. (2007a). A novel method for prosody prediction in voice

Helander, E. & Nurminen, J. (2007b). On the importance of pure prosody in the perception of

Helander, E., Schwarz, J., Nurminen, J., Silén, H. & Gabbouj, M. (2008). On the impact of alignment on voice conversion performance, *Proc. of Interspeech*, pp. 1453–1456. Helander, E., Silén, H., Miguez, J. & Gabbouj, M. (2010b). Maximum a posteriori voice

Helander, E., Silén, H., Virtanen, T. & Gabbouj, M. (2011). Voice conversion using dynamic

Helander, E., Virtanen, T., Nurminen, J. & Gabbouj, M. (2010a). Voice conversion using partial least squares regression, *IEEE Trans. Audio, Speech, Lang. Process.* 18(5): 912–921. Kain, A. & Macon, M. W. (1998). Spectral voice conversion for text-to-speech synthesis, *Proc.*

Kawahara, H., Masuda-Katsuse, I. & de Cheveigné, A. (1999). Restructuring

Kim, S.-J., Kim, J.-J. & Hahn, M. (2006). HMM-based Korean speech synthesis system for

Kondoz, A. M. (2004). *Digital speech coding for low bit rate communication systems*, Wiley and

Kuhn, R., Junqua, J.-C., Nguyen, P. & Niedzielski, N. (2000). Rapid speaker adaptation in

hand-held devices, *IEEE Trans. Consum. Electron.* 52(4): 1384–1390.

eigenvoice space, *IEEE Trans. Speech Audio Process.* 8(6): 695–707.

conversion using sequential Monte Carlo methods, *Proc. of Interspeech*, pp. 1716–1719.

kernel partial least squares regression, *IEEE Trans. Audio, Speech, Lang. Process.* To

speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure

warping, *IEEE Trans. Audio, Speech, Lang. Process.* 18(5): 922–931.

data via the EM algorithm, *Journal of the Royal Statistical Society: Series B (Statistical*

artificial neural networks for voice conversion, *IEEE Trans. Audio, Speech, Lang.*

conversion systems from nonparallel corpora, *IEEE Trans. Audio, Speech, Lang. Process.*

conversion systems based on trellis structured vector quantization, *Proc. of*

and MAP adaptation, *Proc. of Eurospeech*, pp. 2413–2416.

*Methodology)* 39(1): 1–38.

*Process.* 18(5): 954–964.

*Interspeech*, pp. 665–668.

dilemma, *Neural Communication* 4(1): 1–58.

conversion, *Proc. of ICASSP*, pp. 509–512.

speaker identity, *Proc. of Interspeech*, pp. 2665–2668.

in sounds, *Speech Communication* 27(3-4): 187–207.

18(5): 944–953.

16(1): 68–72.

appear in 2011.

Sons, England.

*of ICASSP*, Vol. 1, pp. 285–288.

$$P(\lambda\_{\text{SAT}}, \Lambda\_{\text{SAT}}) = \underset{\lambda, \Lambda}{\arg\max} \, P\left(\mathbf{O} \, | \lambda, \Lambda\right) = \underset{\lambda, \Lambda}{\arg\max} \prod\_{f=1}^{F} P\left(\mathbf{O}^{(f)} | \lambda, \Lambda^{(f)}\right). \tag{23}$$

This differs from the SI training where only the model parameters are estimated during the average voice building. The maximization can be done with Baum-Welch estimation.

#### **4. Concluding remarks**

The research on voice conversion has been fairly active and several important advances have been made on different fronts. In this chapter, we have aimed to provide an overview covering the basics and the most important research directions. Despite the fact that the state-of-the-art VC methods provide fairly successful results, additional research advances are needed to progress further towards providing excellent speech quality and highly successful identity conversion at the same time. Also, the practical limitations in different application scenarios may offer additional challenges to overcome. For example, in many real-world applications, the speech data is noisy, making the training of high-quality conversion models even more difficult.

There is still room for improvement in all sub-areas of voice conversion, both in stand-alone voice conversion and in speaker adaptation in HMM-based speech synthesis. Recently, there has been a trend shift from text-dependent to text-independent use cases. It is likely that the trend will continue and eventually shift towards cross-lingual scenarios required in the attractive application area of speech-to-speech translation. Also, the two sub-areas treated separately in this chapter will be likely to merge at least to some extent, especially when they are needed in hybrid TTS systems (such as (Ling et al., 2007; Silén et al., 2010)) that combine unit selection and HMM-based synthesis.

An interesting and potentially very important future direction of VC research is enhanced parameterization. The current parameterizations often cause problems with synthetic speech quality, both in stand-alone conversion and in HMM-based synthesis, and the currently-used feature sets do not ideally represent the speaker-dependencies. More realistic mimicking of the human speech production could turn out to be crucial. This topic has been touched in (Z-H. Ling et al., 2009), and for example the use of glottal inverse filtering (Raitio et al., 2010) could also provide another initial step to this direction.

#### **5. References**


Abe, M., Nakamura, S., Shikano, K. & Kuwabara, H. (1988). Voice conversion through vector quantization, *Proc. of ICASSP*, pp. 655–658.

Anastasakos, T., McDonough, J. & Schwartz, R. (1996). A compact model for speaker-adaptive training, *Proc. of ICSLP*, pp. 1137–1140.

Arslan, L. (1999). Speaker transformation algorithm using segmental codebooks (STASC), *Speech Communication* 28(3): 211–226.

Benisty, H. & Malah, D. (2011). Voice conversion using GMM with enhanced global variance, *Proc. of Interspeech*, pp. 669–672.

Chapell, D. & Hansen, J. (1998). Speaker-specific pitch contour modelling and modification, *Proc. of ICASSP*, Seattle, pp. 885–888.


Lavner, Y., Rosenhouse, J. & Gath, I. (2001). The prototype model in speaker identification by human listeners, *International Journal of Speech Technology* 4: 63–74.

Lee, C.-H., Lin, C.-H. & Juang, B.-H. (1991). A study on speaker adaptation of the parameters of continuous density hidden Markov models, *IEEE Trans. Signal Process.* 39(4): 806–814.

Leggetter, C. & Woodland, P. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, *Comput. Speech Lang.* 9(2): 171–185.

Ling, Z.-H., Qin, L., Lu, H., Gao, Y., Dai, L.-R., Wang, R.-H., Jiang, Y., Zhao, Z.-W., Yang, J.-H., Chen, J. & Hu, G.-P. (2007). The USTC and iFlytek speech synthesis systems for Blizzard Challenge 2007, *Proc. of Blizzard Challenge Workshop*.

Masuko, T., Tokuda, K., Kobayashi, T. & Imai, S. (1997). Voice characteristics conversion for HMM-based speech synthesis system, *Proc. of ICASSP*, pp. 1611–1614.

McAulay, R. & Quatieri, T. (1986). Speech analysis/synthesis based on a sinusoidal representation, *IEEE Trans. Acoust., Speech, Signal Process.* 34(4): 744–754.

Mesbashi, L., Barreaud, V. & Boeffard, O. (2007). Comparing GMM-based speech transformation systems, *Proc. of Interspeech*, pp. 1989–1456.

Möller, S. (2000). *Assessment and Prediction of Speech Quality in Telecommunications*, Kluwer Academic Publisher.

Narendranath, M., Murthy, H. A., Rajendran, S. & Yegnanarayana, B. (1995). Transformation of formants for voice conversion using artificial neural networks, *Speech Communication* 16(2): 207–216.

Nguyen, B. & Akagi, M. (2008). Phoneme-based spectral voice conversion using temporal decomposition and Gaussian mixture model, *Communications and Electronics, 2008. ICCE 2008. Second International Conference on*, pp. 224–229.

Nurminen, J., Popa, V., Tian, J., Tang, Y. & Kiss, I. (2006). A parametric approach for voice conversion, *Proc. of TC-STAR Workshop on Speech-to-Speech Translation*, pp. 225–229.

Popa, V., Nurminen, J. & Moncef, G. (2011). A study of bilinear models in voice conversion, *Journal of Signal and Information Processing* 2(2): 125–139.

Popa, V., Silen, H., Nurminen, J. & Gabbouj, M. (2012). Local linear transformation for voice conversion, *submitted to ICASSP*.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition, *Proceedings of the IEEE* 77(2): 257–286.

Raitio, T., Suni, A., Yamagishi, J., Pulakka, H., Nurminen, J., Vainio, M. & Alku, P. (2010). HMM-based speech synthesis utilizing glottal inverse filtering, *IEEE Trans. Audio, Speech, Lang. Process.* 19(1): 153–165.

Rentzos, D., Vaseghi, S., Q., Y. & Ho, C.-H. (2004). Voice conversion through transformation of spectral and intonation features, *Proc. of ICASSP*, pp. 21–24.

Shichiri, K., Sawabe, A., Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (2002). Eigenvoices for HMM-based speech synthesis, *Proc. of Interspeech*, pp. 1269–1272.

Shinoda, K. & Watanabe, T. (2000). MDL-based context-dependent subword modeling for speech recognition, *Acoustical Science and Technology* 21(2): 79–86.

Shiohan, O., Myrvoll, T. & Lee, C. (2002). Structural maximum a posteriori linear regression for fast HMM adaptation, *Comput. Speech Lang.* 16(3): 5–24.

Shuang, Z., Bakis, R. & Qin, Y. (2006). Voice conversion based on mapping formants, *TC-STAR Workshop on Speech-to-Speech Translation*, pp. 219–223.

Silén, H., Helander, E., Nurminen, J. & Gabbouj, M. (2009). Parameterization of vocal fry in HMM-based speech synthesis, *Proc. of Interspeech*, pp. 1775–1778.

Silén, H., Helander, E., Nurminen, J., Koppinen, K. & Gabbouj, M. (2010). Using robust Viterbi algorithm and HMM-modeling in unit selection TTS to replace units of poor quality, *Proc. of Interspeech*, pp. 166–169.

Sündermann, D., Höge, H., Bonafonte, A., Ney, H. & Hirschberg, J. (2006). Text-independent cross-language voice conversion, *Proc. of Interspeech*, pp. 2262–2265.

Sündermann, D. & Ney, H. (2003). VTLN-based voice conversion, *Proc. of ISSPIT*, pp. 556–559.

Song, P., Bao, Y., Zhao, L. & Zou, C. (2011). Voice conversion using support vector regression, *Electronics Letters* 47(18): 1045–1046.

Stylianou, Y., Cappe, O. & Moulines, E. (1998). Continuous probabilistic transform for voice conversion, *IEEE Trans. Audio, Speech, Lang. Process.* 6(2): 131–142.

Takahashi, J. & Sagayama, S. (1995). Vector-field-smoothed Bayesian learning for incremental speaker adaptation, *Proc. of ICASSP*, pp. 696–699.

Tamura, M., Masuko, T., Tokuda, K. & Kobayashi, T. (1998). Speaker adaptation for HMM-based speech synthesis system using MLLR, *Proc. of the 3rd ESCA/COCOSDA Workshop on Speech Synthesis*, pp. 273–276.

Tamura, M., Masuko, T., Tokuda, K. & Kobayashi, T. (2001a). Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR, *Proc. of ICASSP*, pp. 805–808.

Tamura, M., Masuko, T., Tokuda, K. & Kobayashi, T. (2001b). Text-to-speech synthesis with arbitrary speaker's voice from average voice, *Proc. of Interspeech*, pp. 345–348.

Tao, J., Zhang, M., Nurminen, J., Tian, J. & Wang, X. (2010). Supervisory data alignment for text-independent voice conversion, *IEEE Trans. Audio, Speech, Lang. Process.* 18(5): 932–943.

Tenenbaum, J. B. & Freeman, W. T. (2000). Separating style and content with bilinear models, *Neural Computation* 12(6): 1247–1283.

Toda, T., Black, A. & Tokuda, K. (2007b). Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory, *IEEE Trans. Audio, Speech, Lang. Process.* 15(8): 2222–2235.

Toda, T., Ohtani, Y. & Shikano, K. (2007a). One-to-many and many-to-one voice conversion based on eigenvoices, *Proc. of ICASSP*, pp. 1249–1252.

Toda, T., Saruwatari, H. & Shikano, K. (2001). Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum, *Proc. of ICASSP*, pp. 841–844.

Toda, T. & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis, *IEICE Trans. Inf. & Syst.* E90-D: 816–824.

Tokuda, K., Kobayashi, T., Masuko, T. & Imai, S. (1994). Mel-generalized cepstral analysis - a unified approach to speech spectral estimation, *Proc. of ICSLP*, pp. 1043–1046.

Tokuda, K., Kobayashi, T., Masuko, T., Kobayashi, T. & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis, *Proc. of ICASSP*, pp. 1315–1318.

Tokuda, K., Oura, K., Hashimoto, K., Shiota, S., Zen, H., Yamagishi, J., Toda, T., Nose, T., Sako, S. & Black, A. W. (2011). HMM-based speech synthesis system (HTS). URL: *http://hts.sp.nitech.ac.jp/*

Tokuda, K., Zen, H. & Black, A. (2002). An HMM-based speech synthesis system applied to English, *Proc. of 2002 IEEE Workshop on Speech Synthesis*, pp. 227–230.


Turk, O. & Arslan, L. (2006). Robust processing techniques for voice conversion, *Computer Speech and Language* 4(20): 441–467.

Wang, Z., Wang, R., Shuang, Z. & Ling, Z. (2004). A novel voice conversion system based on codebook mapping with phoneme-tied weighting, *Proc. of Interspeech*, pp. 1197–1200.

Wang, Y.-P., Ling, Z.-H. & Wang, R.-H. (2005). Emotional speech synthesis based on improved codebook mapping voice conversion, *Proc. of ACII*, pp. 374–381.

Yamagishi, J. (2006). *Average-voice-based speech synthesis*, PhD thesis, Tokyo Institute of Technology.

Yamagishi, J. & Kobayashi, T. (2005). Adaptive training for hidden semi-Markov model, *Proc. of ICASSP*, pp. 365–366.

Yamagishi, J. & Kobayashi, T. (2007). Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training, *IEICE Trans. Inf. & Syst.* E90-D(2): 533–543.

Yamagishi, J., Kobayashi, T., Nakano, Y., Ogata, K. & Isogai, J. (2009). Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm, *IEEE Trans. Audio, Speech, and Lang. Process.* 17(1): 66–83.

Yamagishi, J., Usabaev, B., King, S., Watts, O., Dines, J., Tian, J., Guan, Y., Hu, R., Oura, K., Wu, Y.-J., Tokuda, K., Karhila, R. & Kurimo, M. (2010). Thousands of voices for HMM-based speech synthesis – Analysis and application of TTS systems built on various ASR corpora, *IEEE Trans. Audio, Speech, and Lang. Process.* 18(5): 984–1004.

Yoshimura, T., Masuko, T., Tokuda, K., Kobayashi, T. & Kitamura, T. (1998). Duration modeling for HMM-based speech synthesis, *Proc. of ICSLP*, pp. 29–32.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (1997). Speaker interpolation in HMM-based speech synthesis system, *Proc. of Eurospeech*, pp. 2523–2526.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, *Proc. of Eurospeech*, pp. 2347–2350.

Z-H. Ling, K., Richmond, J. Y. & Wang, R.-H. (2009). Integrating articulatory features into HMM-based parametric speech synthesis, *IEEE Trans. Audio, Speech, Lang. Process.* 17(6): 1171–1185.

Zen, H., Tokuda, K. & Black, A. W. (2009). Statistical parametric speech synthesis, *Speech Commun.* 51(11): 1039–1064.

Zen, H., Tokuda, K., Masuko, T., Kobayashi, T. & Kitamura, T. (2004). Hidden semi-Markov model based speech synthesis, *Proc. of Interspeech*, pp. 1393–1396.

### **Automatic Visual Speech Recognition**

Alin Chiţu¹ and Léon J.M. Rothkrantz¹, ²

> *¹Delft University of Technology ²Netherlands Defence Academy The Netherlands*

#### **1. Introduction**

For many years, lip reading was thought to be specific to hearing-impaired persons and was therefore regarded as one possible solution to an abnormal situation. Even the name of the domain suggests that lip reading was considered a rather artificial way of communication, because it associates lip reading with written language, which is a relatively new cultural phenomenon and not an evolutionarily inherent ability. Extensive lip reading research was initially carried out to improve teaching methodology for hearing-impaired persons and so increase their chances of integration into society. Later, research on human perception, and more precisely on speech perception, showed that lip reading is actively employed to different degrees by all humans, irrespective of their hearing capacity. The best-known study in this respect was performed by Harry McGurk and John MacDonald in 1976. In their experiment the two researchers were trying to understand the perception of speech by children. Their finding, now called the McGurk effect and published in Nature (McGurk & MacDonald, 1976), was that if a person is presented with a video sequence of a certain utterance (in their experiments, 'ga') while at the same time the audio presents a different utterance (in their experiments, 'ba'), then in a large majority of cases the person perceives a third utterance (in this case, 'da'). Subsequent experiments showed that this also holds for longer utterances, and that it is not a particularity of the visual and aural senses but is also true for other perception functions. Lip reading is therefore part of our multi-sensory speech perception process and could be better named visual speech recognition. Because it is an evolutionarily acquired capacity, like speech perception, some scientists consider the neural mechanism of lip reading to be the one that enables humans to achieve high literacy skills with relative ease (van Atteveldt, 2006).

Another source of confusion is the word "lip", because it implies that the lips are the only part of the speaker's face that transmits information about what is being said. The teeth, the tongue and the mouth cavity were shown to be of great importance for lip reading by humans (Williams et al., 1998). Other facial elements have also been shown to be important during face-to-face communication; however, their exact influence has not been completely elucidated. In experiments in which a gaze tracker was used to track which areas of the speaker's face attract attention during communication, it was found that human lip readers focus on several major areas, including the mouth, the eyes and the centre of the face, depending on the task and the noise level (Buchan et al., 2007). In normal situations the listener scans the mouth and the other areas for relatively equal periods of time. However, when the background noise increases, the centre of the face becomes the central point of attention; most probably, peripheral vision becomes extremely active in these situations. When the task was to infer the emotional load of the interlocutor, the listener's gaze shifted towards the eyes, since they convey more emotion-related information. It is well accepted that human lip readers make great use of the context in which the interaction takes place, which may be one of the reasons the human listener scans the entire face during the interaction. Hilder et al. (2009) found that when a human lip reader was presented with appearance information, rather than mouth shapes only, performance increased considerably, from 42.9% to 71.6%.

We should realise that during face-to-face interaction a human engages in a complex process which involves various channels of information corresponding to our senses. In this way the speaker builds up the context using both verbal and non-verbal cues such as body gesture, facial expressions, prosody, and other physiological manifestations. Other information about the setting in which the communication takes place is used as well, together with the knowledge accumulated over time through experience. A human is a multi-modality, multi-sensory, multi-media fusing machine.

The rest of the chapter is organized as follows: section 2 presents relevant research works in the area of lip reading. Section 3 presents the aspects related to building an automated lip reader. Section 4 details the characteristics of the facial model used during the visual analysis of the lip reading process. The next sections illustrate the results and discuss the conclusions of the algorithms presented in the chapter.

#### **2. State of the art in lip reading**

It is about three decades since the automatic lip reading domain emerged in the scientific community. However, only from the 1990s, and more steadily in the second half of that decade, did the subject start to become viable. Even today it still lags speech recognition by some decades. Until some years ago the greatest impediment was the computational power of computers. Nowadays it is the difficulty of finding the most suitable visual features that capture the information related to what is being spoken, together with the hard problem of accurately detecting and tracking the facial elements that convey speech-related information. The automatic and robust detection and tracking of facial elements is still not entirely achieved by current technology. As in other similar visual pattern recognition applications, the two monsters "illumination variations" and "occlusions" are still alive and menacing. A special case of occlusion is here generated by the posture of the speaker.

Therefore, any study concerning lip reading faces the overwhelming task of processing the data corpus manually or, in the best case, semi-automatically. The data corpora for lip reading are still very small, partly because of storage and bandwidth limitations and other recording-related settings, but much more because of the overwhelming task of processing and preparing the data for experiments. Because of these issues, each data corpus is created for a stated recognition task. The lip reading experiments to date are limited to isolated or connected random words, isolated or connected digits, and isolated or connected letters. Some of the reported performance figures are listed below. However, it is very important to keep in mind that, because the data corpus used greatly influences the performance of the lip reader, a comparison among the experiments is not always possible. When the corpora are about the same, a comparison of the different feature types and feature extraction techniques becomes feasible. Even so, the listing can still give an impression of the state of the art in lip reading.

The task of isolated letters was among the first to be analysed, by Petajan et al. (Petajan et al., 1988) back in 1988. The authors report correct recognition close to 90%. However, based on the AVletters data corpus, Matthews et al. (Matthews et al., 1996) report only a 50% recognition rate. Li et al. (Li et al., 1995) report perfect (100%) recognition on the same task, but two years later in (Li et al., 1997) only 90% recognition. The second most popular task is digit recognition, either in isolation or as connected strings. Based on the TULIPS1 data corpus, which contains only the first four digits, Luettin et al. (Luettin et al., 1996) and Luettin and Thacker (Luettin & Thacker, 1997) reported 83.3% and 88.5% recognition rates, respectively. Arsic and Thiran (Arsic & Thiran, 2006) report 81.25% and 89.6% on the same data corpus, depending on the feature extraction method. Other results for the digit recognition task are: Potamianos et al. (Potamianos et al., 1998) reported 95.7%, Dupont and Luettin (Dupont & Luettin, 2000) reported 59.7%, Wojdel reported in his thesis (Wojdel, 2003) 91.1% correct recognition and 81.1% accuracy, Potamianos et al. (Potamianos et al., 2004) reported 63% and Perez et al. (Perez et al., 2005) 47%. Lucey and Potamianos (Lucey & Potamianos, 2006) reported a 74.6% recognition rate for the isolated digits task. Potamianos et al. (Potamianos1998a) report a 64.5% recognition rate for the connected letter task. For the isolated word task Nefian et al. (Nefian et al., 2002) report 66.9%, Zhang et al. (Zhang et al., 2002) report 42% and Kumar et al. (Kumar et al., 2007) report 42.3%. We can conclude that there is still a large variation in the performances obtained, and that no convergence is visible yet, since the newer studies do not necessarily show an increase in accuracy. This is, in our opinion, clearly a sign of the immaturity of the lip reading domain.
Also, as can be observed in the listing above, there are as yet hardly any results of experiments with continuous speech: Potamianos et al. (Potamianos et al., 2004) report an extremely low result on the continuous speech task, namely 12%. The lip reading domain is still young and there are many limiting factors that need to be overcome. Therefore, the experiments in lip reading are still dealing with relatively easy tasks. However, the promising results in these tasks give us hope that larger experiments are possible. As the domain becomes more popular, the number of data corpora will increase, and with better cooperation among scientists it will be possible to compare the achievements more reliably. There are, however, objective factors which limit the performance of lip readers. Nevertheless, as shown in many studies, lip reading can be successfully used in conjunction with speech for an enhanced speech recognition system.
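The spread in the figures quoted above can be made concrete with a few lines of Python. The following is only an illustrative sketch: the numbers are those cited in the text, while the grouping into tasks, the names `REPORTED`, `spread` and `improves_over_time`, and the rounding of "close to 90%" to 90.0 are our own assumptions. Results within a group come from different corpora, so the ranges illustrate variation and are not a like-for-like comparison.

```python
# Recognition rates (%) as quoted in the survey paragraph above.
# The grouping by task and the (study, year, rate) layout are illustrative.
REPORTED = {
    "isolated letters": [
        ("Petajan et al.", 1988, 90.0),  # quoted only as "close to 90%"
        ("Li et al.", 1995, 100.0),
        ("Matthews et al.", 1996, 50.0),
        ("Li et al.", 1997, 90.0),
    ],
    "digits (TULIPS1)": [
        ("Luettin et al.", 1996, 83.3),
        ("Luettin & Thacker", 1997, 88.5),
        ("Arsic & Thiran", 2006, 81.25),
        ("Arsic & Thiran", 2006, 89.6),
    ],
    "digits (other corpora)": [
        ("Potamianos et al.", 1998, 95.7),
        ("Dupont & Luettin", 2000, 59.7),
        ("Wojdel", 2003, 91.1),
        ("Potamianos et al.", 2004, 63.0),
        ("Perez et al.", 2005, 47.0),
        ("Lucey & Potamianos", 2006, 74.6),
    ],
    "isolated words": [
        ("Nefian et al.", 2002, 66.9),
        ("Zhang et al.", 2002, 42.0),
        ("Kumar et al.", 2007, 42.3),
    ],
}


def spread(results):
    """Return (lowest, highest, range) of the reported rates."""
    rates = [rate for _, _, rate in results]
    return min(rates), max(rates), max(rates) - min(rates)


def improves_over_time(results):
    """True only if the rates never decrease when ordered by year."""
    ordered = sorted(results, key=lambda item: item[1])
    rates = [rate for _, _, rate in ordered]
    return all(earlier <= later for earlier, later in zip(rates, rates[1:]))


for task, results in REPORTED.items():
    lowest, highest, width = spread(results)
    trend = "monotone increase" if improves_over_time(results) else "no steady increase"
    print(f"{task}: {lowest:.1f}%..{highest:.1f}% (range {width:.1f}), {trend}")
```

Under these assumptions, none of the groups shows rates that only increase with publication year, which is the pattern the text describes as a lack of convergence.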

#### **3. Building an automatic lip reader: Overview**

Building a lip reader is in many ways similar to building any automatic system that performs an autonomous role in its environment. The first decision to be made before constructing the system concerns the role of the system and the environment in which it will be deployed. After establishing, in pattern recognition jargon, the recognition task, building the system consists of four separate stages: data acquisition, data parametrization, model training and model testing. Figure 1 describes the general process of building a lip reader. These activities are performed in cycles; the larger the cycle, the less frequently its corresponding process is performed.

The data acquisition process should ensure that the resulting corpus correctly describes the distribution of the possible states of the modelled process. The importance of the data parametrization is twofold: it should extract only the relevant information from the data, and it should reduce the dimensionality of the feature space, thereby increasing the tractability of the problem. Training and testing depend on the mathematical models chosen for inference. These range from plain heuristics to complex probabilistic graphical models. The training process should solve two problems: identify the structure of the models, such as the number of parameters and their relations, and compute the values of the models' parameters.

Training and testing are usually performed in a cycle which fine-tunes the structure of the models and the values of the weights in the model. However, the data parametrization step is the one investigated most often, since there are many ways to extract suitable information for the process under study. Choosing the right parametrization is not straightforward, and usually a trial-and-error sequence of experiments is started.

Fig. 1. The activity sequence for building a lip reader

A lip reader, and in general a speech recogniser, is built for a particular target language. The recognition task, namely the size of the vocabulary and the type of utterances accepted, is paramount for the entire design of the system. For instance, while for a small vocabulary (i.e. a few tens of words) one model can be used to recognise one entire word, for larger vocabularies it is more suitable to build sub-word models, i.e., to recognise sub-words directly and build the words and sentences using dictionaries and grammar networks.

So far, the most successful approach for speech recognition, and therefore also applied to lip reading, is the Bayesian approach. In the Bayesian approach, the recognition problem can be formulated as follows: given a set of possible words and an observation sequence $O = O_1, O_2, \ldots, O_n$, the solution of the recognition problem is the word that maximizes the probability $P(W \mid O)$. Based on the Bayesian rule we can write:

$$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)},$$

where $P(O \mid W)$ is the likelihood of the observation given the word $W$ and $P(W)$, usually called the language model, represents the probability of the word $W$. The problem can thus be rewritten as:

$$\hat{W} = \underset{W}{\operatorname{argmax}}\; P(O \mid W)\,P(W),$$

where $\hat{W}$ is the recognized word. In the above equation the denominator $P(O)$ has been dropped since it does not influence the solution. Therefore, the recognition problem is reduced to building a language model $P(W)$ and a word model $P(O \mid W)$ for each legal word.

#### **3.1 On building a data corpus for lip reading: A comparison**

In order to evaluate the results of different solutions to a certain problem, the data corpora used should be shared between researchers, or otherwise there should exist a set of guidelines for building a corpus that all datasets should comply with. When a data corpus is built with the intention of being made public, a greater level of reusability is required. In all cases, the first and probably the most important step in building a data corpus is to carefully state the targeted applications of the system that will be trained using the dataset. Some of the most cited data corpora for lip reading are: TULIPS1 (Movellan, 1995), AVletters (Matthews et al., 1996), AVOZES (Goecke & Millar, 2004), CUAVE (Patterson et al., 2002), DAVID (Chibelushi et al., 1996), ViaVoice (Neti et al., 2000), DUTAVSC (Wojdel et al., 2002), AVICAR (Lee et al., 2004), AT&T (Potamianos et al., 1997), CMU (Zhang et al., 2002), XM2VTSDB (Messer et al., 1999), M2VTS (Pigeon & Vandendorpe, 1997) and LIUM-AVS (Daubias & Deleglise, 2003). With the exception of M2VTS, which is in French, XM2VTSDB, which is in four languages, and DUTAVSC, which is in Dutch, the rest are only in English (Table 1). Since the target language for our research was Dutch, we had only one option, namely the DUTAVSC (Delft University of Technology Audio-Visual Speech Corpus). For reasons that will be explained in the next paragraphs, we decided to build our own data corpus. This corpus was built as an extension to the DUTAVSC and is called NDUTAVSC (Chitu & Rothkrantz, 2009), which stands for "New Delft University of Technology Audio-Visual Speech Corpus".

Some aspects related to the data set preparation are as follows:

- The complexity of audio data recording is much smaller than that of video recording. Therefore, all datasets store the audio signal with sufficiently high accuracy, namely using a sample rate of 22 kHz to 48 kHz and a sample size of 16 bits. Hence, the quality of the audio data is determined not by the storage accuracy but by the recording environment. There are two approaches to the recording environment: specific and neutral. In the first case the database is built with a very narrow application domain in mind, such as speech recognition in the car. In this case the recording environment matches the conditions of the environment where the system will be deployed. This approach can guarantee that the particularities of the target environment are closely matched. The downside of this approach is that the resulting corpus is too dedicated to the problem domain, is prone to overtraining and offers little generalization. In the second approach the dataset is recorded in a controlled, noise free environment. The advantage of this approach is the possibility to adapt the corpus to a specific environment in a post-processing step. Therefore, a data corpus of this kind can be used for virtually any number of applications. The specific noise can be simulated or recorded in the required conditions and later superimposed on the clean audio data.
- In the case of video data recording there is a larger number of important factors that control the success of the resulting data corpus. Not only the environment, but also the equipment used for recording and other settings actively influence the final result. The classification of environments made for audio holds for video as well. The environment where the recordings are made is important since it determines the illumination of the scene and the background of the speakers. In a controlled environment the speaker's background is usually monochrome, so that by using a "colour keying" technique the speaker can be placed in different locations, inducing in this way some degree of visual noise. However, the illumination conditions of different environments are not as easily applied to the clean recordings, since the 3D information is no longer available. In controlled environments the light is reflected by special panels which cast the light uniformly, reducing the artefacts on the speakers' faces.
- The equipment used for recording plays a major role, because the resolution and the sample rate are still a heavy burden. The resolution of the recordings ranges from 100x75 pixels in the TULIPS1 and 80x60 pixels in the AVletters datasets to 720x576 pixels in the AVOZES and CUAVE datasets. The same improvement in quality is also observed in colour fidelity.
- The frame rate of the existing data corpora conforms to one of the colour encoding systems used in broadcast television. Therefore, the video is recorded at 24 Hz, 25 Hz, 29.97 Hz or 30 Hz, depending on the place in the world where the recordings are made. The data corpus used for the current research was recorded at 100 Hz. The Region Of Interest (ROI) is important as well. For lip reading only the lower half of the face is important. However, in case context information is required, a larger area might be needed. Most of the datasets show, however, a passport-like image of the speaker. In our opinion, at least for increasing the performance of the parametrization process, a smaller ROI is more advantageous. Of course, a ROI that is too narrow puts high constraints on the performance of the video camera used, and it might be argued that this does not match real life, where the resulting system will be used. Recording only the mouth area, as is done in the TULIPS1 data set, is a tough goal to achieve in an uncontrolled environment. However, by using a face detection algorithm combined with a face tracking algorithm we could automatically focus and zoom in on the face of the speaker. A small ROI facilitates acquiring much greater detail of the area of interest, in our case the mouth area, while keeping the resolution and, therefore, the bandwidth needs within manageable limits.

Figure 2 shows some examples from six available data corpora. The differences among the examples in this figure are clearly visible: with the exception of the DUTAVSC corpus, all corpora reserve a small number of pixels for the mouth area. Table 2 gives the sizes of the mouth bounding box in all six samples. This low level of detail makes the detection and tracking of the lips much more difficult. Any parametrization that considers a description of the shape of the mouth will be heavily influenced by image degradation. In (Potamianos et al., 1998) the authors report that the degradation of the video signal by image compression or by the addition of white noise does not influence the lip reading performance unless the Signal to Noise Ratio (SNR) falls under some threshold: 50% and 15%, respectively. These findings are reported for features that are a linear transformation of the intensities in the images, namely the discrete wavelet transform.

Fig. 2. The resolution of the ROI in some data corpora available for lip reading

Table 2. Resolution of the mouth area in six known corpora for lip reading.


Table 1. Characteristics of data corpora.
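
For the neutral recording approach described in the list above, the environment-specific noise is superimposed on the clean audio afterwards. The following is a minimal sketch of such a post-processing step, assuming mono signals held in NumPy arrays at the same sample rate and a target SNR expressed in decibels. The function names `mix_at_snr` and `measured_snr_db` are hypothetical and do not come from any of the corpora discussed here.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Superimpose noise on a clean signal so that the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise so that it covers the whole clean recording.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def measured_snr_db(clean, noisy):
    """SNR in dB of a mixture, given the clean reference signal."""
    clean = np.asarray(clean, dtype=np.float64)
    residual = np.asarray(noisy, dtype=np.float64) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

The same clean recording can be mixed with several noise types and SNR levels, which is what makes a neutral corpus reusable across applications.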



#### **3.1.1 Language quality**

By its nature lip reading requires, irrespective of the other qualities, that the data corpus has a good coverage of the language and task vocabulary. Therefore, in the case of a word based recognizer all the words in the vocabulary need to be present in the corpus. In the case of a sub-word recognizer every sub-word item needs to be present in the corpus in all existing contexts, so that the co-articulatory effects appear with a reasonable frequency. However, due to the amount of work necessary and the storage and bandwidth required, most of the data corpora only consider small recognition tasks and a small language corpus. Most frequently the data corpora focus on the digits and letters of the language considered. These are recorded either in isolation, or in short sequences, or, as in DUTAVSC, in the spelling of words. Some corpora even consider only nonsense combinations of vowels (V) and consonants (C) (e.g. DAVID considers VCVCVC sequences, AVOZES repetitions of "ba" and "eo" constructions, AT&T CVC sequences). The continuous speech case is only considered in AVOZES, which contains only 3 phonetically balanced sentences, in AVICAR, which contains ten sentences from the TIMIT (Garofolo, 1988) speech data corpus, in XM2VTSDB and M2VTS, which contain one random sentence, and in DUTAVSC, which contains 80 phonetically rich sentences. Of these, the DUTAVSC is by far the richest data corpus. The NDUTAVSC corpus, which was built as an extension of DUTAVSC, contains more than 2000 unique rich sentences. However, none of the existing corpora can match the language coverage offered by the data corpora used for speech recognition, which can easily have a vocabulary of 100k words (e.g. the Polyphone corpus (Boogaart et al., 1994) contains more than one million recorded words and a vocabulary of 150k words).
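
The coverage requirement for a sub-word recognizer can be checked mechanically on the transcriptions of a corpus. The sketch below is only an illustration under stated assumptions: each utterance is given as a list of sub-word symbols, a context is taken to be the immediate left and right neighbour, and utterance boundaries are marked with a hypothetical `"sil"` symbol. The function name `context_coverage` is likewise an assumption.

```python
from collections import Counter

def context_coverage(transcriptions, inventory, boundary="sil"):
    """Count each sub-word unit and each (left, unit, right) context in a corpus.

    Returns the unit counts, the context counts and the sorted list of
    inventory units that never occur in the transcriptions.
    """
    units = Counter()
    contexts = Counter()
    for sequence in transcriptions:
        # Pad with a boundary symbol so the first and last unit also have a context.
        padded = [boundary] + list(sequence) + [boundary]
        for left, unit, right in zip(padded, padded[1:], padded[2:]):
            units[unit] += 1
            contexts[(left, unit, right)] += 1
    missing = sorted(set(inventory) - set(units))
    return units, contexts, missing
```

Contexts with very low counts indicate co-articulatory effects that the corpus cannot model reliably, and units listed as missing cannot be trained at all.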

#### **3.2 Feature vectors definition**

There are many approaches to data parametrization, but with respect to the feature vector definition they all fit in three broad classes: texture based features, geometric based features, and combinations of texture and geometric features. A good overview of most of the feature extraction methods can be found in (Potamianos et al., 2004). In the first class the feature vectors are composed of pixel intensity values or a transformation of them into some smaller feature space. The main function of the projection is to reduce the dimensionality of the feature space while preserving as much as possible of the most relevant speech related information. Principal Component Analysis (PCA) is one of the first choices, and therefore very popular, and was used in many studies, e.g. (Bregler et al., 1993); (Bregler & Konig, 1994); (Duchnowski et al., 1994); (Li et al., 1995); (Tomlinson et al., 1996); (Chiou & Hwang, 1997); (Gray et al., 1997); (Li et al., 1997); (Luettin & Thacker, 1997); (Potamianos et al., 1998); (Dupont & Luettin, 2000); (Hong et al., 2006). The feature definition is based on the notion of eigenfaces or eigenlips, which represent the eigenvectors of the covariance matrix of the training set. An alternative to PCA, very common as well, is the Discrete Cosine Transform (DCT), such as in (Duchnowski et al., 1995); (Pérez et al., 2005); (Hong et al., 2006); (Lucey & Potamianos, 2006). Linear Discriminant Analysis (LDA), Maximum Likelihood Linear Transform (MLLT) data rotation, Discrete Wavelet Transform and Discrete Walsh Transform (Potamianos et al., 1998) are other methods that fit in this class and were used for lip reading. Virtually any other method, usually borrowed from the data compression domain, which results in a lower dimensionality of the feature vectors can be applied for data parametrization in the lip reading domain.
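
As an illustration of the first class, the PCA projection onto eigenlips can be sketched as follows. This is a minimal sketch, assuming the mouth ROIs are available as equally sized grey-level images in a NumPy array. The function names `fit_eigenlips` and `project` are hypothetical and are not taken from any of the cited studies.

```python
import numpy as np

def fit_eigenlips(images, n_components):
    """Estimate the mean image and the leading principal directions ("eigenlips").

    images: array of shape (n_samples, height, width) holding grey-level mouth ROIs.
    """
    data = np.asarray(images, dtype=np.float64)
    flat = data.reshape(data.shape[0], -1)
    mean = flat.mean(axis=0)
    # The rows of vt are the eigenvectors of the covariance matrix of the
    # centred data, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(image, mean, components):
    """Feature vector of one ROI: its coordinates in the eigenlip basis."""
    return components @ (np.asarray(image, dtype=np.float64).ravel() - mean)
```

The number of retained components controls the trade-off between the dimensionality of the feature vector and the amount of image variance it preserves.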
Local Binary Patterns (LBP) is another technique, borrowed from the texture segmentation domain, which shows promising results for lip reading as well (Morán & Pinto-Elías, 2007); (Zhao et al., 2007); (Kricke et al., 2008). LBP was developed by Timo Ojala and Matti Pietikainen and presented in (Ojala & Pietikainen, 1997). A special place in this class is taken by the feature vectors that are based on Optical Flow Analysis (OFA) (Mase & Pentland, 1991); (Martin, 1995); (Gray et al., 1997); (Fleet et al., 2000); (Iwano et al., 2001); (Tamura et al., 2002); (Furui, 2003); (Yoshinaga et al., 2003); (Yoshinaga et al., 2004); (Tamura et al., 2004); (Chitu et al., 2007); (Chitu & Rothkrantz, 2009). The optical flow is defined as "the apparent velocity field in an image". This definition closely matches the statement of Bregler and Konig in their 1994 paper (Bregler & Konig, 1994): "The real information in lipreading lies in the temporal change of lip positions, rather than the absolute lip shape". The OFA can also be used as a measure of the overall movement and be employed for onset/offset detection. The main advantage of this approach is that it can be easily automated, since it requires only the definition of the Region Of Interest (ROI). The ROI can be considered the bounding box of the face or the bounding box of the mouth, thus requiring some object detection and tracking algorithm. A good example is the face detection algorithm developed by Viola and Jones in (Viola & Jones, 2001). The main disadvantage of this type of features is that the a priori information about lip reading is not inherently used in the process of feature extraction. Therefore, there is minimal control over the information contained in the resulting feature vectors, and over whether this information is relevant for lip reading or not. The exceptions can be the OFA and LBP, where the analysis is usually performed in carefully chosen regions around the mouth. We defined a set of features based on OFA and analyzed the performance of the lip reading system trained on our data corpus.

The features from the second class share the belief that in order to accurately capture the most relevant features with respect to lip reading, a careful description of the contour of the speaker's mouth is needed. The feature extraction proceeds in two steps: first, a number of key points are detected and, based on these points, the mouth contour is recovered; second, the feature vectors are defined based on the shape of the mouth. The detection of the key points is performed based on colour segmentation techniques that identify pixels that are on the lips. Thereafter, the contour of the lips is usually extracted by imposing a lip model on the detected points.
These methods are using the so called "smart snakes" (Lievin et al., 1999); (Luettin & Thacker, 1997); (Salazar et al., 2007), or as called in (Eveno et al., 2004) "jumping snakes", or later on Active Shape Models (ASM) (Luettin et al., 1996); (Prez et al., 2005); (Morn & Pinto-Elas, 2007) or Active Contour Models (ACM). Any other parametric model can be used here. The lips' contour is usually detected as a result of an iterative process which searches to minimise the error between the real contour and the approximation of the contour the parametric model allows for. The actual feature vectors are defined in the second step. The feature vectors fall into two categories here: model based features and mouth high level features. In the first category the feature vectors contain directly the parameters of the models used for describing the mouth contour. In the second category the feature vectors contain measurable quantities, which are meaningful to humans. The most used high level features are mouth height, mouth width, contour perimeter, aperture height, aperture width, aperture area, mouth area, aperture angle and other relations among these (e.g. the ratio between the width and the height) (Chitu & Rothkrantz, 2009); (Goecke et al., 2000a, 2000b); (Kumar et al., 2007); (Matthews et al., 2002); (Yoshinaga et al., 2004).
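As an illustration of the high level features listed above, the following minimal sketch computes a few of them from two closed contours. It assumes that the outer lip contour and the aperture contour are already available as ordered lists of (x, y) points; the function names and the dictionary keys are hypothetical and do not come from any of the cited systems.

```python
import math

def polygon_area(points):
    """Area of a simple closed polygon (shoelace formula)."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0

def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))

def extent(points):
    """Width and height of the axis-aligned bounding box of the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return max(xs) - min(xs), max(ys) - min(ys)

def high_level_features(outer, inner):
    """outer: outer lip contour; inner: mouth aperture contour (assumed inputs)."""
    mouth_w, mouth_h = extent(outer)
    aperture_w, aperture_h = extent(inner)
    return {
        "mouth_width": mouth_w,
        "mouth_height": mouth_h,
        "aperture_width": aperture_w,
        "aperture_height": aperture_h,
        "mouth_area": polygon_area(outer),
        "aperture_area": polygon_area(inner),
        "contour_perimeter": polygon_perimeter(outer),
        # Ratio between width and height; undefined for a degenerate contour.
        "width_height_ratio": mouth_w / mouth_h if mouth_h else float("inf"),
    }
```

Using bounding-box extents for width and height is only one possible choice; distances between specific key points (e.g. the mouth corners) would serve equally well when such points are available.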

In our research we used Statistical Lip Geometry Estimation (LGE), a feature extraction method introduced by Wojdel and Rothkrantz (Wojdel & Rothkrantz, 2000). This method is special because it is a model-free approach for describing the shape of the lips. It strongly depends, however, on the performance of the image segmentation technique used to detect the pixels which belong to the lips.

The third class consists of feature vectors that contain both geometric and texture features. The features from each category are usually concatenated into a larger feature vector. For instance, (Dupont & Luettin, 2000) and (Luettin et al., 1996) combine ASM with PCA features, and (Chiou & Hwang, 1997) combines snake features with PCA. It was shown that the tongue, teeth and cavity have great influence on lip reading (Williams et al., 1998); therefore, the addition of these appearance-related elements has a significant influence on the performance of lip reading (Chitu et al., 2007). A special example is the so-called Active Appearance Model (AAM) (Cootes et al., 1998), which combines the ASM method with texture-based information to accurately detect the shape of the mouth or the face. The search algorithm iteratively adjusts the shape so as to minimise the error between the generated shape and the real shape. The core of AAM is PCA, which is applied three times: on the shape space, on the texture space and on the combined space of shape and texture. The AAM-based features can consist either of the AAM model parameters, in which case we have a combined geometric and texture feature vector, or of high level features computed from the generated shape, in which case we have a geometric feature vector. The lip reading results based on high level feature vectors computed from the lips' shape generated by AAM are given in this chapter.

#### **3.3 Lip reading primitives**

This section introduces the visemes which are the lip reading counterparts of the phonemes.

#### **3.3.1 Phonemes**

In any spoken language a phoneme is the smallest segmental unit of sound which generates a meaningful contrast between utterances. Thus a phoneme is a group of slightly different sounds which are all perceived to have the same function by speakers of the language or dialect in question. An example of a phoneme is the group of /p/ sounds in the words pit, spin and tip. Even though these /p/ sounds are formed differently and are slightly different sounds, they belong to the same phoneme in English because, for an English speaker, interchanging the sounds will not change the meaning of the word, however strange the word will sound. The phones, or sounds, that make up a phoneme are called allophones. A speech recognizer can be built at word level or at sub-word level. While a word level system might be preferred for a small vocabulary recognition task, large vocabulary, continuous speech systems use phonemes as building blocks. Therefore, each phoneme in the target language corresponds to a recognition model in the speech recognizer.

In the Dutch language, approximately 40 distinguishable phonemes are defined. However, phoneme sets can differ slightly from one another as a consequence of the target dialect and of the definition of accepted words. In the present research we used the phoneme set defined in (Damhuis et al., 1994). One problem is generated, for instance, by neologisms. These words are divided into two classes: the ones that are already established in the language (e.g. the words of French origin) and have a stable pronunciation but which contain phonemes that are still under-represented in the language, and a second class of very new words (e.g. the International English words from various technical and economic backgrounds) which bring a set of new phonemes that have no correspondence in Dutch. Table 3 shows the phonemes of the Dutch language as used in the Polyphone corpus. The phonemes are given in International Phonetic Alphabet (IPA), Speech Assessment Methods Phonetic Alphabet (SAMPA), and HTK notations, respectively.

#### **3.3.2 From phonemes to visemes**

Even though the definition of the concept of phoneme crosses the boundary of the auditory realm, and therefore is not bound to any sensory modality, the term "viseme" is used as the counterpart of the phoneme in the visual modality. The term was introduced by Fisher (Fisher, 1968).

Visemes are defined similarly to phonemes: a viseme is a set of indistinguishable phonemes, where the phonemes are indistinguishable from the point of view of the available visual information and not, as in the case of phonemes, from the point of view of their meaning. There are two direct consequences of this definition. Firstly, there is no exact method of deciding the number and composition of the viseme classes; this is actually done either by a theoretical discussion of auditory-visual lip reading of phonemes or by modelling the human ability to recognize the phonemes in the absence of the auditory stimulus, that is, by modelling the degree of confusion of phonemes in the visual modality. Secondly, since there is no one-to-one mapping between the phonetic transcription of an utterance and the corresponding visual transcription, the separability of utterances in the visual modality decreases, which lowers the theoretical performance of a lip reader. The dependence of the visemes on the phonemes can be thought of as one reason why a new term was needed.
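The loss of separability caused by the many-to-one mapping from phonemes to visemes can be made concrete with a small sketch. The viseme class names below (pbm, fvw, td, gkx, a, l) appear in this chapter, but the assignment of individual phonemes to these classes is an assumption made only for illustration, guessed from the class names; it is not the mapping used in the experiments. The toy lexicon is likewise hypothetical.

```python
# Hypothetical, partial phoneme-to-viseme table (illustrative assumption only).
PHONEME_TO_VISEME = {
    "p": "pbm", "b": "pbm", "m": "pbm",
    "f": "fvw", "v": "fvw", "w": "fvw",
    "t": "td", "d": "td",
    "g": "gkx", "k": "gkx", "x": "gkx",
    "a": "a", "l": "l",
}

def to_visemes(phonemes):
    """Map a phoneme transcription to its viseme transcription."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

def confusable_groups(lexicon):
    """Group words whose viseme transcriptions are identical.

    lexicon maps a word to its phoneme transcription (a list of phonemes).
    Only groups with more than one word are returned, i.e. the words that
    cannot be separated in the visual modality under this mapping.
    """
    groups = {}
    for word, phonemes in lexicon.items():
        key = tuple(to_visemes(phonemes))
        groups.setdefault(key, []).append(word)
    return {key: sorted(words) for key, words in groups.items() if len(words) > 1}

# Toy lexicon with made-up transcriptions.
lexicon = {
    "pak": ["p", "a", "k"],
    "bak": ["b", "a", "k"],
    "mat": ["m", "a", "t"],
    "bad": ["b", "a", "d"],
    "fal": ["f", "a", "l"],
}
groups = confusable_groups(lexicon)
```

Under this assumed mapping, four phonetically distinct words collapse into two viseme strings, while the fifth remains unique.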

Unlike for English, to date there is only a limited number of publications which deal with the definition of visemes in Dutch; this is an almost complete list of them: (Breeuwer, 1985), (Corthals, 1984), (Eggermont, 1964), (van Son et al., 1994), (Visser et al., 1999) and (Beun, 1996). The papers (van Son et al., 1994) and (Beun, 1996), cited in (Wojdel, 2003), are the only examples, at least to our knowledge, where the classification of the viseme sets is done by elicitation of the human confusion matrices of phonemes. The authors of (van Son et al., 1994) found in their experiments that Dutch lip readers are only able to recognize four consonantal and four vocalic visemes.

#### **3.3.3 Modelling the visemes using HMM**

As in a sub-word based speech recogniser, the building blocks of our lip reader are the visemes of the Dutch language. Therefore, one HMM corresponds to one viseme. Two special models are added to the set of visemes, namely sp for "short pause" and sil for "silence". The sp model is used for recognition of the short pause between words, while sil is used for the silence moments before and after the utterance. Depending on the recognition task, some visemes do not appear at all in the expected utterances and are, therefore, excluded from the study. This is the case for the digit and letter tasks.

The set of visemes which appear in the digit recognition task is listed in Table 4 and the set of visemes which appear in the letter recognition task is listed in Table 5. The visemes "at" and "a" are only present in the digit set, while the visemes "aa" and "pbm" are only present in the letter set.

The topology of the models used for modelling the visemes, which is also commonly used for phoneme-based speech recognition, is a 3-state left-right topology with no skips, as shown in Figure 3. For implementation reasons, HTK requires that the models start and end with a non-emitting node, which facilitates the generation of recognition networks. A recognition network consists of a string of linked models which are used during recognition by matching to the input utterance.


Table 3. Polyphone's Dutch phoneme set: consonants.


Table 4. The viseme set in HTK working notation for the digit recognition task.


Table 5. The viseme set in HTK working notation for the letter recognition task.

In Figure 3 the numbers on the arcs represent the initial transition probabilities, set before training. Under the emitting states there is a generic drawing of the distribution of the feature vectors, which is approximated by a mixture of Gaussian distributions. The two silence models are introduced in the next section.

Fig. 3. The models used for modelling the visemes. The topology is 5-State Left-Right with three emitting states. The arcs are annotated with transition probabilities
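The viseme topology described above can be written down as a transition matrix. The sketch below builds such a matrix for three emitting states framed by a non-emitting entry and exit state. The self-loop probability of 0.6 is an assumed placeholder, since the initial values annotated in Figure 3 are not reproduced here, and the function name is hypothetical.

```python
def left_right_transitions(n_emitting=3, self_loop=0.6):
    """Transition matrix of a left-right HMM without skips.

    The matrix has n_emitting + 2 rows: index 0 is the non-emitting entry
    state and the last index is the non-emitting exit state.
    """
    n = n_emitting + 2
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0                      # entry state leads to the first emitting state
    for i in range(1, n - 1):
        a[i][i] = self_loop            # remain in the same state
        a[i][i + 1] = 1.0 - self_loop  # advance to the next state; no skips
    # The exit state has no outgoing arcs, so its row stays all zeros.
    return a

matrix = left_right_transitions()
```

With the defaults this yields a 5 by 5 matrix, matching the 5-state topology with three emitting states named in the caption of Figure 3.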

#### **3.3.4 Silence and pause models**

It is not possible to build a continuous speech recognizer without including a model for silence. There are, however, two types of silence: the silence between words, and the silence at the beginning and at the end of the utterance. The silence that covers the entry and exit time of the utterance can be modelled using the same topology as the viseme models (i.e. 3-state left-right topology). However, in order to make the model more robust by allowing the states to absorb more non-verbal mouth movement, the silence model is modified so that a backward transition from state 4 to state 2 is allowed. The model for the short pause is built starting from the model for [sil]. The short pause model is a so-called tee-model and has a single emitting state which is tied to the central state of the [sil] model. This means that the central state of the [sil] model and the emitting state of the [sp] model share the same Gaussian mixture and therefore are trained using the same data. Parameter tying is very often used in speech recognition when there is not sufficient data for training models for similar entities. The topology used for the two silence models is shown in Figure 4.

The silence models defined above are the same as the ones used for speech recognition. However, there is a big difference between the concept of silence in speech recognition and the concept of silence in lip reading. Consequently, noise needs a broader definition in the visual case. For instance, in the case of visual speech the speaker can move his mouth for non-verbal reasons (e.g. to moisten his lips, or to express an emotional state by showing a facial expression). The noise sources are more diverse for lip reading. Even though the silence model has an extra backward arc which should, in principle, also accommodate noise in the training data, we found in our experiments that the silence model defined in this way did not perform at the same level as in the case of speech. As we will see later in the results sections, sometimes the insertion rate was unexpectedly large. This can also be due to poorly trained silence models.
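The two silence models and the tying between them can be sketched as follows. The transition probabilities are assumed placeholders, not the values shown in Figure 4, and the `MixtureState` class is a hypothetical stand-in for the Gaussian mixture of one emitting state. Tying is represented simply by letting both models refer to the same object, so that any update made through one model is seen by the other.

```python
from dataclasses import dataclass, field

@dataclass
class MixtureState:
    """Hypothetical container for the output distribution of one emitting state."""
    weights: list = field(default_factory=lambda: [1.0])
    means: list = field(default_factory=lambda: [[0.0]])
    variances: list = field(default_factory=lambda: [[1.0]])

def sil_transitions(stay=0.6, back=0.2):
    """5-state silence model; states are numbered 1..5, indices are 0..4."""
    a = [[0.0] * 5 for _ in range(5)]
    a[0][1] = 1.0
    a[1][1], a[1][2] = stay, 1.0 - stay
    a[2][2], a[2][3] = stay, 1.0 - stay
    a[3][1] = back                 # extra backward arc: state 4 -> state 2
    a[3][3] = stay
    a[3][4] = 1.0 - stay - back    # leave the model
    return a

def sp_transitions(stay=0.6, skip=0.3):
    """3-state tee-model: the entry state can jump directly to the exit state."""
    a = [[0.0] * 3 for _ in range(3)]
    a[0][1], a[0][2] = 1.0 - skip, skip
    a[1][1], a[1][2] = stay, 1.0 - stay
    return a

# Emitting states keyed by their 1-based state number.
sil_states = {2: MixtureState(), 3: MixtureState(), 4: MixtureState()}
# The single emitting state of sp is tied to the central state of sil.
sp_states = {2: sil_states[3]}
```

Because the short pause model can be skipped entirely through its entry-to-exit arc, it does not force a pause between every pair of words.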

Fig. 4. The models used for modelling the silence

#### **3.3.5 Modelling the low level context using Tri-visemes**

In order to model the context at the level of the visemes, each viseme is considered in all its possible contexts. Only a one-step context is considered, namely for each viseme only the possible left and right neighbouring visemes; hence the name of the new entity, tri-viseme. The notation for tri-visemes is lf-vis+rt, where "vis" is the viseme in question, "lf" is the left context and "rt" is the right context. For instance, the word nul, with the viseme transcription gkx oyu l, generates the following tri-visemes: gkx+oyu, gkx-oyu+l and oyu-l. The context of each viseme can be built at word level, also called word internal, or at the level of the utterance, called word external. In the first case, for finding all possible contexts of a viseme, only the words in the vocabulary are considered, while in the second case the possible combinations of words can also build the context. It should be noted that sometimes bi-visemes (i.e. viseme contexts containing only the left or the right viseme) are also generated.

For each tri-viseme a new model is built, which makes the number of models explode and makes the data requirements for training a tri-viseme based recognizer many times larger. The major problem with tri-visemes is that some contexts can appear only once (or a very small number of times) in the training data, or can even be absent from the training data, as in the case of trans-word boundary contexts. To solve this problem the parameter tying technique is used. The clustering of possibly similar contexts can be made either by a data-driven approach or by the use of decision trees. Even after the parameter tying, there can still be tri-viseme models which are undertrained.
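The word internal expansion described above can be sketched in a few lines. The function name is hypothetical; the notation follows the lf-vis+rt convention, and the first and last visemes of a word receive only a right or a left context respectively, which yields the bi-visemes mentioned in the text.

```python
def to_trivisemes(visemes):
    """Expand a viseme transcription into word internal tri-visemes.

    Each viseme is rewritten as lf-vis+rt. At the word boundaries the
    missing context is simply omitted, which produces bi-visemes.
    """
    expanded = []
    last = len(visemes) - 1
    for i, vis in enumerate(visemes):
        left = visemes[i - 1] + "-" if i > 0 else ""
        right = "+" + visemes[i + 1] if i < last else ""
        expanded.append(left + vis + right)
    return expanded

# The word "nul" with the viseme transcription gkx oyu l.
nul = to_trivisemes(["gkx", "oyu", "l"])
# nul == ["gkx+oyu", "gkx-oyu+l", "oyu-l"]
```

A word external expansion could reuse the same function on the concatenated transcription of a whole utterance, at the cost of generating many contexts that never occur in the training data.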

#### **3.4 Gaussian mixtures**

The HMM approach assumes that each of the emitting states in the model is described by a continuous density distribution. This distribution is approximated in HTK by a mixture of Gaussian distributions. Building the models in HTK starts with a single Gaussian distribution. In the refining step the number of Gaussian mixture components is increased iteratively by 1 or 2 units until the optimum number of components is obtained. By monitoring the performance change, the optimum number of mixtures can be found. During our experiments we iteratively increased the number of mixtures by one up to a maximum of 32 mixtures. The "magic" number 32 was found to be sufficiently large to cover the optimum number of mixtures in all the experiments.
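The search for the number of mixtures can be sketched as a simple loop. This is a minimal sketch, not the HTK procedure itself: `evaluate` is a hypothetical callable that is assumed to re-estimate the models with the given number of mixture components and to return a performance score (for example, recognition accuracy on held-out data), where higher is better.

```python
def select_num_mixtures(evaluate, max_mixtures=32):
    """Increase the number of mixtures by one and keep the best setting.

    evaluate(n) is an assumed callback: it trains or re-estimates the
    models with n mixture components and returns a score to maximise.
    """
    best_n = 1
    best_score = evaluate(1)
    for n in range(2, max_mixtures + 1):
        score = evaluate(n)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score

# Toy stand-in for a real training run: performance peaks at 7 mixtures.
best_n, best_score = select_num_mixtures(lambda n: -(n - 7) ** 2)
```

Scanning all values up to the cap, rather than stopping at the first drop in performance, avoids being misled by a local dip; the cap of 32 is the value reported in the text.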

#### **4. Facial model for lip reading**

The AAM algorithm iteratively searches for the best fit between a model defined by a set of landmarks and the image being processed. Based on a priori knowledge about the shape of the object, the set of landmarks is defined such that it optimally describes the object. In our case we required that the selected points describe the shape of the mouth in detail, especially capturing the speech related aspects. Therefore, the final model should exactly segment the lips at all moments during speech. After experimenting with different models and analysing the results, followed by long discussions, we decided to use a model composed of 29 points, distributed around the mouth, chin and nose. This model is shown in Figure 5.

For training a model, two to four hundred images were manually processed. In order to obtain reliable results the images were selected such that they cover all the variance in the data. This was achieved in an iterative process. We first started with a random selection of a few tens of images, which were used to build a first model. This model was used for processing until the performance of the model decreased below some visually assessed threshold. The images that were badly processed were added to the training set and a new model was obtained. This process continued until the performance of the model stabilized. In the end we trained a number of models for each speaker in the dataset. For speakers that recorded multiple sessions we trained one model per session.
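The iterative selection of training images can be summarised by the loop below. It is only a schematic sketch: `train_model` and `fit_is_bad` are hypothetical callables standing in for the manual annotation plus AAM training and for the visual assessment of a fit, neither of which is specified in code in this chapter. The seed size and the round limit are likewise assumptions.

```python
import random

def bootstrap_training_set(images, train_model, fit_is_bad,
                           seed_size=30, max_rounds=20, rng=None):
    """Grow the training set with badly fitted images until the model stabilises.

    train_model(list_of_images) -> model      (assumed callback)
    fit_is_bad(model, image) -> bool          (assumed callback)
    """
    rng = rng or random.Random(0)
    # Start from a random selection of a few tens of images.
    training = set(rng.sample(range(len(images)), min(seed_size, len(images))))
    model = train_model([images[i] for i in sorted(training)])
    for _ in range(max_rounds):
        failures = [i for i in range(len(images))
                    if i not in training and fit_is_bad(model, images[i])]
        if not failures:
            break  # no badly processed images remain
        training.update(failures)
        model = train_model([images[i] for i in sorted(training)])
    return model, sorted(training)
```

In practice only a subset of the failures would be annotated in each round, since every added image requires manual work.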

Fig. 5. The AAM model

Even though the process is fairly automated, this was extremely laborious work, since the corpus contains more than 4.3 million frames, and it was split among several people. Each assistant was asked to train a model and supervise the processing of the rest of the frames. Splitting the data among different people makes it more difficult to guarantee the uniformity of the end result over the entire corpus. Therefore, to assure uniformity of the processing we

The first approach towards lip reading and other similar problems was to use as visual features directly the AAM parameters. The other approach is to use the final results of the method, namely the co-ordinates of the landmarks as assigned by the algorithm for the current image. In our research we adopted this latter approach. Based on the position of the landmarks we defined seven high level geometric features. The features are computed as the Euclidean distances and areas between the certain key points that describe the shape of the mouth, namely mouth height and width, mouth aperture width and height, mouth area, aperture area

Figure 8 shows the plots of the feature vectors computed for a random recording of the letter F having the viseme transcription [eeh fvw]. In this case the onset and offset moments of the utterance are clearly visible around the frame 75 and the frame 200 of the video recording, respectively. The onset of the viseme [eeh] is around the frame 80, while the onset of the viseme [fvw] is seen around frame 160. The actual shape of the mouth can be seen in

Fig. 7. The high level geometric features: 1) Outer lip width, 2) Outer lip height, 3) Inner lip width, 4) Inner lip height; 5) Chin to nose distance, 6) Outer lip area, 7) Inner lip area

Figure 9 shows the plots of the feature vectors for seven letters of the alphabet and the digit < 8 > ([a gkx td]). We see that the variability of the features is very high which makes them suitable for the recognition task at hand. We can also remark that, for instance, even though the viseme [aa] is present in the transcription of all letters, A([aa]), H([h aa]) and K([gkx aa]) we can clearly see that there is a slight difference between them with respect to the duration in each instance. This is best visible in the curve showing the height of the mouth, which shows that the duration of the viseme is shorter in the utterance of the letter K and H than in

An interesting result was obtained when visually inspecting the curves described by the feature vectors for all the visemes. By simple visual inspection we found that we could easily distinguish between some of the visemes, which proved that the feature set captures much of the speech related information. Table 6 summarises our findings in this respect. For a simple recognition task such as for instance the recognition of isolated visemes, or even the recognition of isolated digits, based on this table we could use a static classifier such as Support Vector Machines (SVM) (Ganapathiraju, 2002). However, for these types of

and the nose to chin distance. The features are graphically described in Figure 7.

the images shown below the graphs, which are extracted from the video sequence.

**4.2 Defining the feature vectors** 

**4.3 Visual validation of the feature vectors** 

the case of the letter A.

used a strict definition of the landmarks. We defined as well constraints that acted on pairs of landmarks. The rest of this section gives the definition used for the landmarks. Before going to the next paragraphs, we should introduce some anatomical elements on which the definition of the landmarks depends.

#### **4.1 AAM results on the training data**

The AAM process is very fast and very accurate, provided that a good training set was selected. We combined the AAM search scheme with the Viola & Jones mouth detection algorithm (Viola & Jones, 2001), which made it possible to select a very good location for the initial guess. This sped up the search process to real-time performance. The mouth detection was used only in the first few frames of the recording. In the subsequent frames the initial guess was the result of the processing of the previous frame. This approach was very successful both in speeding up the search scheme and in improving the accuracy of the detection. Figure 6 shows the first six most important components in PCA terminology. The mean shape and texture model is shown in the centre row. The top row shows, for each mode, the resulting object after an adjustment of two standard deviations is applied to the corresponding mode. The bottom row shows the result when the adjustment is negative. The first two modes seem to have more control over the vertical and horizontal movement of the mouth, while mode four seems to control the presence of the tongue. However, there is no strict separation between the information controlled by each mode, at least none easily discernible by visual inspection. This model was trained on a set of 440 images, selected in an iterative process. All three models (i.e. appearance, shape and combined models) were truncated at the 95% level. After this truncation, the final combined model had 38 parameters, while the shape model had 11 parameters and the texture model had 120 parameters. The first six modes of the combined model cover 78.65% of the total variation. However, in the case of the shape model the first two modes already cover 82.53% of the total variation, while the first six cover 91.83%.

Fig. 6. Combined shape and appearance statistical model. The images show from left to right the first six most important components in PCA terminology. These modes account for 78.65% of the total variation. Centre row: Mean shape and appearance. Top row: Mean shape and appearance +2σ. Bottom row: Mean shape and appearance -2σ
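The truncation at the 95% level and the ±2σ mode variations shown in Figure 6 can be illustrated with a small PCA sketch. It assumes that the shapes are already aligned and flattened into row vectors; the function names `fit_shape_pca` and `vary_mode` are hypothetical, and the code is not the implementation used for the experiments.

```python
import numpy as np


def fit_shape_pca(shapes: np.ndarray, keep: float = 0.95):
    """PCA on aligned shape vectors, truncated at a cumulative variance level.

    shapes: (n_samples, 2 * n_landmarks) array, one flattened shape per row.
    Returns the mean shape, the retained modes (one per row), their variances
    and the cumulative variance fractions of all modes.
    """
    mean = shapes.mean(axis=0)
    centred = shapes - mean
    # Rows of vt are the principal modes; singular values give the variances.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    variances = s**2 / (shapes.shape[0] - 1)
    cumulative = np.cumsum(variances) / variances.sum()
    # Smallest number of modes whose cumulative variance reaches `keep`.
    n_modes = min(int(np.searchsorted(cumulative, keep)) + 1, len(variances))
    return mean, vt[:n_modes], variances[:n_modes], cumulative


def vary_mode(mean, modes, variances, mode: int, n_sigma: float) -> np.ndarray:
    """Mean shape adjusted by n_sigma standard deviations along one mode."""
    return mean + n_sigma * np.sqrt(variances[mode]) * modes[mode]
```

Calling `vary_mode(mean, modes, variances, 0, +2.0)` and `vary_mode(mean, modes, variances, 0, -2.0)` gives the kind of shapes displayed in the top and bottom rows of Figure 6 for the first mode; the texture and combined models follow the same pattern on different vectors.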


#### **4.2 Defining the feature vectors**

The first approach to lip reading and similar problems is to use the AAM parameters directly as visual features. The other approach is to use the final result of the method, namely the co-ordinates of the landmarks as assigned by the algorithm for the current image. In our research we adopted the latter approach. Based on the positions of the landmarks we defined seven high-level geometric features. The features are computed as Euclidean distances and areas between certain key points that describe the shape of the mouth, namely mouth height and width, mouth aperture height and width, mouth area, aperture area and the nose to chin distance. The features are illustrated in Figure 7.
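As a rough illustration of how such features can be derived from landmark co-ordinates, the sketch below computes the seven values from an outer lip contour, an inner lip contour and two single points. The way the 29 landmarks are grouped into these inputs, and the choice of the extreme contour points as the key points for width and height, are assumptions made for the example; the chapter does not specify them.

```python
import numpy as np


def polygon_area(contour: np.ndarray) -> float:
    """Area of a closed polygon given as (n, 2) vertices (shoelace formula)."""
    x, y = contour[:, 0], contour[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


def extent(contour: np.ndarray):
    """Euclidean width and height between the extreme points of a contour."""
    left = contour[np.argmin(contour[:, 0])]
    right = contour[np.argmax(contour[:, 0])]
    top = contour[np.argmin(contour[:, 1])]
    bottom = contour[np.argmax(contour[:, 1])]
    return float(np.linalg.norm(right - left)), float(np.linalg.norm(bottom - top))


def geometric_features(outer, inner, nose, chin) -> np.ndarray:
    """Seven features: mouth width/height, aperture width/height,
    mouth area, aperture area and nose to chin distance."""
    outer, inner = np.asarray(outer, float), np.asarray(inner, float)
    mouth_w, mouth_h = extent(outer)
    aperture_w, aperture_h = extent(inner)
    nose_chin = float(np.linalg.norm(np.asarray(chin, float) - np.asarray(nose, float)))
    return np.array([mouth_w, mouth_h, aperture_w, aperture_h,
                     polygon_area(outer), polygon_area(inner), nose_chin])
```

Applying `geometric_features` to every frame of a recording yields one seven-dimensional vector per frame, i.e. the kind of time series plotted in the figures of the next section.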

#### **4.3 Visual validation of the feature vectors**

Figure 8 shows the plots of the feature vectors computed for a random recording of the letter F, which has the viseme transcription [eeh fvw]. In this case the onset and offset of the utterance are clearly visible around frame 75 and frame 200 of the video recording, respectively. The onset of the viseme [eeh] is around frame 80, while the onset of the viseme [fvw] is seen around frame 160. The actual shape of the mouth can be seen in the images shown below the graphs, which are extracted from the video sequence.

Fig. 7. The high level geometric features: 1) Outer lip width, 2) Outer lip height, 3) Inner lip width, 4) Inner lip height; 5) Chin to nose distance, 6) Outer lip area, 7) Inner lip area

Figure 9 shows the plots of the feature vectors for seven letters of the alphabet and the digit 8 ([a gkx td]). We see that the variability of the features is very high, which makes them suitable for the recognition task at hand. We can also remark that, even though the viseme [aa] is present in the transcription of all three letters A ([aa]), H ([h aa]) and K ([gkx aa]), there is a slight difference between them with respect to its duration in each instance. This is best visible in the curve showing the height of the mouth, which shows that the duration of the viseme is shorter in the utterances of the letters K and H than in the case of the letter A.

An interesting result was obtained when visually inspecting the curves described by the feature vectors for all the visemes. By simple visual inspection we found that we could easily distinguish between some of the visemes, which suggests that the feature set captures much of the speech-related information. Table 6 summarises our findings in this respect. For a simple recognition task, such as the recognition of isolated visemes or even of isolated digits, we could, based on this table, use a static classifier such as Support Vector Machines (SVM) (Ganapathiraju, 2002). However, for these types of classifiers the features need to be global features because they cannot handle time series. Therefore, generalisation to longer utterances of variable length is not possible.

Fig. 8. The seven features plotted for one recording for the letter F transcribed using the visemes: eeh and fvw


Table 6. Feature patterns per viseme: +) peak -) valley -+) increase +-) decrease.

#### **4.4 AAM as ROI detection algorithm**

It is worth mentioning that AAM can also be used as a preprocessing step for defining a more accurate ROI. The ROI defined using a mouth detection algorithm is thereby further improved using the AAM. A more accurate ROI makes the data parametrization process more robust, because the background is better removed and, therefore, there is less noise in the input data.
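One simple way to turn the fitted landmarks into a tighter ROI is to take their bounding box with a small margin. The sketch below assumes landmarks given as (x, y) pixel co-ordinates and an image stored as a NumPy array indexed as [row, column]; the `margin` parameter and both function names are hypothetical.

```python
import numpy as np


def roi_from_landmarks(landmarks, image_shape, margin: float = 0.1):
    """Bounding box (left, top, right, bottom) around the landmarks.

    The box is enlarged by `margin` times its size on each side and clipped
    to the image; right and bottom are exclusive, as in array slicing.
    """
    pts = np.asarray(landmarks, dtype=float)
    height, width = image_shape[:2]
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    pad_x, pad_y = margin * (x1 - x0), margin * (y1 - y0)
    left = max(int(np.floor(x0 - pad_x)), 0)
    top = max(int(np.floor(y0 - pad_y)), 0)
    right = min(int(np.ceil(x1 + pad_x)) + 1, width)
    bottom = min(int(np.ceil(y1 + pad_y)) + 1, height)
    return left, top, right, bottom


def crop(image: np.ndarray, box) -> np.ndarray:
    """Cut the ROI out of the image."""
    left, top, right, bottom = box
    return image[top:bottom, left:right]
```

A polygon mask built from the outer lip contour would remove even more background than a rectangular crop, at the cost of a slightly more involved implementation.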

#### **5. Lip reading results**

The method presented in this section produces for each frame in the corpus a vector with seven entries: mouth width, mouth height, aperture width, aperture height, mouth area, aperture area and the distance between the nose and the chin. We trained and tuned a lip reader based on the HMM approach for each recognition task. In a similar approach, we considered both the case with simple static features (i.e. the seven geometric features) and the case when the feature space was enriched with dynamic information consisting of deltas and accelerations (i.e. making 21-dimensional vectors). We trained systems based on mono-visemes as well as context-aware tri-viseme systems. We used Gaussian mixtures to better describe the feature space and we performed a 10-fold validation in order to increase the confidence in the observed results. The best results obtained were a WRR of 90.32% with a word accuracy of 84.27% for the CD recognition task. In this case, 75% of the sequences were recognized correctly. Figure 10 shows the performance of the best recognizer as a function of the number of Gaussian mixtures used.
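The step from 7 static features to 21-dimensional vectors can be sketched as follows. The chapter does not state how the deltas and accelerations were computed; this example assumes plain finite differences over time (speech toolkits often use a regression window instead), and the function name is hypothetical.

```python
import numpy as np


def add_dynamic_features(static: np.ndarray) -> np.ndarray:
    """Append deltas and accelerations to per-frame static features.

    static: (n_frames, n_features) array with at least two frames.
    Returns an (n_frames, 3 * n_features) array: [static, delta, acceleration].
    """
    # First temporal derivative (central differences, one-sided at the ends).
    delta = np.gradient(static, axis=0)
    # Second temporal derivative, obtained by differentiating the deltas.
    accel = np.gradient(delta, axis=0)
    return np.hstack([static, delta, accel])
```

For a recording of `n` frames with the seven geometric features, the result has shape `(n, 21)`, which matches the dimensionality mentioned above.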

Fig. 9. Feature values plotted for the letters A ([aa]), H ([h aa]), K ([gkx aa]) and Q ([gkx oyu]), I ([ie]), O ([oyu]), IJ ([ei]) and 8 ([a gkx td]). The vectors are scaled using the time variance and centred around their mean
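The two scores reported for the recognizers, WRR and word accuracy (Acc), are not defined in this section. The sketch below assumes the definitions commonly used with HMM toolkits: with N reference words, S substitutions, D deletions and I insertions, WRR = (N − S − D)/N and Acc = (N − S − D − I)/N. The alignment uses unit edit costs, which is also an assumption; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Substitutions, deletions and insertions of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, S, D, I) for aligning ref[:i] with hyp[:j].
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            match = ref[i - 1] == hyp[j - 1]
            diagonal = (e, s, d, ins) if match else (e + 1, s + 1, d, ins)
            e, s, d, ins = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on the edit count first, so this picks a cheapest path.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = cost[n][m]
    return s, d, ins


def recognition_scores(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[float, float]:
    """Return (WRR, Acc) in percent under the assumed definitions."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    s, d, ins = align_counts(ref, hyp)
    n = len(ref)
    return 100.0 * (n - s - d) / n, 100.0 * (n - s - d - ins) / n
```

Under these definitions Acc can never exceed WRR, since insertions are penalised only in Acc; this is consistent with the pair of values reported for the CD task.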

Fig. 10. The WRR and Acc results for the CD recognition task as a function of the number of mixtures. The X axis gives the number of mixtures and the Y axis shows the results obtained. The feature vectors consisted of geometric features computed based on the AAM shape, together with their corresponding deltas and accelerations. The HMM models consisted of intra-word tri-visemes

Fig. 11. The confusion matrices obtained by the best systems in the CD and CL tasks at the viseme level, respectively. a) the confusion matrix for the CD task in the best case. b) the confusion matrix for the CL task in the best case. c) the mean, over the mixture number, confusion matrix for the CD task. d) the mean, over the mixture number, confusion matrix for the CL task

For the GU recognition task we observed a 56% WRR. Using an N-Best approach with the five most probable outcomes did not improve the result, which suggests the system is fairly robust. The 10-fold validation showed an 80.27% mean WRR with a 6% standard deviation, the minimum performance being 74.80% WRR. This shows some instability; however, the minimum is still a very good result. We also tested the results of the recognition at viseme level (i.e. before using the language model to build the corresponding words). This is useful for analysing the degree of confusion between different visemes. Figure 11 shows the confusion matrix for the best case. The mean confusion matrices computed over the mixture number are also displayed. We can remark in these figures that the degree of confusion is relatively small. However, the confusion is greater for visemes defined by larger phoneme sets. This is the case especially for the visemes [oyu] and [gkx], which are very often a source of confusion.

#### **6. Conclusion**

We introduced in this chapter an AAM-based approach for lip reading. The AAM method is in our opinion a valuable tool for lip reading, both as a data parametrization method and as an ROI detection technique. The method can be very robust and generalizes well to unseen faces; however, the training process can be very long before satisfactory results are obtained. Nevertheless, the shape obtained from the search scheme can be used as a starting point for testing other feature types, since it can always function as a background elimination stencil. Based on the shape computed using the AAM search scheme, we defined a set of high-level geometric features. Based on these features we built different lip readers with very good results. These results validate the findings reported in the literature, which showed that the width and the height of the mouth largely capture the content of the spoken utterance (Wojdel, 2003). This also explains why a simple mouth model for lip synchronization, based only on varying the mouth opening in synchrony with the sound output, is so convincing. We did not include in the feature vectors used in this chapter any information that describes the presence of the teeth, tongue or other elements of the mouth. This information was shown in the literature, and also in our other experiments, to be very important for lip reading. We expect that this is the case in the current settings as well. However, we did not include this information here because we wanted to have a clear understanding of the factors that influence the observed results.

#### **7. References**

Arsic, I. & Thiran, J.-P. (2006). *Mutual information eigenlips for audiovisual speech recognition*, In 14th European Signal Processing Conference (EUSIPCO)

Atteveldt, N. van. (2006). *Speech meets script fMRI studies on the integration of letters and speech sounds*, Ph.D. thesis, Universiteit Maastricht

Beun, D. (1996). *Viseme syllable sets*, Master's thesis, Institute of Phonetic Sciences, University of Amsterdam

Boogaart, T.; Bos, L. & Bouer, L. (1994). *Use of the dutch polyphone corpus for application development*, In 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. September

Breeuwer, M. (1985). *Speechreading Supplemented With Auditory Information*, Ph.D. thesis, Free University of Amsterdam

Bregler, C.; Hild, H.; Manke, S. & Waibel, A. (1993). *Improving connected letter recognition by lipreading*, In IEEE International Conference on Acoustics Speech and Signal Processing, vol. 1. Institute of Electrical Engineers Inc (IEE)

Bregler, C. & Konig, Y. (1994). *Eigenlips for robust speech recognition*, In Acoustics, Speech, and Signal Processing, ICASSP-94 IEEE International Conference on

Buchan, J. N.; Pare, M. & Munhall, K. G. (2007). *Spatial statistics of gaze fixations during dynamic face processing*, Social Neuroscience, vol. 2(1), pp. 1-13

Chibelushi, C.; Gandon, S.; Mason, J.; Deravi, F. & Johnston, R. (1996). *Design issues for a digital audio-visual integrated database*, In Integrated Audio-Visual Processing for Recognition, Synthesis and Communication (Digest No: 1996/213), IEE Colloquium on

Chiou, G. I. & Hwang, J. N. (1997). *Lipreading from color video*, IEEE Transactions on Image Processing, vol. 6(8), pp. 1192-1195

Chitu, A. G.; Rothkrantz, L. J. M.; Wiggers, P. & Wojdel, J. C. (2007). *Comparison between different feature extraction techniques for audio-visual speech recognition*, In Journal on Multimodal User Interfaces, vol. 1, no. 1, pp. 7-20, Springer, March

Chitu, A. G. & Rothkrantz, L. J. M. (2009). *The New Delft University of Technology Data Corpus for Audio-Visual Speech Recognition*, In Euromedia'2009, pp. 63-69. April

Chitu, A. G. & Rothkrantz, L. J. M. (2009). *Visual Speech recognition- Automatic System for Lip Reading of Dutch*, In Journal on Information Technologies and Control, vol. year vii, no. 3, pp. 2-9, Simolini-94, Sofia, Bulgaria

Cootes, T.; Edwards, G. & Taylor, C. (1998). *Active appearance models*, In H. Burkhardt and B. Neumann, editors, Proc. European Conference on Computer Vision 1998, vol. 2, pp. 484-498. Springer

Corthals, P. (1984). *Een eenvoudige visementaxonomie voor spraakafzien [a simple viseme taxonomy for lipreading]*, In Tijdscrijf Log en Audio, vol. 14, pp. 126-134

Damhuis, M.; Boogaart, T.; Veld, C.; Versteijlen, M.; Schelvis, W.; Bos, L. & Boves, L. (1994). *Creation and analysis of the dutch polyphone corpus*, In Third International Conference on Spoken Language Processing. ISCA

Daubias, P. & Deleglise, P. (2003). *The lium-avs database: a corpus to test lip segmentation and speechreading systems in natural conditions*, In Eighth European Conference on Speech Communication and Technology

Duchnowski, P.; Meier, U. & Waibel, A. (1994). *See me, hear me: Integrating automatic speech recognition and lip-reading*. Reading, vol. 1(1)

Duchnowski, P.; Hunke, M.; Büsching, D.; Meier, U. & Waibel, A. (1995). *Toward movement-invariant automatic lip-reading and speech recognition*, In International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, vol. 1, pp. 109-112

Dupont, S. & Luettin, J. (2000). *Audio-visual speech modeling for continuous speech recognition*, In IEEE Transactions On Multimedia, vol. 2. September

Eggermont, J. P. M. (1964). *Taalverwerving bij een Groep Dove Kinderen [Language Acquisition in a Group of Deaf Children]*

Eveno, N.; Caplier, A. & Coulon, P.-Y. (2004). *Automatic and accurate lip tracking*, In IEEE Transactions on Circuits and Systems for Video technology, vol. 15, pp. 706-715. May

Fisher, C. G. (1968). *Confusions among visually perceived consonants*, Journal of Speech, Language and Hearing Research, vol. 11(4), p. 796

Fleet, D. J.; Black, M. J.; Yacoob, Y. & Jepson, A. D. (2000). *Design and Use of Linear Models for Image Motion Analysis*, International Journal of Computer Vision, vol. 36(3), pp. 171-193

Furui, S. (2003). *Robust Methods in Automatic Speech Recognition and Understanding*, In EUROSPEECH 2003 - Geneva

Ganapathiraju, A. (2002). *Support vector machines for speech recognition*, Ph.D. thesis, Mississippi State University, Mississippi State, MS, USA, 2002. Major Professor: Picone, Joseph

Garofolo, J. (1988). *Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database*, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA

Goecke, R.; Tran, Q. N.; Millar, J. B.; Zelinsky, A. & Robert-Ribes, J. (2000). *Validation of an automatic lip-tracking algorithm and design of a database for audio-video speech processing*, In Proc. 8th Australian Int. Conf. on Speech Science and Technology SST2000, pp. 92-97

Goecke, R.; Millar, J. B.; Zelinsky, A. & Robert-Ribes, J. (2000). *Automatic extraction of lip feature points*, In Proc. of the Australian Conference on Robotics and Automation ACRA2000, pp. 31-36

Goecke, R. & Millar, J. (2004). *The audio-video australian english speech data corpus avoze*, In Proceedings of the 8th International Conference on Spoken Language Processing ICSLP2004, vol. III, pp. 2525-2528. Jeju, Korea, October

Gray, M. S.; Movellan, J. R. & Sejnowski, T. J. (1997). *Dynamic features for visual speechreading: A systematic comparison*, Advances in Neural Information Processing Systems, vol. 9, pp. 751-757

Hilder, S.; Harvey, R. & Theobald, B. J. (2009). *Comparison of human and machine-based lip-reading*, In B. J. Theobald and R. W. Harvey, editors, AVSP 2009, pp. 86-89. Norwich, September

Hong, X.; Yao, H.; Wan, Y. & Chen, R. (2006). *A PCA based visual DCT feature extraction method for lip-reading*, pp. 321-326

Iwano, K.; Tamura, S. & Furui, S. (2001). *Bimodal Speech Recognition Using Lip Movement Measured By Optical-Flow analysis*, In HSC2001

Kricke, R.; Gernoth, T. & Grigat, R.-R. (2008). *Local binary patterns for lip motion analysis*, In Image Processing 2008, 15th IEEE International Conference on, pp. 1472-1475

Kumar, K.; Chen, T. & Stern, R. M. (2007). *Profile view lip reading*, In Proceedings of the International Conference on Acoustics, Speech and Signal Processing ICASSP, vol. 4, pp. 429-432

Lee, B.; Hasegawa-Johnson, M.; Goudeseune, C.; Kamdar, S.; Borys, S.; Liu, M. & Huang, T. (2004). *Avicar: Audio-visual speech corpus in a car environment*, In INTERSPEECH2004-ICSLP. Jeju Island, Korea, October

Li, N.; Dettmer, S. & Shah, M. (1995). *Lipreading using eigen sequences*, In Proc. International Workshop on Automatic Face- and Gesture-Recognition, pp. 30-34. Zurich, Switzerland

Li, N.; Dettmer, S. & Shah, M. (1997). *Visually recognizing speech using eigensequences*, Motion-based recognition, vol. 1, pp. 345-371

Lievin, M.; Delmas, P.; Coulon, P. Y.; Luthon, F. & Fristot, V. (1999). *Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme*, In IEEE Conference on Multimedia, Computing and Systems, ICMCS99, vol. 1, pp. 691-696. Fiorenza, Italy, June

Potamianos, G.; Cosatto, E.; Graf, H. & Roe, D. (1997). *Speaker independent audio-visual database for bimodal ASR*, In Proc. Europ. Tut. Work. Audio-Visual Speech Proc., Rhodes

Potamianos, G.; Graf, H. P. & Cosatto, E. (1998). *An image transform approach for hmm based automatic lipreading*, In Proc. IEEE International Conference on Image Processing, vol. 1

Potamianos, G.; Neti, C.; Luettin, J. & Matthews, I. (2004). *Audio-visual automatic speech recognition: An overview*, Issues in Visual and Audio-Visual Speech Processing

Pérez, J. F. G.; Frangi, A. F.; Solano, E. L. & Lukas, K. (2005). *Lip reading for robust speech recognition on embedded devices*, In Int. Conf. Acoustics, Speech and Signal Processing, vol. I, pp. 473-476

Salazar, A.; Hernandez, J. & Prieto, F. (2007). *Automatic quantitative mouth shape analysis*, Lecture Notes in Computer Science, vol. 4673, pp. 416-421

Son van, N.; Huiskamp, T. M. I.; Bosman, A. J. & Smoorenburg, G. F. (1994). *Viseme classifications of Dutch consonants and vowels*, The Journal of the Acoustical Society of America, vol. 96

Tamura, S.; Iwano, K. & Furui, S. (2002). *A robust multi-modal speech recognition method using optical-flow analysis*, In Extended summary of IDS02, pp. 2-4. Kloster Irsee, Germany, June

Tamura, S.; Iwano, K. & Furui, S. (2004). *Multi-modal speech recognition using optical-flow analysis for lip images*, Journal VLSI Signal Process Systems, vol. 36(2-3), pp. 117-124

Tomlinson, M. J.; Russell, M. J. & Brooke, N. M. (1996). *Integrating audio and visual information to provide highly robust speech recognition*, In IEEE International Conference on Acoustics Speech and Signal Processing, vol. 2

Viola, P. & Jones, M. (2001). *Robust Real-time Object Detection*, In Second International Workshop On Statistical And Computational Theories Of Vision Modelling, Learning, Computing, And Sampling. Vancouver, Canada, July

Visser, M.; Poel, M. & Nijholt, A. (1999). *Classifying visemes for automatic Lipreading*, Lecture notes in computer science, pp. 349-352

Williams, J. J.; Rutledge, J. C. & Katsaggelos, A. K. (1998). *Frame rate and viseme analysis for multimedia applications to assist speechreading*, Journal of VLSI Signal Processing, vol. 20, pp. 7-23

Wojdel, J. C. & Rothkrantz, L. J. M. (2000). *Visually based speech onset/offset detection*, In Proceedings of 5th Annual Scientific Conference on Web Technology, New Media, Communications and Telematics Theory, Methods, Tools and Application (Euromedia 2000), pp. 156-160. Antwerp, Belgium

Wojdel, J.; Wiggers, P. & Rothkrantz, L. J. M. (2002). *An audio-visual corpus for multimodal speech recognition in dutch language*, In ICSLP, Conference Proceedings of

Wojdel, J. C. (2003). *Automatic Lipreading in the Dutch Language*, Ph.D. thesis, Delft University of Technology, November

Yoshinaga, T.; Tamura, S.; Iwano, K. & Furui, S. (2003). *Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images*, In AVSP2003, pp. 117-120. September

Yoshinaga, T.; Tamura, S.; Iwano, K. & Furui, S. (2004). *Audio-visual speech recognition using new lip features extracted from side-face images*


118 Speech Enhancement, Modeling and Recognition – Algorithms and Applications

Lucey, P. & Potamianos, G. (2006). *Lipreading using profile versus frontal views*, In IEEE

Luettin, J.; Thacker, N. A. & Beet, S. W. (1996). *Statistical lip modelling for visual speech* 

Luettin, J. & Thacker, N. A. (1997). *Speechreading using probabilistic models*, Computer Vision

Martin, A. (1995). *Lipreading by optical flow correlation*, Technical report, Compute Science

Mase, K. & Pentland, A. (1991). *Automatic lipreading by optical-flow analysis*, In Systems and

Matthews, I. A.; Bangham, J. & Cox, S. J. (1996). *Audiovisual speech recognition using multiscale* 

Matthews, I., Cootes, T. F.; Bangham, J. A.; Cox, S. & Harvey, R. (2002). *Extraction of visual* 

Mcgurk, H. & Macdonald, J. (1976). *Hearing lips and seeing voices*, Nature, vol. 264, pp. 746-

Messer, K.; Matas, J.; Kittler, J.; Luettin, J. & Maitre, G. (1999). *XM2VTSDB: The Extended* 

Morn, L. E. L & Pinto-Elas, R. (2007). *Lips shape extraction via active shape model and local binary pattern*. MICAI 2007: Advances in Artificial Intelligence, vol. 4827, pp. 779-788 Movellan, J. R. (1995). *Visual Speech Recognition with Stochastic Networks*, In Advances in Neural Information Processing Systems, vol. 7. MIT Press, Cambridge Nefian, A. V.; Liang, L.; Pi, X.; Liu, X. & Murphy, K. (2002). *Dynamic bayesian networks for* 

Neti, C.; Potamianos, G.; Luettin, J.; Matthews, I.; Glotin, H.; Vergyri, D.; Sison, J.; Mashari,

Ojala, T. & Pietikainen, M. (1997). *Unsupervised texture segmentation using feature distributions*,

Patterson, E.; Gurbuz, S. ; Tufekci, Z. & Gowdy, J. (2002). *CUAVE: A New Audio-Visual* 

IEEE International Conference on Acoustics, Speech, and Signal Processing Petajan, E., Bischoff, B. & Bodoff, D. (1988). *An improved automatic lipreading system to enhance* 

factors in computing systems, pp. 19-25. ACM Press, New York, NY, USA Pigeon, S. & Vandendorpe, L. (1997). *The M2VTS multimodal face database(release 1.00)*,

*nonlinear image decomposition*, In Fourth International Conference on Spoken

*features for lipreading*, In IEEE Transactions on Pattern Analysis and Machine

*M2VTS Database,* In Audio- and Video-based Biometric Person Authentication,

*audio-visual speech recognition*, EURASIP Journal on Applied Signal Processing, vol.

A. & Zhou, J.(2000). *Audio-visual speech recognition*, In Final Workshop 2000 Report,

*Database for Multimodal Human-Computer Interface Research*, In Proceedings of the

*speech recognition,* In CHI '88: Proceedings of the SIGCHI conference on Human

Multimedia Signal Processing Workshop, pp. 24-28

and Image Understanding, vol. 65(2), pp. 163-178

AVBPA'99, pp. 72-77. Washington, D.C., March

Image Analysis and Processing, vol. 1310, pp. 311-318

Lecture Notes in Computer Science, vol. 1206, pp. 403-410

Department University of Central Florida

Computers in Japan, vol. 22, pp. 67-76

Intelligence, vol. 24, pp. 198-213

Italy, June

(EUSIPCO96)

Language Processing

748, December

11, pp. 1274-1288

vol. 764

on Multimedia, Computing and Systems, ICMCS99, vol. 1, pp. 691-696. Fiorenza,

*recognition*, In Proceedings of the 8th European Signal Processing Conference


**7** 



### **Recognition of Emotion from Speech: A Review**

#### S. Ramakrishnan

*Department of Information Technology, Dr. Mahalingam College of Engineering and Technology, Pollachi India* 

#### **1. Introduction**


Emotional speech recognition is an area of great interest for human-computer interaction. The system must be able to recognize the user's emotion and act accordingly. This requires a framework whose modules perform speech-to-text conversion, feature extraction, feature selection and classification of the selected features to identify the emotion. Classification involves training a model for each emotion so that the features can be assigned to the appropriate class. Another important aspect of emotional speech recognition is the database used to train the models. In addition, the features selected for classification must be salient enough to identify the emotions correctly. Integrating all of these modules yields an application that can recognize the user's emotion and pass it to the system as input, so that the system can respond appropriately.

In human interaction, information is exchanged in many ways (speech, body language, facial expressions, etc.). A spoken message in which people express ideas or communicate carries a great deal of information that is interpreted implicitly. This information may be expressed or perceived in the intonation, volume and speed of the voice, among other cues, and it is closely related to the speaker's emotional state. In evolutionary theory, the term "basic" is widely accepted for a small set of emotions. The most popular set of basic emotions is happiness (joy), anger, fear, boredom, sadness and disgust, together with the neutral state. In recent years the recognition of emotions has become a multi-disciplinary research area that has received great interest, and it plays an important role in improving human-machine interaction. Automatic recognition of the speaker's emotional state aims to achieve a more natural interaction between humans and machines. It could also be used to make the computer act according to the user's actual emotion. This is useful in various real-life applications, such as emotion detection using a corpus of agent-client spoken dialogues from a medical emergency call centre, detection of the emotional manifestation of fear in abnormal situations for a security application, support of semi-automatic diagnosis of psychiatric diseases, and detection of the emotional attitudes of children in spontaneous dialogue interactions with computer characters. On the other side of the communication system, progress has also been made in speech synthesis. Bio signals (such as ECG and EEG) and face and body images are an interesting alternative for detecting emotional states. However, the methods needed to record and use these signals are more invasive and complex, and they are impossible in certain real applications. Speech signals are therefore a more feasible option. Standard classifiers obtain good results, but their performance may have reached a limit. Fusion, combination and ensembles of classifiers could represent a new step towards better emotion recognition systems.

This chapter aims to provide a comprehensive review of emotional speech recognition. The chapter is organized as follows. Section 2 describes the frameworks used for speech emotion recognition (SER). Section 3 gives an overview of the types of databases. Section 4 presents the acoustic characteristics of emotions. Section 5 presents feature extraction and classification. Section 6 discusses the applications of emotion recognition. Section 7 presents concluding remarks.

#### **2. Basic framework for emotional recognition**

The inputs are speech signals. Fig. 1 gives the basic framework of emotional speech recognition. The feature extraction script extracts features that represent global statistics. The post-processing step resolves the interface between the feature extraction script and the feature selection technique. Feature selection then eliminates irrelevant features that would lower the recognition rate; it also reduces the input dimensionality and saves computation time. Distribution models such as Gaussian mixture models (GMMs) are trained on the most discriminative features, and the classifiers then distinguish the types of emotion.

Fig. 1. Basic framework of SER
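The chain in Fig. 1 (global-statistics features, feature selection, a distribution model per emotion, classification) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the chapter: the toy pitch contours, the choice of statistics, the Fisher score as the selection criterion, and a single diagonal Gaussian per emotion standing in for a full GMM. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def global_stats(contour):
    """Global statistics of one frame-level contour (e.g. pitch in Hz)."""
    return [mean(contour), pstdev(contour), min(contour), max(contour),
            max(contour) - min(contour)]

def fisher_scores(X, y):
    """Per-feature ratio of between-class to within-class scatter."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall = mean(column)
        between = within = 0.0
        for c in set(y):
            vals = [row[j] for row, label in zip(X, y) if label == c]
            centre = mean(vals)
            between += len(vals) * (centre - overall) ** 2
            within += sum((v - centre) ** 2 for v in vals)
        scores.append(between / (within + 1e-9))
    return scores

def select_features(X, y, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])

def train_gaussians(X, y, keep):
    """One diagonal Gaussian per emotion (a one-component stand-in for a GMM)."""
    models = {}
    for c in set(y):
        rows = [[row[j] for j in keep] for row, label in zip(X, y) if label == c]
        # Adding 1.0 to the deviation is a crude floor for tiny toy samples.
        models[c] = [(mean(col), pstdev(col) + 1.0) for col in zip(*rows)]
    return models

def classify(x, models, keep):
    """Return the emotion whose Gaussian gives the highest log-likelihood."""
    def loglik(params):
        return sum(-math.log(s) - 0.5 * ((x[j] - m) / s) ** 2
                   for j, (m, s) in zip(keep, params))
    return max(models, key=lambda c: loglik(models[c]))

# Toy training data: (pitch contour, emotion label). Purely illustrative.
train = [
    ([220, 260, 300, 280, 240], "anger"),
    ([230, 270, 310, 290, 250], "anger"),
    ([120, 118, 115, 117, 119], "sadness"),
    ([125, 122, 120, 121, 123], "sadness"),
]
X = [global_stats(contour) for contour, _ in train]
y = [label for _, label in train]
keep = select_features(X, y, k=2)
models = train_gaussians(X, y, keep)
print(classify(global_stats([225, 265, 305, 285, 245]), models, keep))
```

Feature selection runs before model training so that the per-emotion models only see the retained dimensions, which mirrors the reduction in input dimensionality described above.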

Bio signals such as ECG, EEG and GSR, as well as face and body images, are an interesting alternative for detecting emotional states. Fig. 2 shows the mechanism of emotion recognition using these bio signals.


Fig. 2. Framework for emotion recognition using EEG, ECG and GSR signals (biosignal, preprocessing, feature vector extraction, emotion classifier, emotion output)

**EEG** is one of the most useful bio signals for detecting the true emotional state of a person. The signal is recorded using electrodes that measure the electrical activity of the brain. The recorded EEG data is first preprocessed to remove serious and obvious motion artifacts. Features are then extracted from the signal using techniques such as the discrete wavelet transform or statistical analysis. Finally, the emotion classifier applies techniques such as Fuzzy C-Means or Quadratic Discriminant Analysis to classify the different emotions.

**ECG** is recorded using an ECG sensor. The signals are preprocessed using a low-pass filter at 100 Hz. Features are then extracted from the preprocessed signal by the continuous wavelet transform (CWT) or the discrete wavelet transform (DWT). Feature selection is done using the Tabu Search (TS) algorithm, the Simba algorithm, etc. The selected features are fed into a classifier (Fisher or K-Nearest Neighbor (KNN)) to identify the type of emotion.
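The ECG chain described above (low-pass filtering, wavelet features, KNN classification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a moving average stands in for the 100 Hz low-pass filter, a single level of the Haar transform stands in for the DWT, sub-band energies are used as features, the feature selection step (Tabu Search, Simba) is omitted, and the signals and labels are invented toy data. The function names are hypothetical.

```python
import math

def moving_average(signal, width=3):
    """Crude low-pass filter; a stand-in for the 100 Hz filter in the text."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def haar_dwt(signal):
    """One level of the Haar DWT: (approximation, detail) coefficients."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2) for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2) for i in pairs]
    return approx, detail

def dwt_features(signal):
    """Energy of the approximation and detail sub-bands after filtering."""
    approx, detail = haar_dwt(moving_average(signal))
    return [sum(a * a for a in approx), sum(d * d for d in detail)]

def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Toy signals with invented labels, for illustration only.
signals = [
    ([1, 1, 1, 1, 1, 1, 1, 1], "calm"),
    ([1, 1, 2, 2, 1, 1, 2, 2], "calm"),
    ([0, 4, 0, 4, 0, 4, 0, 4], "fear"),
    ([0, 5, 0, 5, 0, 5, 0, 5], "fear"),
    ([1, 5, 1, 5, 1, 5, 1, 5], "fear"),
]
train = [(dwt_features(sig), label) for sig, label in signals]
print(knn_predict(train, dwt_features([0, 4, 0, 5, 0, 4, 0, 5])))
```

In practice the sub-band energies would be normalized before the distance computation, since KNN is sensitive to the scale of each feature.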

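Fig. 3. Framework for facial emotion recognition (video input, facial feature tracking system for eyes, eyebrows, furrows and lips, feature vector extraction, emotion classifier, output)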

Fig. 4. Types of databases (acted: DES, EMO-DB, Acho command; induced: ABC, eNTERFACE; natural: AVIC, Smart ROM, SUSAS, VAM)

| S.No | Corpus name | No. of subjects (total, male, female, age, time and days taken) | Nature (acted/natural/induced), purpose, language and mode | Types of emotions | Publicly available (yes/no) and URL |
|---|---|---|---|---|---|
| 1 | ABC (Airplane Behaviour Corpus) | Total = 8; age = 25-48 years; time = 8.4 s/431 clips | Acted; purpose: transport surveillance; German; audio-visual | Aggressive, cheerful, intoxicated, nervous, neutral, tired | No. Source: Detection of Security Related Affect and Behaviour in Passenger Transport |
| 2 | EMO (Berlin Emotional Database) | Total = 10; male = 5; female = 5 | Acted; purpose: general; German; audio | Anger, boredom, disgust, fear, joy, neutral, sadness | Yes. http://pascal.kgw.tu-berlin.de/emodb/docu/#download |
| 3 | SUSAS (Speech Under Simulated and Actual Stress) | Total = 32; male = 19; female = 13; age = 22-76 years | Induced; purpose: aircraft; English; audio | Fear, high stress, medium stress, neutral | http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99S78 |
| 4 | AVIC (Audiovisual Interest Corpus) | Total = 21; male = 11; female = 10 | Natural; English; audio-visual | | http://webcache.googleusercontent.com/search?hl=en&start=10&q=cache:yptULzKJRwJ:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.9121&rep=rep1&type=pdf+audiovisual+interest+speech+database&ct=clnk |

**Galvanic Skin Response (GSR)** is a measure of skin conductivity. There is a correlation between GSR and the arousal state of the body. In a GSR emotion recognition system, the GSR signal is physiologically sensed and features are extracted using Immune Hybrid Particle Swarm Optimization (IH-PSO). The extracted features are classified using a neural network classifier to identify the type of emotion.

In facial emotion recognition, the facial expression of a person is captured on video and fed into a facial feature tracking system. Fig. 3 gives a basic framework for facial emotion recognition. In the facial feature tracking system, tracking algorithms such as wavelets or the dual-view point-based model are applied to track the eyes, eyebrows, furrows and lips and to collect all their possible movements. The extracted features are then fed into a classifier such as Naïve Bayes, tree-augmented Naïve Bayes (TAN) or a hidden Markov model (HMM) to classify the type of emotion.

#### **3. Emotional speech database**

Some criteria are needed to judge how well a given emotional database simulates a real-world environment. According to some studies, the following are the most relevant factors to be considered:

- Real-world emotions or acted ones
- Who utters the emotions
- How to simulate the utterances
- Balanced utterances or unbalanced utterances
- Utterances are uniformly distributed over emotions

Most of the emotional speech databases developed so far are not available for public use, so there are very few benchmark databases that can be shared among researchers. Most of the databases share the following emotions: anger, joy, sadness, surprise, boredom, disgust and neutral.

#### **Types of DB**

Early research on automatic speech emotion recognition (SER) used acted speech; the field is now shifting towards more realistic data. The databases used in SER are classified into three types. Fig. 4 summarizes the types of databases and Table 1 gives a detailed list of speech databases.

Type 1 is acted emotional speech with human labeling. Simulated or acted speech is expressed in a professionally deliberate manner. Such databases are obtained by asking an actor to speak with a predefined emotion, e.g. DES and EMO-DB.

Type 2 is authentic emotional speech with human labeling. Natural speech is simply spontaneous speech in which all emotions are real. These databases come from real-life applications, for example call centers.

Type 3 is elicited emotional speech, in which the emotions are provoked (induced) and self-report, rather than external labeling, is used for labeling control. The elicited speech is neither neutral nor simulated.
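The balance criteria listed at the start of this section (balanced versus unbalanced utterances, and whether utterances are uniformly distributed over emotions) can be checked with a short script. The following is a minimal sketch in Python; the label list is invented for illustration, the max/min count ratio is just one possible imbalance measure, and the function name is hypothetical.

```python
from collections import Counter

def emotion_balance(labels):
    """Return the share of utterances per emotion and the max/min count ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {emotion: n / total for emotion, n in counts.items()}
    imbalance = max(counts.values()) / min(counts.values())
    return shares, imbalance

# Invented corpus labels: 100 utterances over four emotions.
labels = ["anger"] * 40 + ["joy"] * 40 + ["sadness"] * 10 + ["neutral"] * 10
shares, imbalance = emotion_balance(labels)
print(shares)
print(imbalance)
```

A ratio of 1.0 means the utterances are uniformly distributed over the emotions; larger values indicate that some emotions are under-represented, which is common in natural (Type 2) corpora.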


| S.No | Corpus name | No. of subjects (total, male, female, age, time and days taken) | Nature (acted/natural/induced), purpose, language and mode | Types of emotions | Publicly available (yes/no) and URL |
|---|---|---|---|---|---|
| 5 | SAL (Sensitive Artificial Listener) | Total = 4; female = 2; male = 2; time = 20 min/speaker | Natural; purpose: human-computer conversation; English; audio-visual | | No. http://emotion-research.net/toolbox/toolboxdatabase.2006-09-26.5667892524 |
| 6 | Smartkom | Total = 224; time = 4.5 min/person | Natural; purpose: human-computer conversation; German; audio-visual | Neutral, joy, anger, helplessness, pondering, surprise, undefinable | No. www.phonetik.uni-muenchen.de/Bas/BasMultiModaleng.html#SmartKom. Linguistic nature of material: interactive discourse |
| 7 | VAM (Vera-Am-Mittag) | Total = 47 | Natural; German; audio-visual | Valence (negative vs. positive), activation (calm vs. excited) and dominance (weak vs. strong) | No. http://emotion-research.net/download/vam |
| 8 | DES (Danish Emotional Database) | Total = 4; male = 2; female = 2; age = 18-58 years | Acted; purpose: general; Danish; audio | Anger, happiness, neutral, sadness, surprise | Yes. http://universal.elra.info/product\_info.php?products\_id=78 |
| 9 | eNTERFACE | Total = 42; male = 34; female = 8 | Acted; purpose: general; English; audio-visual | Anger, disgust, fear, joy, sadness, surprise | Yes. Source: Learning with synthesized speech for automatic emotion recognition |
| 10 | Groningen, 1996 (ELRA corpus number S0020) | Total = 238 | Acted; Dutch; audio | | No. www.elda.org/catalogue/en/speech/S0020.html. Linguistic nature of material: subjects read two short texts with many quoted sentences to elicit emotional speech |
| 11 | Pereira (Pereira, 2000a,b) | Total = 2 | Acted; English; audio | Anger (hot), anger (cold), happiness, neutrality, sadness | Source: Emotional Speech Recognition: Resources, Features and Methods. Linguistic nature of material: 2 utterances (1 emotionally neutral sentence, 4-digit number), each repeated |
| 12 | Van Bezooijen (Van Bezooijen, 1984) | Total = 8; male = 4; female = 4 | Acted; Dutch; audio | Anger, contempt, disgust, fear, interest, joy, neutrality, sadness, shame, surprise | Linguistic nature of material: 4 semantically neutral phrases |
| 13 | Alter (Alter et al., 2000) | Total = 1 | Acted; German; audio | Anger (cold), happiness, neutrality | Source: Emotional Speech Recognition: Resources, Features and Methods. Linguistic nature of material: 3 sentences, 1 for each emotion (with appropriate content) |
| 14 | Abelin (Abelin and Allwood, 2000) | Total = 1 | Acted; Swedish; audio | Anger, disgust, dominance, fear, joy, sadness, shyness, surprise | Source: A State of Art Review on Emotional Speech Database. Linguistic nature of material: 1 semantically neutral phrase |
| 15 | Polzin (Polzin and Waibel, 2000) | Unspecified number of speakers | Acted; English; audio-visual | Anger, sadness, neutrality (other emotions as well, but in insufficient numbers to be used) | Source: Emotional Speech Recognition: Resources, Features and Methods. Linguistic nature of material: sentence-length segments taken from acted movies |
| 16 | Banse and Scherer (Banse and Scherer, 1996) | Total = 12; male = 6; female = 6 | Induced; German; audio-visual | Anger (hot), anger (cold), anxiety, boredom, contempt, disgust, elation, fear (panic), happiness, interest, pride, sadness, shame | Linguistic nature of material: 2 semantically neutral sentences (nonsense sentences composed of phonemes from Indo-European languages) |
| 17 | Mozziconacci (Mozziconacci, 1998) | Total = 3 | Induced; Dutch; audio | Anger, boredom, fear, disgust, guilt, happiness, haughtiness, indignation, joy, neutrality, rage, sadness, worry | Linguistic nature of material: 8 semantically neutral sentences (each repeated 3 times) |
| 18 | Iriondo et al. (Iriondo et al., 2000) | Total = 8 | Induced; Spanish; audio | Desire, disgust, fury, fear, joy, surprise, sadness | Source: Emotional Speech Recognition: Resources, Features and Methods. Linguistic nature of material: paragraph-length passages (20-40mms each) |
| 19 | McGilloway (McGilloway, 1997; Cowie and Douglas-Cowie, 1996) | Total = 40 | Induced; English; audio | Anger, fear, happiness, neutrality, sadness | Linguistic nature of material: paragraph-length passages |
| 20 | Belfast structured database | Total = 50 | Induced; English; audio | Anger, fear, happiness, neutrality, sadness | Linguistic nature of material: paragraph-length passages written in first person |
| 21 | Amir et al. (Amir et al., 2000) | Total = 61 (60 Hebrew speakers and 1 Russian speaker) | Induced; Hebrew, Russian; audio | Anger, disgust, fear, joy, neutrality, sadness | Linguistic nature of material: non-interactive discourse |
| 22 | Fernandez et al. (Fernandez and Picard, 2000) | Total = 4 | Induced; English; audio | Stress | Linguistic nature of material: numerical answers to mathematical questions |
| 23 | Tolkmitt and Scherer (Tolkmitt and Scherer, 1986) | Total = 60; male = 33; female = 27 | Induced; German; audio | Stress (both cognitive and emotional) | Source: Emotional Speech Recognition: Resources, Features and Methods. Linguistic nature of material: subjects made 3 vocal responses to each slide within a forty-second presentation period, a numerical answer followed by 2 short statements. The start of each was scripted and subjects filled in the blank at the end |
| 24 | Reading-Leeds database (Greasley et al., 1995; Roach et al., 1998) | Time = 264 min | Natural; English; audio | | Source: Automated Extraction Of Annotation Data From The Reading/Leeds Emotional Speech Corpus, Speech Research Laboratory, University of Reading, Reading, RG1 6AA, UK. Linguistic nature of material: unscripted interactive discourse |
| 25 | Belfast natural database (Douglas-Cowie et al., 2000) | Total = 125; male = 31; female = 94 | Natural; English; audio-visual | Wide range | No. http://www.idiap.ch/mmm/corpora/emotion-corpus. Linguistic nature of material: unscripted interactive discourse |

#### **4. Acoustic characteristics of emotions in speech**

Prosodic features such as pitch, intensity, speaking rate and voice quality are important for identifying the different types of emotions. In particular, pitch and intensity seem to be correlated with the amount of energy required to express a certain emotion. In a state of anger, fear or joy, the resulting speech is correspondingly loud, fast and enunciated with strong high-frequency energy, a higher average pitch and a wider pitch range, whereas sadness produces speech that is slow, low-pitched and has little high-frequency energy. Table 2 gives a short overview of the acoustic characteristics of various emotional states.

| Characteristics | Joy | Anger | Sadness | Fear | Disgust |
|---|---|---|---|---|---|
| Pitch mean | High | Very high | Very low | Very high | Very low |
| Pitch range | High | High | Low | High | High (male), low (female) |
| Pitch variance | High | Very high | Low | Very high | Low |
| Pitch contour | Incline | Decline | Decline | Incline | Decline |
| Intensity mean | High | Very high (male), high (female) | Low | Medium/high | Low |
| Intensity range | High | High | Low | High | Low |
| Speaking rate | High | Low (male), high (female) | High (male), low (female) | High | Very low (male), low (female) |
| Transmission durability | Low | Low | High | Low | High |
| Voice quality | Modal/tense | Sometimes breathy; moderately blaring timbre | Resonant timbre | Falsetto | Resonant timbre |

Table 2. Acoustic Characteristics of Emotions

#### **5. Feature extraction and classification**

The collected emotional data usually contain noise from the background and the "hiss" of the recording machine. Noise corrupts the signal and makes feature extraction and classification less accurate, so preprocessing of the speech signal is essential. Preprocessing also reduces the variability.

Normalization is a preprocessing technique that eliminates speaker and recording variability while preserving emotional discrimination. Two types of normalization are generally performed: energy normalization and pitch normalization. Energy normalization: the speech files are scaled such that the average RMS energy of the neutral


Table 1. List of emotional speech databases

| S.No | Corpus name | No. of subjects (total, male, female) | Nature (acted/natural/induced), purpose, language and mode | Types of emotions | Publicly available (yes/no) and URL |
|---|---|---|---|---|---|
| 26 | Geneva Airport Lost Luggage Study (Scherer and Ceschi, 1997, 2000) | Total = 109 | Nature = Natural; Language = Mixed; Mode = Audio-Visual | Anger, Good humour, Indifference, Stress, Sadness | http://www.unige.ch/fapse/emotion/demo/TestAnalyst/GERG/apache/htdocs/index.php; Linguistic nature of material = unscripted interactive discourse |
| 27 | Chung (Chung, 2000) | Total = 77 (61 Korean speakers, 6 American speakers) | Nature = Natural; Language = Korean, English; Mode = Audio-Visual | Joy, Neutrality, Sadness (distress) | Linguistic nature of material = interactive discourse |
| 28 | France et al. (France et al., 2000) | Total = 115; Male = 67; Female = 48 | Nature = Natural; Language = English; Mode = Audio | Depression, Neutrality, Suicidal state | Publicly available = no; http://emotionresearch.net/Members/admin/test/?searchterm=France%20et%20al.(France%20et%20al.,2000); Linguistic nature of material = interactive discourse |
| 29 | Slaney and McRoberts (1998) or Breazeal (2001) | Total = 6 | Nature = Acted; Language = English, Japanese; Purpose = pet robot; Mode = Audio | Joy, Sadness, Anger, Neutrality | Publicly available = no |
| 30 | FAU Aibo Database | Total = 26 children; Male = 13; Female = 13 | Nature = Natural; Language = German; Purpose = pet robot | Anger, Emphatic, Neutral, Positive, and Rest | Publicly available = no; http://www5.cs.fau.de/de/mitarbeiter/steidlstefan/fau-aiboemotion-corpus/ |
| 31 | SALAS database | Total = 20 | Nature = Induced; Language = English; Mode = Audio-Visual | Wide range | Publicly available = no; http://www.image.ntua.gr/ermis/ IST-2000-29319, D09; Linguistic nature of material = interactive discourse |

Table 1. List of emotional speech databases

### **4. Acoustic characteristics of emotions in speech**

Prosodic features such as pitch, intensity, speaking rate and voice quality are important for identifying the different types of emotions. In particular, pitch and intensity seem to be correlated with the amount of energy required to express a certain emotion. When a speaker is in a state of anger, fear or joy, the resulting speech is correspondingly loud, fast and enunciated with strong high-frequency energy, a higher average pitch, and a wider pitch range, whereas sadness produces speech that is slow, low-pitched, and with little high-frequency energy. Table 2 gives a short overview of the acoustic characteristics of various emotional states.



### **5. Feature extraction and classification**

The collected emotional data usually contain noise from the background and from the "hiss" of the recording machine. Noise corrupts the signal and makes feature extraction and classification less accurate. Preprocessing of the speech signal is therefore essential. Preprocessing also reduces variability.

Normalization is a preprocessing technique that eliminates speaker and recording variability while preserving emotional discrimination. Two types of normalization are generally performed: energy normalization and pitch normalization. Energy normalization: the speech files are scaled such that the average RMS energy of the neutral reference database and the neutral subset in the emotional databases are the same for each speaker. This normalization is applied separately for each subject in each database. Its goal is to compensate for different recording settings among the databases. Pitch normalization: the pitch contour is normalized for each subject (speaker-dependent normalization). The average pitch across speakers in the neutral reference database is estimated. Then, the average pitch value for the neutral set of the emotional databases is estimated for each speaker.

Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. When analyzing complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, and may cause a classification algorithm to overfit the training sample and generalize poorly to new samples. Feature extraction is a general term for methods of constructing combinations of the variables to get around these problems while still describing the data with sufficient accuracy.

Although significant advances have been made in speech recognition technology, designing a speech recognition system for speaker-independent, continuous speech remains a difficult problem. One of the fundamental questions is whether all of the information necessary to distinguish words is preserved during the feature extraction stage. If vital information is lost during this stage, the performance of the following classification stage is inherently crippled and can never measure up to human capability. Typically, in speech recognition, speech signals are divided into frames and features are extracted from each frame. During feature extraction, speech signals are converted into a sequence of feature vectors, which are then passed to the classification stage. For example, in the case of dynamic time warping (DTW), this sequence of feature vectors is compared with the reference data set. In the case of hidden Markov models (HMM), vector quantization may be applied to the feature vectors, which can be viewed as a further step of feature extraction. In either case, information loss during the transition from speech signals to a sequence of feature vectors must be kept to a minimum. There have been numerous efforts to develop good features for speech recognition in various circumstances.

The most common speech characteristics that are extracted are categorized in the following groups:

#### **Frequency characteristics**

- Accent shape: affected by the rate of change of the fundamental frequency.
- Average pitch: a description of how high or low the speaker speaks relative to normal speech.
- Contour slope: describes the tendency of the frequency change over time; it can be rising, falling or level.
- Final lowering: the amount by which the frequency falls at the end of an utterance.
- Pitch range: measures the spread between the maximum and minimum frequency of an utterance.
- Formants: frequency components of human speech.
- MFCC: a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
- Spectral features: measure the slope of the spectrum considered.

#### **Time-related features**

- Speech rate: describes the rate of words or syllables uttered over a unit of time.
- Stress frequency: measures the rate of occurrence of pitch-accented utterances.

#### **Voice quality parameters and energy descriptors**

- Breathiness: measures the aspiration noise in speech.
- Brilliance: describes the dominance of high or low frequencies in the speech.
- Loudness: measures the amplitude of the speech waveform; translates to the energy of an utterance.
- Pause discontinuity: describes the transitions between sound and silence.
- Pitch discontinuity: describes the transitions of the fundamental frequency.
- Voice quality: jitter and shimmer of the glottal pulses of the whole segment.
- Energy: instantaneous values of energy.
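As a rough illustration of how some of the frequency and energy descriptors in this section could be computed, the sketch below summarises a voiced F0 contour and a waveform segment. The function names, the `tail` parameter and the particular definition of final lowering (overall mean minus the mean of the last few frames) are assumptions made for this example; the chapter does not prescribe formulas.

```python
import math


def pitch_descriptors(f0, tail=3):
    """Summarise a voiced F0 contour (Hz, one value per frame).

    `tail` (number of final frames used for final lowering) is an
    illustrative choice, not a value taken from the text.
    """
    n = len(f0)
    mean = sum(f0) / n
    # Pitch range: spread between maximum and minimum frequency.
    pitch_range = max(f0) - min(f0)
    # Contour slope: least-squares line through (frame index, F0).
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (v - mean) for t, v in enumerate(f0))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den if den else 0.0
    # Final lowering (assumed definition): how far the last frames
    # fall below the overall mean.
    last = f0[-tail:]
    final_lowering = mean - sum(last) / len(last)
    return {
        "average_pitch": mean,
        "pitch_range": pitch_range,
        "contour_slope": slope,
        "final_lowering": final_lowering,
    }


def loudness_rms(samples):
    """Root-mean-square amplitude of a waveform segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

A negative `contour_slope` corresponds to a falling contour, a positive one to a rising contour, and a value near zero to a level contour.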


**Durational pause-related features:** The duration features include the chunk length, measured in seconds, and the zero-crossing rate, used to roughly estimate the speaking rate. Pause is obtained as the proportion of non-speech to speech in the signal, as calculated by a voice activity detection algorithm.
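The duration and pause measures just described can be sketched as follows. The simple energy gate used to separate speech from non-speech frames is only a stand-in for a real voice activity detector, and `frame_len` and `threshold` are hypothetical parameters chosen for the example.

```python
def chunk_length_seconds(samples, sample_rate):
    """Length of the chunk in seconds."""
    return len(samples) / sample_rate


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def pause_proportion(samples, frame_len, threshold):
    """Ratio of non-speech frames to speech frames.

    A frame counts as speech when its mean energy exceeds `threshold`;
    this energy gate is an assumed, simplified voice activity detector.
    """
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    speech = sum(
        1 for f in frames if sum(s * s for s in f) / frame_len > threshold
    )
    silence = len(frames) - speech
    return silence / speech if speech else float("inf")
```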

**Zipf features** are used for better rhythm and prosody characterization.

**Hybrid pitch features** combine the outputs of two different speech-signal-based pitch marking algorithms (PMA).

Feature selection determines which features are the most beneficial, because most classifiers are negatively influenced by redundant, correlated or irrelevant features. Thus, to reduce the dimensionality of the input data, a feature selection algorithm such as Sequential Forward Floating Search (SFFS) is used to choose the most significant features of the training data for the given task. Alternatively, a feature reduction algorithm such as principal component analysis (PCA) can be used to encode the main information of the feature space more compactly.
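A minimal sketch of greedy feature selection is given below. It implements plain sequential forward selection only; the floating variant (SFFS) additionally removes previously chosen features when that improves the score, and that step is omitted here. The `score` callback and the toy scoring function are hypothetical; in practice the score would typically be a classifier's validation accuracy on the candidate subset.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily pick `k` feature indices that maximise `score(subset)`.

    `score` is a caller-supplied function; it is a placeholder for
    whatever quality measure is used to compare feature subsets.
    """
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Add the single feature that improves the score the most.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: feature 3 is an exact copy of feature 1, so selecting
# both adds nothing; the usefulness values are made up.
USEFULNESS = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.9}


def toy_score(subset):
    unique = {1 if f == 3 else f for f in subset}
    return sum(USEFULNESS[f] for f in unique)


chosen = sequential_forward_selection(4, toy_score, 2)
```

With this toy score the redundant copy is skipped in favour of a feature that carries new information.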

Most research on SER has concentrated on feature-based and classification-based approaches. Feature-based approaches aim at analyzing speech signals and effectively estimating feature parameters that represent human emotional states. Classification-based approaches focus on designing a classifier to determine distinctive boundaries between emotions. Emotional speech detection also requires the selection of a suitable classifier that allows quick and accurate emotion identification. Currently, the most frequently used classifiers are linear discriminant classifiers (LDC), k-nearest neighbor (k-NN), Gaussian mixture models (GMM), support vector machines (SVM), decision tree algorithms and hidden Markov models (HMMs). Various studies have shown that choosing the appropriate classifier can significantly enhance the overall performance of the system.

The list below gives a brief description of each algorithm:

**LDC**: A linear classifier assigns an object to a class (or group) by making a classification decision based on the value of a linear combination of its feature values. The feature values are usually presented to the system in a vector called a feature vector.
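The decision rule of a linear classifier can be sketched in a few lines. The weights and biases below are made-up numbers for a two-dimensional feature vector (mean pitch in Hz, mean intensity in dB); in a real system they would be estimated from training data.

```python
def linear_score(weights, bias, x):
    """Linear combination of the feature values plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def ldc_classify(classes, x):
    """classes: {label: (weights, bias)}; return the best-scoring label."""
    return max(classes, key=lambda c: linear_score(*classes[c], x))


# Hypothetical parameters, for illustration only.
CLASSES = {
    "anger": ([0.02, 0.05], -8.0),
    "sadness": ([-0.02, -0.05], 8.0),
}
```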


**k-NN**: An instance is classified by locating it in feature space, finding its k nearest neighbors among the training examples, and assigning it the class label held by the majority of those neighbors.
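A minimal k-NN sketch with majority voting follows. The training points are invented (mean pitch in Hz, mean intensity in dB) and Euclidean distance on unscaled features is an assumption; real feature vectors would normally be normalized first.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); query: feature_vector."""
    # Sort the training examples by distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # The majority label among the k neighbours decides the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data, for illustration only.
TRAIN = [
    ((220, 75), "anger"), ((230, 78), "anger"), ((210, 72), "anger"),
    ((120, 55), "sadness"), ((110, 52), "sadness"), ((125, 58), "sadness"),
]
```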

**GMM**: A model of the probability distribution of the features measured in a biometric system, such as vocal-tract-related spectral features in a speaker recognition system. It represents the existence of sub-populations within the overall population by means of a mixture distribution.
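One common way to use GMMs for classification is to keep one mixture per class and pick the class whose mixture gives the observed features the highest likelihood. The sketch below does this for a single scalar feature; the mixture parameters are invented for illustration, whereas in practice they would be fitted to training data (for example with the EM algorithm).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_log_likelihood(xs, components):
    """components: list of (weight, mean, variance); weights sum to 1."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in components))
        for x in xs
    )


def gmm_classify(xs, models):
    """Return the label whose mixture best explains the observations."""
    return max(models, key=lambda name: gmm_log_likelihood(xs, models[name]))


# Hypothetical per-emotion mixtures over frame-level pitch (Hz).
MODELS = {
    "anger": [(0.5, 220.0, 400.0), (0.5, 260.0, 400.0)],
    "sadness": [(0.6, 110.0, 225.0), (0.4, 140.0, 225.0)],
}
```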

**SVM**: A support vector machine is a binary classifier that analyzes data and recognizes patterns; it is used for classification and regression analysis.

**Decision tree algorithms**: These classify by following a decision tree in which leaves represent the classification outcome and branches represent the conjunctions of features that lead to that outcome.

**HMMs**: A hidden Markov model generalizes a mixture model: the hidden variables, which control the mixture component selected for each observation, are related through a Markov process. In emotion recognition, the outputs represent the sequence of speech feature vectors, from which the sequence of states the model passed through can be deduced. The states can correspond to intermediate steps in the expression of an emotion, and each state has a probability distribution over the possible output vectors. The state sequences allow the emotional state being classified to be predicted. This is one of the most commonly used techniques in speech affect detection.
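Deducing the most likely state sequence from a sequence of outputs is usually done with the Viterbi algorithm. The sketch below assumes discrete observations (for example vector-quantized feature vectors); the state names and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a discrete observation sequence."""
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states
    }
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            score[s] = (
                prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o])
            )
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-state model with two quantized observation symbols.
STATES = ["calm", "agitated"]
START = {"calm": 0.6, "agitated": 0.4}
TRANS = {
    "calm": {"calm": 0.7, "agitated": 0.3},
    "agitated": {"calm": 0.4, "agitated": 0.6},
}
EMIT = {
    "calm": {"low": 0.8, "high": 0.2},
    "agitated": {"low": 0.3, "high": 0.7},
}
```

For whole-utterance classification, a typical design trains one such model per emotion and selects the model that assigns the utterance the highest likelihood.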

**BoosTexter**: An iterative algorithm based on the principle of combining many simple and moderately inaccurate rules into a single, highly accurate rule. It focuses on text categorization tasks. An advantage of BoosTexter is that it can deal with both continuous-valued input (e.g., age) and textual input (e.g., a text string).

### **6. Applications**

Emotion detection is a key step in using users' speech and communications as a source of information on their needs, desires, preferences and intentions. By recognizing the emotional content of users' communications, marketers can customize offerings to users more precisely than before. This innovation adds a new dimension to the man-machine interface, with broad potential for marketing as well as consumer products, transportation, medical and therapeutic applications, traffic control and so on.

**Intelligent Tutoring System:** It aims to provide intervention strategies in response to a detected emotional state, with the goal being to keep the student in a positive affect realm to maximize learning potential. The research follows an ethnographic approach in the determination of affective states that naturally occur between students and computers. The multimodal inference component will be evaluated from audio recordings taken during classroom sessions. Further experiments will be conducted to evaluate the affect component and educational impact of the intelligent tutor.


**Lie Detection:** A lie detector helps in deciding whether someone is lying. Such systems are used by agencies such as the Central Bureau of Investigation to identify criminals and by cricket councils to fight corruption. **X13-VSA PRO Voice Lie Detector 3.0.1 PRO** is a fully computerized voice stress analyzer, described as an innovative, advanced and sophisticated software system that allows the truth to be detected instantly.

**Banking:** The ATM will employ speaker recognition and authentication if needed "to ensure higher security level while accessing to confidential data." In other words, the unique deployment of combining speech recognition, speaker recognition and emotion detection is not designed to be spooky or invasive. "It is just one more step forward the creation of humanlike systems that speak to the clients, understand and recognize a speaker". What's different is the incorporation of emotion detection in the enrollment process, which is probably a very good idea if enrollments are going to be conducted without human assistance or supervision. The machine will be able to talk with the prospective enrollee (and later on the client) and will be able to authenticate his or her unique voiceprint while, at the same time, test voice levels for signs of nervousness, anger, or deceit.

**In-Car Board System:** An in-car board system can be provided with information about the emotional state of the driver in order to initiate safety strategies, proactively offer aid, or resolve communication errors according to the driver's emotion.

**Prosody in Dialog System:** We investigate the use of prosody for the detection of frustration and annoyance in natural human-computer dialog. In addition to prosodic features, we examine the contribution of language model information and speaking "style". Results show that a prosodic model can predict whether an utterance is neutral versus "annoyed or frustrated" with an accuracy on par with that of human interlabeler agreement. Accuracy increases when discriminating only "frustrated" from other utterances, and when using only those utterances on which labelers originally agreed. Furthermore, prosodic model accuracy degrades only slightly when using recognized versus true words. Language model features, even if based on true words, are relatively poor predictors of frustration.

**Emotion Recognition in Call Center:** Call centers often have the difficult task of managing customer disputes. Ineffective resolution of these disputes can lead to customer discontent, loss of business and, in extreme cases, general customer unrest in which a large number of customers move to a competitor. It is therefore important for call centers to take note of isolated disputes and to train service representatives to handle disputes in a way that keeps the customer satisfied.

An earlier system was designed to monitor recorded customer messages and provide an emotional assessment for more effective call-back prioritization. However, that system only provided post-call classification and was not designed for real-time support or monitoring. Current systems differ in that they aim to provide a real-time assessment to aid in handling the customer while he or she is speaking. Early warning signs of customer frustration can be detected from pitch contour irregularities, short-time energy changes, and changes in the rate of speech.

**Sorting of Voice Mail:** Voicemail is an electronic system for recording and storing of voice messages for later retrieval by the intended recipient. It would be a potential application to

Recognition of Emotion from Speech: A Review 137

[2] Panagiotis C. Petrantonakis , and Leontios J. Hadjileontiadis, ' Emotion Recognition

[3] Christos A. Frantzidis, Charalampos Bratsas, et al 'On the Classification of Emotional

[4] Yuan-Pin Lin, Chi-Hong Wang, Tzyy-Ping Jung, Tien-Lin Wu, Shyh-Kang Jeng, Jeng-

[5] Meng-Ju Han, Jing-Huai Hsu and Kai-Tai Song, A New Information Fusion Method for

sort the voice mail according to the emotion in the caller's recorded voice. This helps the recipient respond to the caller appropriately.

**Computer Games:** Computer games can be controlled through the emotions in human speech. The computer recognizes the player's emotion from their speech and sets the game level (easy, medium, hard) accordingly. For example, if the speech is aggressive, the level becomes hard; if the player is too relaxed, the level becomes easy. All other emotions fall under the medium level.
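The mapping from recognized emotion to game level described above can be sketched as a simple lookup. This is only an illustrative sketch: the function name `game_level` and the label strings `"aggressive"` and `"relaxed"` are assumptions made for the example, and the emotion label is assumed to come from a separate speech emotion recognizer that is not shown.

```python
# Hypothetical emotion labels; the text only fixes the three levels
# (easy, medium, hard) and the two special cases (aggressive, too relaxed).
HARD_EMOTIONS = {"aggressive"}
EASY_EMOTIONS = {"relaxed"}


def game_level(emotion: str) -> str:
    """Map a recognized emotion label to a game difficulty level."""
    label = emotion.strip().lower()
    if label in HARD_EMOTIONS:
        return "hard"
    if label in EASY_EMOTIONS:
        return "easy"
    # Every other emotion falls under the medium level.
    return "medium"


if __name__ == "__main__":
    for emotion in ("aggressive", "relaxed", "sad"):
        print(emotion, "->", game_level(emotion))
```

Keeping the rule as a lookup over label sets makes it easy to assign further emotions to the easy or hard level without changing the control flow.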

**Diagnostic Tool for Speech Therapists:** A person who diagnoses and treats a variety of speech, voice, and language disorders is called a speech therapist. By understanding and empathizing with the patient's emotional stress and strain, the therapist can learn what the patient is suffering from. The software used for recording and analyzing the entire speech is icSpeech. Speech communication in healthcare allows patients to describe their health condition to the best of their knowledge. In clinical analysis, human emotions are analyzed based on features related to prosody, the vocal tract, and parameters extracted directly from the glottal waveform. Emotional expressions can be inferred from the vocal affect extracted from human speech.

**Robots:** Robots can interact with people and assist them in their daily routines in common places such as homes, supermarkets, hospitals, or offices. To accomplish these tasks, robots should recognize human emotions so that they can provide a friendly environment. Without recognizing emotion, a robot cannot interact with a human in a natural way.

#### **7. Conclusion**

The process of speech emotion detection requires the creation of a reliable database, broad enough to fit every need of its application, as well as the selection of a successful classifier that allows quick and accurate emotion identification. Thirty-one emotional speech databases are reviewed. Each database consists of a corpus of human speech pronounced under different emotional conditions. A basic description of each database and its applications is provided. The most common emotions searched for, in decreasing frequency of appearance, are anger, sadness, happiness, fear, disgust, joy, surprise, and boredom. The complexity of the emotion recognition process increases with the number of emotions and features used within the classifier. It is therefore crucial to select only the most relevant features, both to ensure that the model can successfully identify emotions and to improve performance, which is particularly significant for real-time detection. Over the last decade, SER has shifted from a side issue to a major topic in human-computer interaction and speech processing. SER has potentially wide applications. For example, human-computer interfaces could be made to respond differently according to the emotional state of the user. This could be especially important in situations where speech is the primary mode of interaction with the machine.


### *Edited by S. Ramakrishnan*

This book on speech processing consists of seven chapters written by eminent researchers from Italy, Canada, India, Tunisia, Finland and The Netherlands. The chapters cover important fields in speech processing such as speech enhancement, noise cancellation, multi-resolution spectral analysis, voice conversion, speech recognition and emotion recognition from speech. The chapters contain both survey and original research material in addition to applications. This book will be useful to graduate students, researchers and practicing engineers working in speech processing.

Speech Enhancement, Modeling and Recognition - Algorithms and Applications


Photo by ismagilov / iStock