**Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging**

Dia AbuZeina, Husni Al-Muhtaseb and Moustafa Elshafei


http://dx.doi.org/10.5772/48645

## **1. Introduction**


Speech recognition is often used as the front-end for many natural language processing (NLP) applications. These applications include machine translation, information retrieval and extraction, voice dialing, call routing, speech synthesis, data entry, dictation, control, etc. Thus, much research has been done to improve speech recognition and the related NLP applications. However, speech recognition faces obstacles that should be considered. Pronunciation variation and the misrecognition of small words are two major problems that lead to performance reduction. The pronunciation variation problem can be divided into two parts: within-word variation and cross-word variation. These two types of pronunciation variation have been tackled by many researchers using different approaches. For example, the cross-word problem can be addressed using phonological rules and/or small-word merging. (AbuZeina et al., 2011a) used phonological rules to model cross-word variation for Arabic. For English, (Saon & Padmanabhan, 2001) demonstrated that short words are more frequently misrecognized, and they achieved a statistically significant enhancement using a small-word merging approach.

An automatic speech recognition (ASR) system uses a decoder to perform the actual recognition task. The decoder finds the most likely word sequence for a given utterance using the Viterbi algorithm. The ASR decoding task can be seen as an alignment process between the observed phonemes and the reference phonemes (the dictionary's phonemic transcriptions). Intuitively, to obtain better accuracy in any alignment process, long sequences are preferable to short ones. As such, we expect an enhancement if we merge words (short or long). Therefore, a thorough investigation was performed on Arabic speech to discover suitable merging cases. We found that Arabic speakers usually merge two consecutive words in two common cases: a noun followed by an adjective, and a preposition followed by a word. Even though we believe that other cases exist in Arabic speech, we chose these two cases to validate our proposed method. Among the ASR components, the pronunciation dictionary and the language model were used to model the above-mentioned objective. This means that the acoustic models for the baseline and the enhanced systems are the same.



This research work is conducted on Modern Standard Arabic (MSA), so the work necessarily contains many examples in Arabic. It is therefore appropriate to start by providing a Romanization (Ryding, 2005) of the Arabic letters and diacritical marks. Table 1 shows the Arabic–Roman letters mapping table. The diacritics Fatha, Damma, and Kasra are represented using a, u, and i, respectively.


| Arabic | Roman | Arabic | Roman | Arabic | Roman | Arabic | Roman |
|---|---|---|---|---|---|---|---|
| ء (hamza) | ' | د (daal) | d | ض (Daad) | D | ك (kaaf) | k |
| ب (baa') | b | ذ (dhaal) | dh | ط (Taa') | T | ل (laam) | l |
| ت (taa') | t | ر (raa') | r | ظ (Zaa') | Z | م (miim) | m |
| ث (thaa') | th | ز (zaay) | z | ع ('ayn) | ' | ن (nuun) | n |
| ج (jiim) | j | س (siin) | s | غ (ghayn) | gh | ه (haa') | h |
| ح (Haa') | H | ش (shiin) | sh | ف (faa') | f | و (waaw) | w or u |
| خ (khaa') | kh | ص (Saad) | S | ق (qaaf) | q | ي (yaa') | y or ii |

**Table 1.** Arabic–Roman letters mapping table

To validate the proposed method, we used the Carnegie Mellon University (CMU) Sphinx speech recognition engine. Our baseline system contains a pronunciation dictionary of 14,234 words from a 5.4-hour pronunciation corpus of MSA broadcast news. For tagging, we used the Arabic module of the Stanford tagger. Our results show that part of speech (PoS) tagging is a promising track for enhancing Arabic speech recognition systems.

The rest of this chapter is organized as follows. Section 2 presents the problem statement. Section 3 demonstrates the speech recognition components. In Section 4, we differentiate between within-word and cross-word pronunciation variations, followed by Arabic speech recognition in Section 5. The proposed method is presented in Section 6 and the results in Section 7. The discussion is provided in Section 8. In Section 9, we highlight some future directions. We conclude the work in Section 10.

## **2. Problem statement**

Continuous speech is characterized by the merging of adjacent words, which does not occur in isolated speech. Therefore, handling this phenomenon is a major requirement in continuous speech recognition systems. Even though Hidden Markov Model (HMM) based ASR decoders use triphones to alleviate the negative effects of the cross-word phenomenon, more effort is still needed to model cross-word cases that cannot be handled using triphones. In continuous ASR systems, the dictionary is usually initiated using the corpus transcription words, i.e., each word is considered an independent entity. In this case, cross-word merging in speech will reduce the performance. Two main methods are usually used to model the cross-word problem: phonological rules and small-word merging. Even though phonological rules and small-word merging enhance the performance, we believe that generating compound words is also possible using PoS tagging.


Initially, there are two reasons why cross-word modeling is an effective method in speech recognition systems. First, the speech recognition problem appears as an alignment process; hence, having long sequences is better than short ones, as demonstrated by (Saon & Padmanabhan, 2001). To illustrate the effect of the co-articulation phenomenon (the merging of words in continuous speech), let us examine Figure 1 and Figure 2. Figure 1 shows the words to be considered with no compound words, while Figure 2 shows the words with compound words. In both figures, the hypothesis words are represented using bold black lines. During decoding, the ASR decoder will investigate many words and hypotheses. Intuitively, the ASR decoder will choose a long word instead of two short words. The difference between the two figures is the total number of words that will be considered during the decoding process. Figure 2 shows that the total number of words for the hypotheses is less than the total in Figure 1 (Figure 1 contains 34 words while Figure 2 contains 18 words). Having fewer total words during the decoding process means fewer decoding options (i.e., less ambiguity), which is expected to enhance the performance.

Second, compounding words leads to a more robust language model. The compound words represented in the language model provide better representations of word relations. Therefore, an enhancement is expected, as the correct choice of a word will increase the probability of choosing the correct neighboring words. The effect of compounding words was investigated by (Saon & Padmanabhan, 2001). They mathematically demonstrated that compound words enhance the language model performance, thereby enhancing the overall recognition output. They showed that compound words have the effect of incorporating a trigram dependency in a bigram language model. In general, a compound word is more likely to be correctly recognized than two separate words. Consequently, correct recognition of a word might lead to another correct word through the enhanced N-gram language model. In contrast, misrecognition of a word may lead to another misrecognition in the adjacent words, and so on.
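A one-line sketch shows why this holds: if two words $w_1 w_2$ are merged into a single compound token $w_{12}$, then a bigram that conditions on $w_{12}$ carries two words of history,

$$P(w_3 \mid w_{12}) = P(w_3 \mid w_1, w_2),$$

which is exactly a trigram dependency expressed inside a bigram model.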

For further clarification, we present some cases that show short-word misrecognition and how a long word is more likely to be recognized correctly. Table 2 shows speech files that were tested in the baseline and the enhanced systems. Of course, it is early to show results, but we find it worthwhile to support our motivating claim. In Table 2, it is clear that the misrecognitions occurred mainly in the short words (the highlighted short words were misrecognized in the baseline system).

In this chapter, the most noticeable performance reduction factor in Arabic ASRs, cross-word pronunciation variation, is investigated. To enhance speech recognition accuracy, a knowledge-based technique was utilized to model cross-word pronunciation variation at two ASR components: the pronunciation dictionary and the language model. The proposed knowledge-based method utilizes PoS tagging to compound consecutive words according to their tags. We investigated two pronunciation cases: a noun followed by an adjective, and a preposition followed by a word. The proposed method showed a significant enhancement.
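To make the compounding step concrete, the following minimal Python sketch merges consecutive words according to their PoS tags. The tag names (NN, JJ, IN) and the merge rules are illustrative assumptions using English-style Penn tags; the actual system relies on the Arabic module of the Stanford tagger and its own tag set.

```python
def compound_by_pos(tagged_words):
    """Merge consecutive words into compounds using two PoS rules.

    tagged_words: list of (word, tag) pairs from a PoS tagger.
    Assumed rules: a noun followed by an adjective, and a preposition
    followed by any word, each become one compound token.
    """
    merged = []
    i = 0
    while i < len(tagged_words):
        word, tag = tagged_words[i]
        if i + 1 < len(tagged_words):
            next_word, next_tag = tagged_words[i + 1]
            noun_adj = tag.startswith("NN") and next_tag.startswith("JJ")
            prep_any = tag == "IN"
            if noun_adj or prep_any:
                merged.append(word + "_" + next_word)  # one compound token
                i += 2
                continue
        merged.append(word)
        i += 1
    return merged

# Hypothetical tagged utterance: "fy 'lmubarah 'lniha'iya"
print(compound_by_pos([("fy", "IN"), ("'lmubarah", "NN"), ("'lniha'iya", "JJ")]))
# -> ["fy_'lmubarah", "'lniha'iya"]
```

The merged tokens would then receive their own pronunciation dictionary entries and language model counts, as described above.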

**Figure 1.** A list of hypotheses without compounding words

**Figure 2.** A list of hypotheses with compounding words

| The speech files to be tested | The baseline system results | The enhanced system results |
|---|---|---|
| sayataqabalani wajhan liwajh **fy** 'lmubarah 'lniha'iya | sayataqabalani wajhan liwajh 'lmubarah 'lniha'iya | sayataqabalani wajhan liwajh fy 'lmubarah 'lniha'iya |
| wamumathilyna 'an **'adadin mina** 'lduwali 'l'wrubiya | wamumathilyna 'an 'inna 'lduwali 'l'wrubiya | wamumathilyna 'an 'adadin mina 'lduwali 'l'wrubiya |

**Table 2.** Illustrative cross-word misrecognition results (shown in transliteration; the highlighted short words were misrecognized by the baseline system)


## **3. Speech recognition**

Modern large-vocabulary, speaker-independent, continuous speech recognition systems have three knowledge sources, also called linguistic databases: acoustic models, a language model, and a pronunciation dictionary (also called a lexicon). Acoustic models are the HMMs of the phonemes and triphones (Hwang, 1993). The language model is the module that provides the statistical representations of word sequences based on the transcription of the text corpus. The dictionary is the module that serves as an intermediary between the acoustic model and the language model: it contains the words available in the language and the pronunciation of each word in terms of the phonemes available in the acoustic models.

Figure 3 illustrates the sub-systems usually found in a typical ASR system. In addition to the knowledge sources, an ASR system contains a Front-End module, which converts the input sound into feature vectors usable by the rest of the system. Speech recognition systems usually use feature vectors based on Mel Frequency Cepstral Coefficients (MFCCs) (Rabiner & Juang, 2004).



**Figure 3.** An ASR architecture


The following is a brief introduction to typical ASR system components. The reader can find a more elaborate discussion in (Jurafsky and Martin, 2009).


## **3.1. Front-end**

The purpose of this sub-system is to extract speech features, which play a crucial role in speech recognition performance. Speech features include Linear Predictive Cepstral Coefficients (LPCCs), MFCCs, and Perceptual Linear Predictive (PLP) coefficients. The Sphinx engine used in this work is based on MFCCs.

The feature extraction stage aims to produce the spectral properties (feature vectors) of speech signals. The feature vector consists of 39 coefficients. A speech signal is divided into overlapping short segments that are represented using MFCCs. Figure 4 shows the steps to extract the MFCCs of a speech signal (Rabiner & Juang, 2004). These steps are summarized below.

**Figure 4.** Feature vectors extraction

*Sampling and Quantization*: Sampling and quantization are the two steps of analog-to-digital conversion. The sampling rate is the number of samples taken per second; the rate used in this study is 16,000 samples per second. Quantization is the process of representing real-valued numbers as integers. The analysis window is about 25.6 msec (410 samples), and consecutive frames overlap by 10 msec.

*Preemphasis:* This stage boosts the high-frequency part of the signal that was suppressed during the sound production mechanism, making the information more available to the acoustic model.

*Windowing:* Each analysis window is multiplied by a Hamming window.

*Discrete Fourier Transform*: The goal of this step is to obtain the magnitude frequency response of each frame. The output is a complex number representing the magnitude and phase of the frequency component in the original signal.

*Mel Filter Bank***:** A set of triangular filter banks is used to approximate the frequency resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and logarithmic thereafter. For a 16 kHz sampling rate, the Sphinx engine uses a set of 40 Mel filters.

*Log of the Mel spectrum values:* The range of the values generated by the Mel filter bank is reduced by replacing each value with its natural logarithm. This is done to make the statistical distribution of the spectrum approximately Gaussian.

*Inverse Discrete Fourier Transform***:** This transform is used to compress the spectral information into a set of low-order coefficients called the Mel-cepstrum. Thirteen MFCC coefficients are used as the basic feature vector, $x_t(k),\ k = 0, \ldots, 12$.

*Deltas and Energy***:** For continuous models, the 13 MFCC parameters along with the computed delta and delta-delta parameters are used as a single-stream, 39-parameter feature vector. For semi-continuous models, $x_t(0)$ represents the log Mel spectrum energy and is used separately to derive other feature parameters, in addition to the delta and double-delta parameters. Figure 5 shows part of the feature vector of a speech file after completing the feature extraction process. Each column represents the basic 13 features of a 25.6-millisecond frame.


**Figure 5.** Snapshot of the MFCCs of a speech file
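The steps above can be condensed into a short NumPy sketch. This is a minimal illustration under common textbook defaults (a 0.97 pre-emphasis factor, a 512-point FFT, a DCT as the inverse transform); it is not claimed to match Sphinx's exact implementation.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_ms=25.6, hop_ms=10.0, n_filters=40, n_ceps=13):
    """Illustrative MFCC extraction following the steps of Figure 4."""
    # Preemphasis: boost high frequencies suppressed during sound production.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)          # ~410 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                  # 160 samples (10 ms)
    n_frames = 1 + (len(sig) - frame_len) // hop
    if n_frames < 1:
        return np.empty((0, n_ceps))
    window = np.hamming(frame_len)
    nfft = 512
    # Triangular Mel filter bank: linear below 1000 Hz, logarithmic above.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # DCT matrix: the "inverse DFT" step that yields the Mel-cepstrum.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    feats = []
    for i in range(n_frames):
        frame = sig[i * hop: i * hop + frame_len] * window      # windowing
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2           # DFT
        logmel = np.log(fbank @ power + 1e-10)                  # log Mel energies
        feats.append(dct @ logmel)                              # x_t(k), k = 0..12
    return np.array(feats)   # deltas and delta-deltas are appended separately
```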

## **3.2. Linguistic database**


This part contains the modifications required for a particular language. It contains three components: the acoustic models, the language model, and the pronunciation dictionary. The acoustic models contain the HMMs used in the recognition process. The language model contains the language's words and their combinations, where each combination has two or three words. The pronunciation dictionary contains the words and their pronunciation phonemes.

#### *3.2.1. Acoustic models*

Acoustic models are statistical representations of the speech phones. A precise acoustic model is a key factor in improving recognition accuracy, as it characterizes the HMM of each phone. Sphinx uses 39 English phonemes (The CMU Pronunciation Dictionary, 2011). The acoustic models use a 3- to 5-state Markov chain to represent a speech phone (Lee, 1988). Figure 6 shows a representation of a 3-state phone acoustic model. In Figure 6, S1 represents the phone at the beginning, while S2 and S3 represent the phone at the middle and end states, respectively. Associated with S1, S2, and S3 are the state emission probabilities $b_j(x_t) = P(o_t = x_t \mid S_j)$, representing the probability of observing the feature vector $x_t$ in state $j$. The emission probabilities are usually modeled by Gaussian mixture densities.

**Figure 6.** 3-state phone acoustic model
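As a sketch of how such an emission probability is evaluated, the function below computes $\log b_j(x_t)$ for one state under an assumed diagonal-covariance Gaussian mixture; the parameter shapes are illustrative, not the engine's internal layout.

```python
import numpy as np

def log_emission(x, weights, means, variances):
    """Log of b_j(x): a diagonal-covariance Gaussian mixture density.

    x:         feature vector, shape (39,)
    weights:   mixture weights, shape (M,), summing to 1
    means:     component means, shape (M, 39)
    variances: diagonal variances, shape (M, 39)
    """
    d = x.shape[0]
    # Per-component log Gaussian density.
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over mixture components for numerical stability.
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))
```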

In continuous speech, each phoneme is influenced to different degrees by its neighboring phonemes. Therefore, for better acoustic modeling, Sphinx uses triphones. Triphones are context-dependent models of phonemes; each triphone represents a phoneme surrounded by specific left and right phonemes (Hwang, 1993).
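The sketch below expands a phoneme string into context-dependent triphones using the common left-center+right notation (HTK style); the notation and the 'sil' boundary context are assumptions for illustration, as Sphinx stores triphone contexts in its own internal format.

```python
def to_triphones(phones):
    """Expand a phoneme sequence into context-dependent triphones.

    Sentence boundaries receive a 'sil' context (assumed convention).
    """
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["k", "a", "t"]))
# ['sil-k+a', 'k-a+t', 'a-t+sil']
```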

#### *3.2.2. Language model*

The N-gram language model is trained by counting N-gram occurrences in a large transcription corpus; the counts are then smoothed and normalized. In general, an N-gram language model is used to calculate the probability of a given sequence of words as follows:

$$P(w_1^n) = \prod_{k=1}^{n} P\left(w_k \mid w_1^{k-1}\right)$$

where the history $w_1^{k-1}$ is limited to the preceding words, as in a bigram (two consecutive words), trigram (three consecutive words), 4-gram (four consecutive words), etc. For example, using a bigram model, the probability of a three-word sequence is calculated as follows:

$$P(w_1 w_2 w_3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)$$

The CMU statistical language modeling toolkit is described in (Clarkson & Rosenfeld, 1997); it has been used to generate our Arabic statistical language model. Figure 7 shows the steps for creating and testing the language model. The steps are:

- Compute the word unigram counts.
- Convert the word unigram counts into a vocabulary list.
- Generate bigram and trigram tables based on this vocabulary.
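A toy Python version of these steps (the actual system uses the CMU toolkit) might look as follows; the sentence markers `<s>`/`</s>` and the unsmoothed maximum-likelihood estimate are simplifying assumptions.

```python
from collections import Counter

def train_bigram(corpus_lines):
    """Toy version of the steps above: unigram counts -> vocabulary ->
    bigram table with maximum-likelihood probabilities (no smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        words = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = sorted(unigrams)                       # the vocabulary list
    prob = {(w1, w2): c / unigrams[w1]             # P(w2 | w1)
            for (w1, w2), c in bigrams.items()}
    return vocab, prob

vocab, prob = train_bigram(["the cat sat", "the cat ran"])
print(prob[("the", "cat")])   # 1.0: 'cat' always follows 'the' here
```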



**Figure 7.** Steps for creating and testing language model

The CMU language modeling toolkit comes with a tool for evaluating the language model. The evaluation measures perplexity as an indication of the goodness of the language model. For more information on perplexity, please refer to Section 7.
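For reference, the perplexity of a language model on a test sequence $w_1 \ldots w_N$ is the inverse probability of the text normalized by the number of words:

$$PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\sum_{k=1}^{N} \ln P\left(w_k \mid w_1^{k-1}\right)\right)$$

A lower perplexity indicates a language model that fits the test text better.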

## *3.2.3. Pronunciation dictionary*

Both the training and recognition stages require a pronunciation dictionary, which is a mapping table from words to sequences of phonemes. A pronunciation dictionary is basically designed to be used with a particular set of words: it provides the pronunciation of the vocabulary of the transcription corpus using the defined phoneme set. As with the acoustic models and language model, the performance of a speech recognition system depends critically on the dictionary and the phoneme set used to build it. In the decoding stage, the dictionary serves as an intermediary between the acoustic model and the language model.

There are two types of dictionaries: closed-vocabulary and open-vocabulary. In a closed-vocabulary dictionary, all corpus transcription words are listed in the dictionary. In contrast, an open-vocabulary dictionary may contain words that do not appear in the corpus transcription. Typically, the phoneme set used to represent dictionary words is manually designed by language experts. However, when human expertise is not available, the phoneme set can be selected using a data-driven approach, as demonstrated by (Singh et al., 2002). In addition to providing phonemic transcriptions of the words of the target vocabulary, the dictionary is the place where alternative pronunciation variants are added, as in (Ali et al., 2009) for Arabic.
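As an illustration of the mapping, the entries below follow the general one-word-per-line Sphinx dictionary layout, with a compound entry joined by an underscore as in the proposed method. The Romanized words and phoneme symbols are hypothetical examples, not taken from the system's actual dictionary:

```
fy            F IY
almubarah     A L M U B A A R A H
fy_almubarah  F I L M U B A A R A H
```

The compound entry lets the decoder match the merged cross-word pronunciation directly, rather than aligning two short words separately.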


## **3.3. Decoder (Recognizer)**

With help from the linguistic part, the decoder is the module where the recognition process takes place. The decoder uses the speech features presented by the Front-End to search for the most probable words and, in turn, sentences that correspond to the observed speech features. The recognition process starts by finding the likelihood of a given sequence of speech features based on the phoneme HMMs.

The speech recognition problem is to transcribe the most likely spoken words given the acoustic observations. If $O = o_1, o_2, \ldots, o_n$ is the acoustic observation and $W = w_1, w_2, \ldots, w_n$ is a word sequence, then:

$$\hat{W} = \underset{W}{\arg\max}\; P(W)\, P(O \mid W)$$

where $\hat{W}$ is the most probable word sequence for the spoken words, which is also called the maximum a posteriori estimate. $P(W)$ is the prior probability computed by the language model, and $P(O \mid W)$ is the probability of the observation computed using the acoustic model.
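The argmax is realized by the Viterbi search. The following compact sketch shows the core recurrence over a toy discrete HMM in log space; real decoders search phoneme HMMs with beam pruning, so this is only the skeleton under assumed toy matrices.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely state path for an observation sequence.

    obs:       observation indices, length T
    log_init:  log initial state probabilities, shape (S,)
    log_trans: log transition matrix, shape (S, S)
    log_emit:  log emission matrix, shape (S, V)
    """
    T, S = len(obs), log_init.shape[0]
    score = log_init + log_emit[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # rows: previous state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[:, obs[t]]
    # Backtrace from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```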

## **4. Pronunciation variation**

The main goal of ASR is to enable people to communicate more naturally and effectively. But this ultimate dream faces many obstacles, such as different speaking styles, which lead to the "pronunciation variation" phenomenon. This phenomenon appears in the form of insertions, deletions, or substitutions of phoneme(s) relative to the phonemic transcription in the pronunciation dictionary. (Benzeghiba et al., 2007) presented the sources of speech variability: foreign and regional accents, speaker physiology, spontaneous speech, rate of speech, children's speech, emotional state, noise, new words, and more. Accordingly, handling these obstacles is a major requirement for better ASR performance.

There are two types of pronunciation variation: cross-word variation and within-word variation. A within-word variation causes alternative pronunciation(s) of the same word. In contrast, a cross-word variation occurs in continuous speech when a sequence of words forms a compound word that should be treated as one entity. Pronunciation variation can be modeled using two approaches: knowledge-based and data-driven. The knowledge-based approach depends on linguistic studies that lead to phonological rules, which are applied to find the possible alternative variants. On the other hand, data-driven methods depend solely on the pronunciation corpus to find the pronunciation variants (direct data-driven) or transformation rules (indirect data-driven). In this chapter, we use the knowledge-based approach to model the cross-word pronunciation variation problem.
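As a sketch of the knowledge-based route, the function below applies a single illustrative phonological rule across a word boundary (a word-final /n/ assimilating to /m/ before a word-initial /b/, as in the well-known Arabic Iqlab case); the phoneme symbols and the one-rule inventory are assumptions for illustration.

```python
def apply_cross_word_rule(words_phones):
    """Generate a cross-word pronunciation variant with one rule.

    words_phones: list of per-word phoneme lists (each non-empty).
    Assumed rule: word-final 'n' followed by word-initial 'b'
    is realized as 'm' (cross-word assimilation).
    Returns the concatenated phoneme sequence of the merged variant.
    """
    variant = []
    for i, phones in enumerate(words_phones):
        phones = list(phones)
        nxt = words_phones[i + 1] if i + 1 < len(words_phones) else None
        if nxt and phones[-1] == "n" and nxt[0] == "b":
            phones[-1] = "m"   # the rule fires at the word boundary
        variant.extend(phones)
    return variant

# "min ba'd" (from after): /n/ + /b/ -> /m/ across the boundary
print(apply_cross_word_rule([["m", "i", "n"], ["b", "a", "d"]]))
# ['m', 'i', 'm', 'b', 'a', 'd']
```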

Both approaches have pros and cons. The knowledge-based approach is not exhaustive: not all of the variations that occur in continuous speech have been described. On the other hand, obtaining reliable information using data-driven methods is difficult. Nevertheless, (Amdal & Fossler-Lussier, 2003) mentioned that there is growing interest in data-driven methods over knowledge-based methods due to the lack of domain expertise. Figure 8 displays these two techniques. Figure 8 also distinguishes between the types of variations and the modeling techniques by a dashed line: the pronunciation variation types are above the dashed line, whereas the modeling techniques are below it.

**Figure 8.** Pronunciation variations and modeling techniques

## **5. Arabic speech recognition**


This work focuses on Arabic speech recognition, which has gained increasing importance in the last few years. Arabic is a Semitic language spoken by more than 330 million people as a native language (Farghaly & Shaalan, 2009). While the Arabic language has many spoken dialects, it has a standard written form. As a result, additional challenges are introduced for speech recognition systems, as the spoken dialects are not officially written. The same country can contain different dialects, and a dialect itself can vary from one region to another according to factors such as religion, gender, and urban/rural setting. Speakers of different dialects usually use Modern Standard Arabic (MSA) to communicate.

## **5.1. Modern standard Arabic**

In this chapter, we consider Modern Standard Arabic (MSA), which is currently used in writing and in most formal speech. MSA is also the major medium of communication for public speaking and news broadcasting (Ryding, 2005) and is considered the official language in most Arabic-speaking countries (Lamel et al., 2009). Arabic language challenges are presented in the next section, followed by a review of the literature and recent efforts in Arabic speech recognition. For more information about Modern Standard Arabic, (Ryding, 2005) is a rich reference.

#### **5.2. Arabic speech recognition challenges**

Arabic speech recognition faces many challenges. First, Arabic has many dialects in which the same words are pronounced differently. In addition, the spoken dialects are not officially written, so it is very costly to obtain adequate corpora, which presents a training problem for Arabic ASR researchers (Owen et al., 2006). Second, Arabic has short vowels (diacritics), which are usually omitted in text. The lack of diacritical marks introduces another serious problem for Arabic speech recognition: more hypothesis words must be considered during the decoding process, which may reduce accuracy. (Elmahdy et al., 2009) summarized some of the problems in Arabic speech recognition, highlighting Arabic phonetics, the diacritization problem, grapheme-to-phoneme conversion, and morphological complexity. Although foreign phoneme sounds such as /v/ and /p/ occur in Arabic speech in foreign names, the standard Arabic alphabet has no letters assigned to these foreign sounds. Second, the absence of diacritical marks in modern Arabic text creates ambiguities in pronunciation and meaning. For example, the non-diacritized Arabic word (كتب) could be read in one of several ways, such as (كَتَبَ, he wrote), (كُتِبَ, it was written), and (كُتُب, books). Even though an Arabic reader can interpret and utter the correct choice, it is hard to embed this cognitive process in current speech recognition and speech synthesis systems. The majority of Arabic corpora available for acoustic modeling have non-diacritized transcriptions. (Elmahdy et al., 2009) also showed that the grapheme-to-phoneme relation holds only for diacritized Arabic script; hence, Arabic speech recognition faces an obstacle because of the lack of diacritized corpora. Arabic morphological complexity is demonstrated by the large number of affixes (prefixes, infixes, and suffixes) that can be added to the three-consonant radicals to form patterns. (Farghaly & Shaalan, 2009) provided a comprehensive study of Arabic language challenges and solutions. The mentioned challenges include the non-concatenative nature of Arabic morphology, the absence of the orthographic representation of Arabic diacritics from contemporary Arabic text, and the need for an explicit grammar of MSA that defines linguistic constituency in the absence of case marking. (Lamel et al., 2009) presented a number of challenges for Arabic speech recognition, such as missing diacritics, dialectal variants, and very large lexical variety. (Alotaibi et al., 2008) introduced foreign-accented Arabic speech as a challenging task in speech recognition. (Billa et al., 2002) discussed a number of research issues for Arabic speech recognition, e.g., the absence of diacritics in written text and the presence of compound words that are formed by the concatenation of certain conjunctions, prepositions, articles, and pronouns as prefixes and suffixes to the word stem.


#### **5.3. Literature and recent work**

A number of researchers have recently addressed the development of Arabic speech recognition systems. (Abushariah et al., 2012) proposed a framework for the design and development of a speaker-independent continuous automatic Arabic speech recognition system based on a phonetically rich and balanced speech corpus. Their method reduced the WER to 9.81% for a diacritized transcription corpus, as they reported. (Hyassat & Abu Zitar, 2008)


described an Arabic speech recognition system based on Sphinx 4. Three corpora were developed, namely, the Holy Qur'an corpus of about 18.5 hours, the command and control corpus of about 1.5 hours, and the Arabic digits corpus of less than 1 hour of speech. They also proposed an automatic toolkit for building pronunciation dictionaries for the Holy Qur'an and standard Arabic. (Al-Otaibi, 2001) provided a single-speaker speech dataset for MSA and proposed a technique for labeling Arabic speech. Using the Hidden Markov Model Toolkit (HTK), he reported a recognition rate for speaker-dependent ASR of 93.78%. (Afify et al., 2005) compared a grapheme-based recognition system with one explicitly modeling diacritics (short vowels). They found that diacritic modeling improves recognition performance. (Satori et al., 2007) used the CMU Sphinx tools for Arabic speech recognition and demonstrated their use for recognizing isolated Arabic digits, achieving a digit recognition accuracy of 86.66% for data recorded from six speakers. (Alghamdi et al., 2009) developed an Arabic broadcast news transcription system. They used a corpus of 7.0 hours for training and 0.5 hours for testing; the WER they obtained was 14.9%. (Lamel et al., 2009) described incremental improvements to a system for the automatic transcription of broadcast data in Arabic, highlighting techniques developed to deal with specificities of the Arabic language (no diacritics, dialectal variants, and lexical variety). (Billa et al., 2002) described the development of an audio indexing system for broadcast news in Arabic. Key issues addressed in their work revolve around the three major components of the audio indexing system: automatic speech recognition, speaker identification, and named entity identification. (Soltau et al., 2007) reported advancements in the IBM system for Arabic speech recognition as part of the continuous effort for the Global Autonomous Language Exploitation (GALE) project. The system consisted of multiple stages that incorporate both diacritized and non-diacritized Arabic speech models, and it incorporated a training corpus of 1,800 hours of unsupervised Arabic speech. (Azmi et al., 2008) investigated using Arabic syllables for a speaker-independent speech recognition system for Arabic spoken digits. The pronunciation corpus used for both training and testing consisted of 44 Egyptian speakers. In a clean environment, experiments showed that the recognition rate obtained using syllables outperformed the rates obtained using monophones, triphones, and words by 2.68%, 1.19%, and 1.79%, respectively. Over a noisy telephone channel, syllables outperformed monophones, triphones, and words by 2.09%, 1.5%, and 0.9%, respectively. (Elmahdy et al., 2009) used acoustic models trained on a large MSA news broadcast speech corpus as multilingual or multi-accent models to decode colloquial Arabic. (Khasawneh et al., 2004) compared a polynomial classifier applied to isolated-word speaker-independent Arabic speech with a dynamic time warping (DTW) recognizer. They concluded that the polynomial classifier produced better recognition performance and a much faster testing response than the DTW recognizer. (Shoaib et al., 2003) presented an approach to develop a robust Arabic speech recognition system based on a hybrid set of speech features consisting of intensity contours and formant frequencies.
(Alotaibi, 2004) reported achieving high-performance Arabic digit recognition using recurrent networks. (Choi et al., 2008) presented recent improvements to their English/Iraqi Arabic speech-to-speech translation system; the system-wide improvements covered the user interface, dialog manager, ASR, and machine translation components. (Nofal et al., 2004) demonstrated the design and implementation of new stochastic-based acoustic models for use with a command and control Arabic speech recognition system. (Mokhtar & El-Abddin, 1996) presented the techniques and algorithms used to model the acoustic-phonetic structure of Arabic speech recognition using HMMs. (Park et al., 2009) explored the training and adaptation of multilayer perceptron (MLP) features in Arabic ASR systems; they used MLP features to incorporate short-vowel information into the graphemic system, and used linear input network (LIN) adaptation as an alternative to the usual HMM-based linear adaptation. (Imai et al., 1995) presented a new method for the automatic generation of speaker-dependent phonological rules to decrease recognition errors caused by speaker-dependent pronunciation variability. (Muhammad et al., 2011) evaluated a conventional ASR system on Arabic digits spoken by patients with six different types of voice disorder, using MFCC features and a Gaussian mixture model (GMM)/HMM classifier; recognition results were analyzed by type of disorder. (Bourouba et al., 2006) presented an HMM/support vector machine (SVM)/k-nearest neighbor system for the recognition of isolated spoken Arabic words. (Sagheer et al., 2005) presented a visual speech feature representation system and used it to build a complete lip-reading system. (Taha et al., 2007) demonstrated an agent-based design for Arabic speech recognition, defining the recognizer as a multi-agent system in which each agent had a specific goal and dealt with that goal only. (Elmisery et al., 2003) implemented a pattern matching algorithm based on HMMs using a field programmable gate array (FPGA); the proposed approach was used for isolated Arabic word recognition. (Gales et al., 2007) described the development of a phonetic system for Arabic speech recognition. (Bahi & Sellami, 2001) presented experiments on recognizing isolated Arabic words with a system based on a combination of vector quantization at the acoustic level and Markovian modeling. (Essa et al., 2008) proposed combined classifier architectures based on neural networks, varying the initial weights, architecture, type, and training data, to recognize isolated Arabic words. (Emami & Mangu, 2007) studied the use of neural network language models (NNLMs) for Arabic broadcast news and broadcast conversation speech recognition. (Messaoudi et al., 2006) demonstrated that, by building a very large vocalized vocabulary and using a language model that includes a vocalized component, the WER could be significantly reduced. (Vergyri et al., 2004) showed that the use of morphology-based language models at different stages of a large vocabulary continuous speech recognition (LVCSR) system for Arabic leads to WER reductions. To deal with the huge lexical variety, (Xiang et al., 2006) concentrated on the transcription of Arabic broadcast news by utilizing morphological decomposition in both the acoustic and language modeling of their system.
(Selouani & Alotaibi, 2011) presented genetic algorithms to adapt HMMs for non-native speech in a large vocabulary speech recognition system for MSA. (Saon et al., 2010) described the Arabic broadcast transcription system fielded by IBM in the GALE project. They reported improved discriminative training, the use of subspace Gaussian mixture models (SGMMs), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models, and neural-network-based language modeling (NNLMs); the achieved WER was 8.9% on the evaluation test set. (Kuo et al., 2010) studied various syntactic and morphological context features incorporated in an NNLM for Arabic speech recognition.

Since the ASR decoder works better with long words, our method focuses on merging transcription words so as to increase the number of long words. For this purpose, we merge words according to their tags: a noun followed by an adjective, and a preposition followed by a word. We utilize a PoS tagging approach to tag the transcription corpus; the tagged transcription is then used to find the word pairs to be merged, as sketched below.
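To make the merging step concrete, the following is a minimal sketch, assuming the transcription arrives as (word, tag) pairs drawn from the Table 3 tag set; the underscore joiner and the exact tag groupings are our illustrative choices, not the chapter's implementation:

```python
# A minimal sketch of the tag-driven merging step described above.
# Assumptions (not from the chapter): tokens arrive as (word, tag)
# pairs tagged with the Stanford Arabic tag set of Table 3, and merged
# pairs are joined with an underscore so downstream models see one token.

NOUN_TAGS = {"NN", "NNS", "NNP", "DTNN", "DTNNS", "DTNNP", "NOUN_QUANT"}
ADJ_TAGS = {"JJ", "JJR", "DTJJ", "DTJJR", "ADJ_NUM"}
PREP_TAG = "IN"

def merge_pairs(tagged_tokens):
    """Merge a noun followed by an adjective, and a preposition
    followed by any word, into single compound tokens."""
    merged = []
    i = 0
    while i < len(tagged_tokens):
        word, tag = tagged_tokens[i]
        if i + 1 < len(tagged_tokens):
            next_word, next_tag = tagged_tokens[i + 1]
            noun_adj = tag in NOUN_TAGS and next_tag in ADJ_TAGS
            prep_word = tag == PREP_TAG
            if noun_adj or prep_word:
                merged.append(word + "_" + next_word)
                i += 2  # consume both words of the merged pair
                continue
        merged.append(word)
        i += 1
    return merged

# e.g. a noun (DTNN) followed by an adjective (DTJJ) becomes one token:
print(merge_pairs([("قال", "VBD"), ("الوزير", "DTNN"), ("الجديد", "DTJJ")]))
# ['قال', 'الوزير_الجديد']
```

A greedy left-to-right pass like this merges each word into at most one compound, which keeps the mapping between the merged transcription and the original word sequence unambiguous.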

A tag is a word property such as noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection, etc. Tag sets differ from language to language. In our method, we used the Arabic module of the Stanford tagger (Stanford Log-linear Part-Of-Speech Tagger, 2011). This tagger defines 29 tags in total, of which 13 were used in our method, as listed in Table 3. As mentioned, we focused on three kinds of tags: nouns, adjectives, and prepositions. In Table 3, DT is shorthand for the determiner article (ال التعريف), which corresponds to "the" in English.

| # | Tag | Description | Examples |
|---|-----|-------------|----------|
| 1 | ADJ\_NUM | Adjective, numeric | السابع، الرابعة |
| 2 | DTJJ | DT + adjective | النفطية، الجديد |
| 3 | DTJJR | DT + adjective, comparative | الكبرى، العليا |
| 4 | DTNN | DT + noun, singular or mass | العاصمة، المنظمة |
| 5 | DTNNP | DT + proper noun, singular | العراق، القاهرة |
| 6 | DTNNS | DT + noun, plural | الولايات، السيارات |
| 7 | IN | Preposition or subordinating conjunction | في، أنْ |
| 8 | JJ | Adjective | قيادية، جديدة |
| 9 | JJR | Adjective, comparative | أدنى، كبرى |
| 10 | NN | Noun, singular or mass | نجم، إنتاج |
| 11 | NNP | Proper noun, singular | أوبك، لبنان |
| 12 | NNS | Noun, plural | توقعات، طلبات |
| 13 | NOUN\_QUANT | Noun, quantity | ثلثي، الربع |

**Table 3.** The 13 Stanford tagger tags used in our method.
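As a rough sketch of how the transcription corpus might be tagged, the Stanford tagger can be invoked from Python through NLTK's wrapper; the jar and model paths below are placeholder assumptions for a local installation (a Java runtime is also required), not the chapter's actual setup:

```python
# A hedged sketch of tagging Arabic text with the Stanford tagger via
# NLTK's wrapper. The jar/model paths are assumed placeholders; actual
# file names depend on the local Stanford tagger installation.
from nltk.tag import StanfordPOSTagger

tagger = StanfordPOSTagger(
    model_filename="models/arabic.tagger",   # assumed path to the Arabic model
    path_to_jar="stanford-postagger.jar",    # assumed path to the tagger jar
)

tokens = "قال الوزير الجديد".split()
print(tagger.tag(tokens))
# Expected shape (tags from the tagger's full 29-tag set):
# [('قال', 'VBD'), ('الوزير', 'DTNN'), ('الجديد', 'DTJJ')]
```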
