
290 Modern Speech Recognition Approaches with Case Studies

presented recent improvements to their English/Iraqi Arabic speech-to-speech translation system. The presented system-wide improvements included the user interface, dialog manager, ASR, and machine translation components. (Nofal et al., 2004) demonstrated a design and implementation of new stochastic-based acoustic models for use with a command-and-control speech recognition system for Arabic. (Mokhtar & El-Abddin, 1996) presented the techniques and algorithms used to model the acoustic-phonetic structure of Arabic speech recognition using HMMs. (Park et al., 2009) explored the training and adaptation of multilayer perceptron (MLP) features in Arabic ASRs. They used MLP features to incorporate short-vowel information into the graphemic system. They also used linear input network (LIN) adaptation as an alternative to the usual HMM-based linear adaptation. (Imai et al., 1995) presented a new method for the automatic generation of speaker-dependent phonological rules in order to decrease recognition errors caused by speaker-dependent pronunciation variability. (Muhammad et al., 2011) evaluated a conventional ASR system on six different types of voice-disorder patients speaking Arabic digits. MFCCs and Gaussian mixture models (GMM)/HMMs were used as features and classifier, respectively. Recognition results were analyzed by disease type. (Bourouba et al., 2006) presented an HMM/support vector machine (SVM)/k-nearest neighbor system for the recognition of isolated spoken Arabic words. (Sagheer et al., 2005) presented a visual speech feature representation system and used it to build a complete lip-reading system. (Taha et al., 2007) demonstrated an agent-based design for Arabic speech recognition. They defined Arabic speech recognition as a multi-agent system where each agent had a specific goal and dealt with that goal only.

(Elmisery et al., 2003) implemented a pattern-matching algorithm based on HMMs using a field-programmable gate array (FPGA). The proposed approach was used for isolated Arabic word recognition. (Gales et al., 2007) described the development of a phonetic system for Arabic speech recognition. (Bahi & Sellami, 2001) presented experiments performed to recognize isolated Arabic words. Their recognition system was based on a combination of the vector quantization technique at the acoustic level and Markovian modeling. (Essa et al., 2008) proposed combined classifier architectures based on neural networks, varying the initial weights, architecture, type, and training data, to recognize isolated Arabic words. (Emami & Mangu, 2007) studied the use of neural network language models (NNLMs) for Arabic broadcast news and broadcast conversations speech recognition. (Messaoudi et al., 2006) demonstrated that by building a very large vocalized vocabulary and by using a language model including a vocalized component, the WER could be significantly reduced. (Vergyri et al., 2004) showed that the use of morphology-based language models at different stages in a large vocabulary continuous speech recognition (LVCSR) system for Arabic leads to WER reductions. To deal with the huge lexical variety, (Xiang et al., 2006) concentrated on the transcription of Arabic broadcast news by utilizing morphological decomposition in both acoustic and language modeling in their system. (Selouani & Alotaibi, 2011) presented genetic algorithms to adapt HMMs for non-native speech in a large vocabulary speech recognition system of MSA. (Saon et al., 2010) described the Arabic broadcast transcription system fielded by IBM in the GALE project. They reported improved discriminative training, the use of subspace Gaussian mixture models (SGMMs), the use of neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models, and neural-network-based language modeling (NNLMs). The achieved WER was 8.9% on the evaluation test set. (Kuo et al., 2010) studied various syntactic and morphological context features incorporated in an NNLM for Arabic speech recognition.

## **6. The proposed method**

Since the ASR decoder works better with long words, our method focuses on finding a way to merge transcription words in order to increase the number of long words. For this purpose, we merge words according to their tags: a noun followed by an adjective is merged, and a preposition followed by a word is merged. We utilize a PoS tagging approach to tag the transcription corpus; the tagged transcription is then used to find the new merged words.

A tag is a word property such as noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection, etc. Each language has its own tags, and tags may differ from language to language. In our method, we used the Arabic module of the Stanford tagger (Stanford Log-linear Part-Of-Speech Tagger, 2011). This tagger has 29 tags in total; only the 13 tags listed in Table 3 were used in our method. As mentioned, we focused on three kinds of tags: nouns, adjectives, and prepositions. In Table 3, DT is shorthand for the determiner (the definite article ال التعريف), which corresponds to "the" in English.


| # | Tag | Meaning | Example |
|---|-----|---------|---------|
| 7 | IN | Preposition | في |

**Table 3.** A partial list of Stanford Tagger's tags with examples
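The tagger emits one `word/TAG` token per input word (the format shown in Table 4). As a minimal illustration, not the chapter's actual tooling, the following sketch parses such output into (word, tag) pairs; the sample sentence is a toy English stand-in:

```python
def parse_tagged(output: str):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    The tag is taken after the LAST slash, so words that themselves
    contain '/' are still handled correctly.
    """
    pairs = []
    for token in output.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Toy stand-in for one line of tagger output (illustrative only).
print(parse_tagged("the/DT big/JJ house/NN in/IN town/NN"))
# [('the', 'DT'), ('big', 'JJ'), ('house', 'NN'), ('in', 'IN'), ('town', 'NN')]
```

Such (word, tag) pairs are the form in which the tagged transcription is scanned for mergeable adjacent tags.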

In this work, we use Noun-Adjective as shorthand for a compound word generated by merging a noun and an adjective, and Preposition-Word as shorthand for a compound word generated by merging a preposition with the subsequent word. The prepositions used in our method include:

Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging 293

*(Figure: a tagged Arabic sentence *W*<sup>1</sup> … *W*<sup>5</sup> in which a noun (*W*<sup>3</sup>) followed by an adjective (*W*<sup>4</sup>) is merged into a single compound word. W: word.)*

**Figure 10.** A Noun-Adjective compound word generation

(منذ، حتى، في، على، عن، الى، من) (mundhu, Hata, fy, 'ala, 'an, 'ila, min). Other prepositions were not included, as they are rarely used in MSA. Table 4 shows the tagger output for a simple non-diacritized sentence.


| | |
|---|---|
| Input sentence to the tagger | وأوضح عضو لجنة المقاولين في غرفة الرياض بشير العظم |
| Transliteration | wa 'wdaHa 'udwu lajnata 'lmuqawilyna fy ghurfitu 'lriyaD bashyru 'l'aZm |
| Tagger output | وأوضح/VBD عضو/NN لجنة/NN المقاولين/DTNNS في/IN غرفة/NN الرياض/DTNNP بشير/NNP العظم/DTNN |

**Table 4.** An Arabic sentence and its tags

Thus, the tagger output is used to generate compound words by searching for Noun-Adjective and Preposition-Word sequences. Figure 9 shows two possible compound words, one for the Noun-Adjective case and one for the Preposition-Word case. These two compound words are then represented in new sentences, as illustrated in Figure 9. Therefore, the three sentences (the original and the two new ones) will be used, together with all other cases, to produce the enhanced language model and the enhanced pronunciation dictionary.

**Figure 9.** The compound words representations

Figure 10 shows the process of generating a compound word: a noun followed by an adjective is merged to produce one compound word, and similarly, a preposition followed by a word is merged to form one compound word. It is worth mentioning that our method is independent of the handling of pronunciation variations that may occur at word junctures. That is, our method does not consider the phonological rules that could be applied between certain words.

The steps for modeling the cross-word phenomenon are described by the algorithm (pseudocode) shown in Figure 11. In the figure, the Offline stage is implemented once before decoding, while the Online stage is implemented repeatedly after each decoding process.
**Offline Stage**

- *Using a PoS tagger, have the transcription corpus tagged*
- *For all tagged sentences in the transcription file*
    - *For each two adjacent tags of each tagged sentence*
        - *If the adjacent tags are adjective/noun or word/preposition*
            - *Generate the compound word*
            - *Represent the compound word in the transcription*
        - *End if*
    - *End for*
- *End for*
- *Based on the new transcription, build the enhanced dictionary*
- *Based on the new transcription, build the enhanced language model*

**Online Stage**

- *Switch the variants back to their original separated words*

**Figure 11.** Cross-word modeling algorithm using PoS tagging
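The offline compound-word generation of Figure 11 can be sketched in Python. This is a minimal illustration under simplifying assumptions: a sentence is given as a list of (word, tag) pairs, `NOUN_TAGS`/`ADJ_TAGS`/`PREP_TAG` are hypothetical stand-ins for the Stanford tag subsets actually used, and the underscore joining the merged pair is an arbitrary marker:

```python
# Hypothetical tag subsets standing in for the 13 Stanford tags used in the chapter.
NOUN_TAGS = {"NN", "NNS", "NNP", "DTNN", "DTNNS", "DTNNP"}
ADJ_TAGS = {"JJ", "DTJJ"}
PREP_TAG = "IN"

def compound_sentences(tagged):
    """Given one tagged sentence [(word, tag), ...], return a new sentence
    for every Noun-Adjective or Preposition-Word pair, with that pair
    merged into a single compound token."""
    words = [w for w, _ in tagged]
    variants = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if (t1 in NOUN_TAGS and t2 in ADJ_TAGS) or t1 == PREP_TAG:
            merged = words[:i] + [w1 + "_" + w2] + words[i + 2:]
            variants.append(" ".join(merged))
    return variants

# Toy tagged sentence: the preposition 'fy' is followed by a noun.
sent = [("kataba", "VBD"), ("fy", "IN"), ("ghurfa", "NN")]
print(compound_sentences(sent))  # ['kataba fy_ghurfa']
```

The original sentence plus each generated variant would then feed the enhanced dictionary and language model, matching the offline stage of the pseudocode.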

## **7. The results**

The proposed method was investigated on a speaker-independent Modern Standard Arabic speech recognition system using the Carnegie Mellon University Sphinx speech recognition engine. Three performance metrics were used to measure the performance enhancement: word error rate (WER), out-of-vocabulary (OOV) rate, and perplexity (PP).
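Of these metrics, WER is computed from the word-level edit distance between the reference and the recognizer's hypothesis. A standard dynamic-programming sketch (not tied to the Sphinx toolchain used in the chapter):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("a b c d", "a x c"))  # 0.5 -> one substitution + one deletion over 4 words
```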


The perplexity is given by

$$PP(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}}$$

where $PP$ is the perplexity, $P$ is the probability of the word set to be tested $W = w_1, w_2, \ldots, w_N$, and $N$ is the number of words in the set.
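As a numerical illustration of this formula, using toy per-word probabilities rather than the chapter's language model:

```python
import math

def perplexity(probs):
    """PP = P(w1..wN)^(-1/N); computed in log space for numerical stability.

    `probs` holds the model probability assigned to each word of the
    test sequence, so their product is P(w1..wN).
    """
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)

# If every one of the N words has probability 0.1, the perplexity is 10:
# the model is as "confused" as a uniform choice among 10 words.
print(perplexity([0.1] * 5))  # ~10.0
```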

The performance detection method proposed by Plötz (Plötz, 2005) is used to assess the achieved recognition results, with a 95% level of confidence. The WER of the baseline system (12.21%) and the total number of words in the testing set (9,288 words) are used to find the confidence interval [εl, εh]. The boundaries of the confidence interval are found to be [12.21 − 0.68, 12.21 + 0.68] = [11.53, 12.89]. If the changed classification error rate falls outside this interval, the change can be interpreted as statistically significant; otherwise, it cannot.
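The half-width of such an interval can be approximated with the standard normal-approximation binomial interval; with the chapter's numbers it gives roughly ±0.67, close to the reported ±0.68 (the small gap presumably comes from the exact interval construction Plötz uses):

```python
import math

def wer_confidence_halfwidth(wer, n_words, z=1.96):
    """Half-width of the ~95% normal-approximation confidence interval
    for an error rate estimated from n_words scored words.

    z = 1.96 corresponds to a 95% level of confidence.
    """
    return z * math.sqrt(wer * (1 - wer) / n_words)

# Baseline WER 12.21% over 9,288 test words, as in the chapter.
hw = wer_confidence_halfwidth(0.1221, 9288)
print(round(100 * hw, 2))  # ~0.67 percentage points
```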

Table 5 shows the enhancements for the different experiments. Since the enhanced method (in the Noun-Adjective case) achieved a WER of 9.82%, which is outside the above-mentioned confidence interval [11.53, 12.89], the achieved enhancement is statistically significant. The other cases (Preposition-Word and Hybrid) are similar, i.e. their WERs also fall outside the confidence interval.
