| Experiment | Perplexity | OOV (%) |
|---|---|---|
| Baseline System | 34.08 | 328/9288 = 3.53% |
| 1 Noun-Adjective | 3.00 | 287/9288 = 3.09% |
| 2 Preposition-Word | 3.22 | 299/9288 = 3.21% |
| 3 Hybrid (1 & 2) | 2.92 | 316/9288 = 3.40% |

**Table 6.** Perplexities and OOV for different experiments

WER is a common metric for measuring the performance of ASRs. It is computed using the following formula:

$$WER = \frac{S + D + I}{N}$$

Where:

- *S* is the number of substituted words,
- *D* is the number of deleted words,
- *I* is the number of inserted words, and
- *N* is the total number of words in the testing set.


The word accuracy can also be measured using WER as the following formula:

$$\text{Word Accuracy} = 1 - \text{WER}$$
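As an illustration, the WER above can be computed with a word-level edit distance. This is a generic sketch, not the evaluation tool used in this work:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance:
    WER = (S + D + I) / N, computed with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # i deletions
    for j in range(m + 1):
        d[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[n][m] / n
```

The dynamic-programming table finds the minimum-cost alignment, so S, D, and I are counted jointly rather than separately.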

OOV is another metric used to measure the performance of ASRs. OOV words are a known source of recognition errors, which in turn can lead to additional errors in the words that follow (Gallwitz et al., 1996). Therefore, a growing OOV rate plays a significant role in increasing WER and deteriorating performance. In this research work, the baseline system is based on a closed vocabulary. The closed vocabulary assumes that all words of the testing set are already included in the dictionary. Jurafsky and Martin (2009) explored the differences between open and closed vocabularies. In our method, we calculate OOV as the percentage of recognized words that do not belong to the testing set, but to the training set. The following formula is used to find OOV:

$$\text{OOV (baseline system)} = \frac{\text{non-testing-set words}}{\text{total words in the testing set}} \times 100$$
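A minimal sketch of this OOV computation, with illustrative names (the evaluation scripts used in this work are not shown here):

```python
def oov_rate(recognized_words, testing_set_words):
    """OOV as the percentage of recognized words that are not in
    the testing set, relative to the testing set's total size."""
    testing_vocab = set(testing_set_words)
    non_test = sum(1 for w in recognized_words if w not in testing_vocab)
    return 100.0 * non_test / len(testing_set_words)
```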

The perplexity of a language model is defined in terms of the inverse of the average log likelihood per word (Jelinek, 1999). It indicates the average number of words that can follow a given word and is a measure of the predictive power of the language model (Saon & Padmanabhan, 2001). Measuring perplexity is a common way to evaluate an N-gram language model: it measures the quality of a model independently of any ASR system. The measurement is, of course, performed on the testing set. A system with lower perplexity is considered better than one with higher perplexity. The perplexity formula is:

$$PP(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}}$$

Where PP is the perplexity, P is the probability of the tested word sequence $W = w_1, w_2, \ldots, w_N$, and N is the total number of words in the testing set.
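For instance, perplexity can be computed from per-word log probabilities produced by any N-gram model. This is a toy sketch, not tied to a specific toolkit:

```python
import math

def perplexity(word_log_probs):
    """PP(W) = P(w1..wN)^(-1/N), computed from natural-log
    per-word probabilities for numerical stability."""
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# Four words, each with conditional probability 1/10, give PP = 10.
pp = perplexity([math.log(0.1)] * 4)
```

Working in log space avoids underflow when N is large, which is why toolkits report log probabilities rather than raw products.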

The performance detection method proposed by Plötz (Plötz, 2005) is used to investigate the achieved recognition results. A 95% confidence level is used. The WER of the baseline system (12.21%) and the total number of words in the testing set (9288 words) are used to find the confidence interval [εl, εh]. The boundaries of the confidence interval are found to be [12.21 − 0.68, 12.21 + 0.68] = [11.53, 12.89]. If the changed classification error rate is outside this interval, the change can be interpreted as statistically significant; otherwise, it is most likely caused by chance.
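A normal-approximation sketch of this interval (the exact procedure of Plötz, 2005, may differ slightly in the tails):

```python
import math

def wer_confidence_interval(wer, n_words, z=1.96):
    """Normal-approximation 95% confidence interval for a WER
    measured on n_words test words: wer +/- z*sqrt(wer(1-wer)/n)."""
    half_width = z * math.sqrt(wer * (1.0 - wer) / n_words)
    return wer - half_width, wer + half_width

lo, hi = wer_confidence_interval(0.1221, 9288)
# A system whose WER falls outside [lo, hi] differs from the
# baseline at the 95% level; 9.82% (Noun-Adjective) does.
significant = not (lo <= 0.0982 <= hi)
```

With these inputs the half-width comes out near 0.67 percentage points, consistent with the ±0.68 reported above.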

Table 5 shows the enhancements for different experiments. Since the enhanced method (in the Noun-Adjective case) achieved a WER of 9.82%, which is outside the above-mentioned confidence interval [11.53, 12.89], it is concluded that the achieved enhancement is statistically significant. The other cases are similar; the Preposition-Word and Hybrid cases also achieved significant improvements.


**Table 5.** Accuracy achieved and WERs for different cases

294 Modern Speech Recognition Approaches with Case Studies


## **7. The results**

The proposed method was investigated on a speaker-independent modern standard Arabic speech recognition system using the Carnegie Mellon University Sphinx speech recognition engine. Three performance metrics were used to measure the performance enhancement: the word error rate (WER), the out-of-vocabulary (OOV) rate, and the perplexity (PP).

Table 5 shows that the highest accuracy is achieved in the Noun-Adjective case. The reduction in accuracy in the hybrid case is due to the ambiguity introduced into the language model. To clarify: our method depends on adding new sentences to the transcription corpus that is used to build the language model, so adding many sentences eventually biases the language model toward some n-grams (1-grams, 2-grams, and 3-grams) at the expense of others.

The common way to evaluate an N-gram language model is perplexity. The perplexity of the baseline is 34.08; the perplexities of the language models for the proposed cases are displayed in Table 6. The measurements were taken on the testing set, which contains 9288 words. The enhanced cases are clearly better, as their perplexities are lower. The reason for the low perplexities is the specific domains used in our corpus, i.e. economics and sports.



The OOV was also measured for the performed experiments. Our ASR system is based on a closed vocabulary, so we assume that there are no unknown words. The OOV was calculated as the percentage of recognized words that do not belong to the testing set, but to the training set.

Cross-Word Arabic Pronunciation Variation Modeling Using Part of Speech Tagging 297

[Table 9 residue: the original layout could not be recovered. Each of its four rows — the text of the speech file to be tested, the output of the baseline system, the output of the enhanced system, and the final output after decomposing the merged word — contains the sentence transliterated as *fy 'lmarHalati 'lsabi'a wa 'lthalathyn mina 'ldawry 'l'sbany likurati 'lqadam* ("in the thirty-seventh round of the Spanish football league").]


Table 10 shows comparison results of the suggested methods for cross-word modeling. It shows that the PoS tagging approach outperforms the other methods (i.e., the phonological rules and small-word merging) that were investigated on the same pronunciation corpus. The use of phonological rules was demonstrated in (AbuZeina et al., 2011a), while the merging of small words was presented in (AbuZeina et al., 2011b). Even though PoS tagging seems to be better than the other methods, more research should be carried out for more confidence. So, the comparison demonstrated in Table 10 is subject to change, as more cases need to be investigated for both techniques. That is, cross-word variation was modeled using only two Arabic phonological rules, while only two compounding schemes were applied in the PoS tagging approach.

The recognition time was also compared with that of the baseline system. The comparison covers the testing set, which includes 1144 speech files. The specifications of the machine where we





Applying the OOV formula given earlier, the baseline OOV is equal to 328/9288 × 100 = 3.53%. For the enhanced cases, Table 6 shows the resulting OOVs. Clearly, the lower the OOV, the better the performance; a lower OOV was achieved in all three cases.

Table 7 shows some statistical information collected during the experiments. The "total compound words" column is the total number of Noun-Adjective cases found in the transcription corpus. The "unique compound words" column indicates the total number of Noun-Adjective cases after removing duplicates. The last column, "compound words replaced", is the total number of compound words that were replaced back by their original two disjoint words after the decoding process and prior to the evaluation stage.


**Table 7.** Statistical information for compound words
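The compound-word collection summarized above can be sketched as a scan over PoS-tagged text. The tag labels follow the Penn-style tags emitted by the Stanford Arabic tagger (NN for nouns, JJ for adjectives); the function name is illustrative:

```python
def compound_candidates(tagged_sentence):
    """Yield (noun, adjective) pairs of adjacent words whose tags
    mark a Noun-Adjective sequence in a PoS-tagged sentence."""
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        # NN/NNS/... followed by JJ/... counts as a candidate.
        if t1.startswith("NN") and t2.startswith("JJ"):
            yield (w1, w2)

# Toy tagged fragment (transliterated): "... 'ldawry 'l'sbany"
# ("the Spanish league") is a Noun-Adjective pair.
sent = [("likurati", "NN"), ("'lqadam", "NN"),
        ("'ldawry", "NN"), ("'l'sbany", "JJ")]
pairs = list(compound_candidates(sent))
```

Deduplicating the collected pairs would give the "unique compound words" counts of Table 7.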

Despite the claim that the Stanford Arabic tagger's accuracy is more than 96%, a comprehensive manual verification and correction were performed on the tagger output. It was reasonable to review the collected compound words since our transcription corpus is small (39217 words). For large corpora, the accuracy of the tagger is crucial for the results. Table 8 shows an error that occurred in the tagger output: the word "وقال" (waqala), for example, should be tagged VBD instead of NN.


**Table 8.** Example of Stanford Arabic Tagger Errors

Table 9 shows an illustrative example of the enhancement achieved in the enhanced system. It shows that the baseline system missed one word, "من" (min), while it appears in the output of the enhanced system. Introducing a compound word in this sentence avoided the misrecognition that occurred in the baseline system.


**Table 9.** An example of enhancement in the enhanced system

According to the proposed algorithm, each sentence in the enhanced transcription corpus can have a maximum of one compound word, since sentences are added to the enhanced corpus once a compound word is formed. Finally, after the decoding process, the results are scanned in order to decompose the compound words back into their original form (two separate words). This process is performed using a lookup table such as:

- الكويتالدولي → الكويت الدولي ('lkuwaytldawly → 'lkuwayt 'ldawly)
- فيمطار → في مطار (fymatari → fy matari)
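The decomposition step can be sketched with a simple lookup table. The transliterated entries below follow the examples in this chapter; the helper name is illustrative:

```python
# Lookup table mapping each merged compound word back to its
# original two words (transliterated Arabic).
decompose_table = {
    "'lkuwaytldawly": "'lkuwayt 'ldawly",
    "fymatari": "fy matari",
}

def decompose(hypothesis):
    """Replace every compound word in a decoded word sequence
    with its original two-word form; other words pass through."""
    out = []
    for word in hypothesis.split():
        out.append(decompose_table.get(word, word))
    return " ".join(out)
```

Because each enhanced sentence contains at most one compound word, a single pass over the decoder output suffices.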

## **8. Discussion**


| Experiment | Total compound words | Unique compound words | Compound words replaced |
|---|---|---|---|
| 1 Noun-Adjective | 3328 | 2672 | 377 |
| 2 Preposition-Word | 3883 | 2297 | 409 |
| 3 Hybrid (1 & 2) | 7211 | 4969 | 477 |
