#### **4.2.1 Feature extraction and relative feature calculation**

Studies on emotion in speech indicate that pitch, energy, duration, formants, Mel-frequency cepstrum coefficients (MFCC), and linear prediction cepstrum coefficients (LPCC) are effective absolute features for distinguishing certain emotions. In this paper, six basic features, namely pitch, amplitude energy, box-dimension, zero cross ratio, energy-frequency-value, and first formant frequency, together with their first and second derivatives, are extracted for each frame. In addition, 10-order LPCC and 12-order MFCC are extracted. Although the absolute features of speech expressing the same emotion differ considerably across speakers, the feature changes induced by emotion stimulation differ relatively little. Hence, relative features, which reflect feature change, are more reliable than absolute features for emotion recognition. The relative features used in this paper capture alterations of pitch, energy, and other features; they are obtained by computing the rate of change relative to neutral speech. Features of this kind are robust to speaker variation because their calculation incorporates normalization by the features of neutral speech. To compute the relative features, the reference features of the neutral version of each text and each speaker are obtained by calculating statistics of frame-based parameters. In this paper, the statistical features used are the means of the dynamic features, i.e., pitch, amplitude energy, energy-frequency-value, box-dimension, zero cross ratio, and first formant frequency, as well as their first and second derivatives. These statistical features (eighteen in all: the six basic features and their first and second derivatives) are then used to normalize the corresponding dynamic features of each emotional speech sample, including training and test samples. Assuming $Mf_i$ ($i = 1, 2, \ldots, 18$) are the reference features of the neutral version and $\bar{f}_i$ ($i = 1, 2, \ldots, 18$) are the corresponding dynamic feature vectors, the relative feature vectors $R\bar{f}_i$ can be obtained according to the following formula:



$$R\bar{f}_i = (\bar{f}_i - Mf_i) \;/\; (Mf_i + 0.00000001) \tag{7}$$

where $\bar{f}_i = [f_{i1}, f_{i2}, \ldots, f_{iL}]^T$, $R\bar{f}_i = [Rf_{i1}, Rf_{i2}, \ldots, Rf_{iL}]^T$, and $L$ indicates the length of the feature vector.
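As a minimal sketch of Eq. (7) in Python (assuming NumPy; the function and variable names are illustrative, and the small constant guards against division by zero):

```python
import numpy as np

EPS = 0.00000001  # small constant from Eq. (7), guards against division by zero

def relative_feature(f_i, Mf_i):
    """Eq. (7): change rate of dynamic feature i relative to the
    speaker's neutral-speech reference mean for that feature.

    f_i  : (L,) frame-wise values of dynamic feature i
    Mf_i : scalar reference mean computed from the neutral version
    """
    f_i = np.asarray(f_i, dtype=float)
    return (f_i - Mf_i) / (Mf_i + EPS)

# Applied per feature: 18 dynamic features yield 18 relative feature vectors.
# Rf = [relative_feature(f[i], Mf[i]) for i in range(18)]
```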

#### **4.2.2 Isolated HMMs**

The HMMs are discrete left-right models. The standard algorithms, namely the Forward-Backward procedure, the Viterbi algorithm, and Baum-Welch re-estimation, are employed in this paper. Baum-Welch re-estimation, based on the maximum-likelihood training criterion, is used to train the HMMs, with each HMM modeling one emotion. The Forward-Backward procedure exports the likelihood probability. The Viterbi algorithm, which focuses on the best path through the model, evaluates the likelihood of the best match between the given speech observations and a given HMM, thereby yielding the "optimal" state sequence. The recognition process based on HMMs is shown in Figure 4. A speech sample is analyzed and represented by a feature vector, from which the likelihood between the speech sample and each HMM is computed. The emotion state corresponding to the maximum likelihood is then selected as the output of the classifier.

Fig. 4. Emotion Recognition by HMMs.
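To make the recognition step in Figure 4 concrete, the sketch below implements the scaled forward procedure for a discrete HMM and picks the emotion whose model scores highest. It is a minimal illustration under stated assumptions, not the authors' implementation: observation sequences are assumed to be vector-quantized into discrete symbols, and each emotion's parameters (`pi`, `A`, `B`) are assumed to be already trained by Baum-Welch.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward procedure: log P(obs | model) for a discrete HMM.

    obs : 1-D int array of vector-quantized observation symbols
    pi  : (N,) initial state distribution
    A   : (N, N) transition matrix (upper triangular for left-right models)
    B   : (N, M) emission probabilities over the M codebook symbols
    """
    alpha = pi * B[:, obs[0]]              # initialization
    c = alpha.sum()                        # scaling factor against underflow
    log_prob, alpha = np.log(c), alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction step
        c = alpha.sum()
        log_prob += np.log(c)
        alpha /= c
    return log_prob

def recognize(obs, emotion_hmms):
    """Return the emotion whose HMM gives the sample maximum likelihood."""
    scores = {emo: forward_log_likelihood(obs, *params)
              for emo, params in emotion_hmms.items()}
    return max(scores, key=scores.get)
```

Here `emotion_hmms` would map each of the five emotions to its trained `(pi, A, B)` triple; Viterbi decoding is analogous, with the sum in the induction step replaced by a max over predecessor states.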

#### **4.2.3 HMMs fusion system**

Because of the complexity of speech emotion recognition, single-classifier systems have limited performance. In recent years, classifier fusion has proved effective and efficient: by exploiting the complementary information provided by the constituent classifiers, classifier fusion offers improved performance. Classifier fusion can be done at two different levels, namely score level and decision level. In score-level fusion, the raw outputs (scores or confidence levels) of the individual classifiers are combined in a certain way to reach a global decision; the combination can be performed simply, using the sum rule or the averaged sum rule, or more sophisticatedly, using another classifier. Decision-level fusion, on the other hand, arrives at the final classification decision by combining the decisions of the individual classifiers. Voting is a well-known technique for decision-level fusion; it can mask errors from one or more classifiers and make the system more robust. Voting strategies include majority, weighted voting, plurality, instant-runoff voting, threshold voting, and the more general weighted k-out-of-n systems. In this paper, four HMM classifiers with different feature vectors (see Table 2) are used. An HMM classifier takes only the emotion that best satisfies the model as the recognition result, but the correct result is often the emotion that satisfies the model second or third best. Therefore, a new algorithm named weighted ranked voting, a modified version of the ranked voting method introduced by C. de Borda, is proposed. The ranked voting method permits a voter to rank more than one candidate in order of preference. Moreover, the improved algorithm attaches different weights to the voted emotions.


| Classifier | Feature vector |
|:---:|:---|
| 1 | pitch, box-dimension, energy, with their first and second derivatives; 10-order LPCC |
| 2 | energy-frequency-value, box-dimension, formant, with their first and second derivatives; 12-order MFCC |
| 3 | pitch, zero cross ratio, formant, with their first and second derivatives; 12-order MFCC |
| 4 | all features extracted in the paper |

Table 2. Feature vector.
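The four feature sets can also be written down as configuration data; the following is a sketch with illustrative identifiers, in which row 4 of Table 2 is read as the union of all features extracted in the paper:

```python
def with_derivs(*names):
    """Each base feature together with its first and second derivatives."""
    return [n + s for n in names for s in ("", "_d1", "_d2")]

# Feature vectors of the four HMM classifiers (Table 2); names are illustrative.
CLASSIFIER_FEATURES = {
    1: with_derivs("pitch", "box_dimension", "energy") + ["lpcc_10"],
    2: with_derivs("energy_freq_value", "box_dimension", "formant_1") + ["mfcc_12"],
    3: with_derivs("pitch", "zero_cross_ratio", "formant_1") + ["mfcc_12"],
    4: with_derivs("pitch", "energy", "energy_freq_value", "box_dimension",
                   "zero_cross_ratio", "formant_1") + ["lpcc_10", "mfcc_12"],
}
```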

For a speech sample and a classifier, the voting weight of a certain emotion is determined by the likelihood between the speech sample and the HMM corresponding to that emotion. First, the likelihood values between the speech sample and the HMMs are calculated. Second, the emotion states are sorted according to likelihood. Then, voting weights are allocated to the first three emotions according to this order; in this paper, the weights are set as shown in Table 3. Finally, the weights from the four classifiers are summed for each emotion, and the emotion with the maximum value is selected as the result.


| Rank | First | Second | Third |
|:---|:---:|:---:|:---:|
| Weight | 1 | 0.6 | 0.3 |

Table 3. Weight Allocation for Voting.

The steps, for each speech sample, are listed as follows.

step1: Initialize the weight value as 0 for each emotion.

step2: Sort the emotions according to likelihood for each classifier.

step3: For each classifier, vote for the first three emotions, attaching the weights given in Table 3.

step4: Sum up the weights from the four classifiers for each emotion and choose the emotion with the biggest weight sum as the recognition result.
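A minimal sketch of steps 1-4 in Python (illustrative names; the per-classifier scores are assumed to be the log-likelihoods exported by the four HMM classifiers):

```python
RANK_WEIGHTS = (1.0, 0.6, 0.3)  # first/second/third-place weights from Table 3

def weighted_ranked_vote(per_classifier_scores):
    """per_classifier_scores: one dict per classifier, mapping each
    emotion to the log-likelihood of the sample under that emotion's HMM."""
    totals = {}                                                # step 1
    for scores in per_classifier_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)  # step 2
        for emotion, w in zip(ranked[:3], RANK_WEIGHTS):       # step 3
            totals[emotion] = totals.get(emotion, 0.0) + w
    return max(totals, key=totals.get)                         # step 4
```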


To evaluate the performance of the proposed classifier, the Database of Emotional Speech was set up to provide speech samples. The corpus contains utterances of five emotions and twenty texts, spoken by five actors, two male and three female. Each speaker repeats each text three times in each emotion, that is, sixty utterances per emotion from each speaker. For classifier evaluation, 1,140 assessed samples from eight speakers are used. The evaluation was done in a "leave-one-speaker-out" manner. One feature vector, formed by the six relative features combined with LPCC or MFCC, is used. The experimental results are listed in Table 4.

**4.3 Text**

Text is an important modality for learner-tutor interaction: many ITSs provide a function that enables the tutor to chat with the student or to assist the student with theoretical questions. Studying the relationship between natural language and affective information, as well as assessing the affective qualities that underpin natural language, is therefore also important.
