to actions of the agent tutor. We refined our tutoring strategy module by means of questionnaires presented to teachers. In the questionnaires we presented several tutoring scenarios and asked the teachers to give the appropriate pedagogical and affective action for each scenario. The affective action includes the facial expression, the emotional speech synthesis and the text produced by the Artificial Intelligence Markup Language (AIML) Retrieval Mechanism. The architecture of MITS is shown in Figure 1.

Fig. 1. Architecture of MITS.

#### **3. Attention information**

Being "a window to the mind", the eye and its movements are tightly coupled with human cognitive processes. In this paper, we use the iView RED eye tracker from SMI (http://www.smivision.com/) to follow the student's gaze. Eye movement provides an indication of the student's interest and focus of attention. Screen areas that may trigger a system response when being looked at (or not looked at) are called "interest areas". Figure 2 illustrates one example of the interest areas. For each interest area, an interest score is calculated; when the score for an area exceeds a threshold, the agent reacts if a reaction is defined.

Fig. 2. Example of "interest areas".

The key functionality of the attention information processing module in our MITS is characterized by three main components, all based on a modified version of the algorithm described by Qvarfordt (Qvarfordt & Zhai, 2005), where it was used for an intelligent virtual tourist information environment (iTourist). Two interest metrics were developed there: (1) the Interest Score (IScore) and (2) the Focus of Interest Score (FIScore). The IScore determines an area's "arousal" level, i.e., the likelihood that the user is interested in it; when the IScore passes a certain threshold, the area is said to become "active". The FIScore measures how well the user keeps up his or her interest in an active area; if the FIScore of an active area falls below a certain threshold, the area is deactivated and a new active area is selected based on the IScore. Given the key functionality of the attention information processing module in our system, a simplified version of the IScore metric is sufficient for our purpose. The basic component of the IScore is the eye-gaze intensity $p$:

$$p = \frac{T\_{ISon}}{T\_{IS}}\tag{1}$$



where $T\_{ISon}$ is the accumulated gaze duration within a moving time window of size $T\_{IS}$ (1000 ms in our system). To account for further factors that may relate to the user's interest, Qvarfordt characterized the IScore as $p\_{is} = p(1 + \alpha(1 - p))$, where $p\_{is}$ is the arousal level of the area and $\alpha$ is the excitability modification defined below (Qvarfordt & Zhai, 2005):

$$\alpha = \frac{\mathbf{c}\_f \alpha\_f + \mathbf{c}\_c \alpha\_c + \mathbf{c}\_s \alpha\_s + \mathbf{c}\_a \alpha\_a}{\mathbf{c}\_f + \mathbf{c}\_c + \mathbf{c}\_s + \mathbf{c}\_a} \tag{2}$$

where $c\_f$, $c\_c$, $c\_s$ and $c\_a$ are empirically adjusted constants and the factors are defined as:

- $\alpha\_f$ is the frequency of the user's eye gaze entering and leaving the area;
- $\alpha\_c$ is the categorical relationship with the previous active area;
- $\alpha\_s$ is the size of the area relative to a baseline area;
- $\alpha\_a$ records previous activation of the area.
We modified Equation (2) so that only $\alpha\_f$, $\alpha\_s$ and $\alpha\_a$ are integrated into MITS. The factor $\alpha\_f$ is represented as $\alpha\_f = N\_{sw}/N\_f$, where $N\_{sw}$ denotes the number of times the eye gaze enters and leaves the area and $N\_f$ denotes the maximum possible $N\_{sw}$ within the preset time window; $\alpha\_f$ serves as one indication of a user's interest in an area. Because of noise in the eye-movement signal, larger areas have a higher chance of being "hit" than smaller ones; $\alpha\_s$ is defined to compensate for this. It is represented as $\alpha\_s = S\_b/S$, where $S\_b$ is the size of the baseline area (the smallest of the common areas) and $S$ is the size of the current area. Finally, $\alpha\_a$ indicates whether the area has already been paid enough attention: $\alpha\_a = 1$ when it has, and $\alpha\_a = 0$ when it has not.
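To make the computation concrete, here is a minimal sketch of the simplified IScore for one interest area. The class, the constant values and the sample numbers are hypothetical illustrations, not values taken from MITS; only the 1000 ms window comes from the text.

```python
from dataclasses import dataclass

T_IS_MS = 1000.0                  # moving time window T_IS (1000 ms, as in the text)
C_F, C_S, C_A = 1.0, 1.0, 1.0     # empirical constants c_f, c_s, c_a (assumed values)

@dataclass
class InterestArea:
    size: float           # current area size S
    baseline_size: float  # baseline (smallest common) area size S_b
    n_switches: int       # N_sw: times the gaze entered and left the area
    max_switches: int     # N_f: maximum possible N_sw in the window
    gaze_on_ms: float     # T_ISon: accumulated gaze duration in the window
    seen_enough: bool     # whether the area already received enough attention

def iscore(area: InterestArea) -> float:
    """Simplified IScore p_is = p * (1 + alpha * (1 - p))."""
    p = area.gaze_on_ms / T_IS_MS                 # eye-gaze intensity, Eq. (1)
    a_f = area.n_switches / area.max_switches     # alpha_f = N_sw / N_f
    a_s = area.baseline_size / area.size          # alpha_s = S_b / S
    a_a = 1.0 if area.seen_enough else 0.0        # alpha_a is 1 or 0
    alpha = (C_F * a_f + C_S * a_s + C_A * a_a) / (C_F + C_S + C_A)  # Eq. (2), reduced
    return p * (1.0 + alpha * (1.0 - p))

# An area watched for 600 ms of the last second becomes "active" once its
# IScore exceeds the activation threshold (0.5 here is a hypothetical value).
area = InterestArea(size=4.0, baseline_size=2.0, n_switches=2,
                    max_switches=10, gaze_on_ms=600.0, seen_enough=False)
print(iscore(area) > 0.5)
```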

#### **4. Affective information**

Our interest in integrating emotion into tutoring systems is motivated by social cognitive theory, which suggests that learning takes place through a complex interplay between cognitive and affective dimensions. Research in cognitive science argues that emotion enables people to communicate efficiently by monitoring and regulating social interaction and by evaluating and modifying emotional experiences. ITS would be significantly enhanced if computers could adapt to the affective state of the student. In order to


get an idea of the effectiveness of machine-based emotion recognition compared to humans, we turn to the review by Huang (Huang & Chen, 1998), who investigated the performance of machine-based emotion recognition employing both video and audio information. Their work was based on the human performance results reported by DeSilva (DeSilva & Miyasato, 1997). Their research indicated that machine performance was on average better than human performance, with 75% accuracy. In addition, a comparison of the confusion patterns indicated similarities between machine and human. These results are encouraging in the context of our research on integrating multimodal affective interaction into tutoring systems. Although the term "affective tutoring systems" can be traced back as far as Picard's book "Affective Computing" in 1997, to date only a few projects have explicitly considered emotion in ITS. Moreover, the existing projects are single-channel and mainly concentrate on facial expression recognition. In this paper, we detect the student's emotion through facial expression, speech and text, which are the main carriers of human emotion. The following subsections give a brief description of our methods for capturing emotion through these three channels.

#### **4.1 Facial expression**

Facial expression recognition has attracted significant interest in the scientific community due to its importance for human-centered interfaces, and many researchers have integrated facial expressions into ITS (Reategui & Boff, 2008; Roger, 2006; Sarrafzadeh & Alexander, 2008). However, the performance of facial expression recognition can be degraded by occlusion of the face caused by pose variation, glasses, hair or hand covering, etc. The ability to handle occluded facial features is essential for robust facial expression recognition. In contrast to standard methods, which do not treat occluded regions separately, our approach detects and eliminates facial occlusions before classification; a facial occlusion removal procedure is thus added to the normal classification procedure. Specifically, we propose a novel method for partial occlusion removal that iterates facial occlusion detection and reconstruction, using saliency detection and RPCA, until no occlusion is detected. The reconstructed, occlusion-free face is then fed to an AdaBoost classifier for robust facial expression recognition, as shown in Figure 3.

Fig. 3. Workflow of the robust facial expression recognition.

#### **4.1.1 Face reconstruction using RPCA**

Robust principal component analysis (RPCA) is robust to outliers (i.e., artifacts due to occlusion, illumination, image noise, etc.) in the training data and can be used to construct low-dimensional linear-subspace representations from noisy data. When only a small fraction of the face is covered, e.g., glasses on the eyes, hair over the forehead, or a hand around the chin, the pixels corresponding to those coverings are likely to be treated as outliers by RPCA. Hence, the image reconstructed from the original face will likely not contain the occlusions.
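For concreteness, here is a compact sketch of one standard way to compute RPCA: principal component pursuit solved by an inexact augmented-Lagrange-multiplier iteration. This is a generic formulation with our own parameter choices, not necessarily the solver used in the chapter. Stacking vectorized training faces as columns of `M`, the low-rank part `L` provides the clean reconstruction while the sparse part `S` absorbs outliers such as occluded pixels.

```python
import numpy as np

def rpca(M, max_iter=200, tol=1e-7):
    """Principal component pursuit: split M into L (low-rank) + S (sparse)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard sparsity weight
    mu = 1.25 / np.linalg.norm(M, 2)        # common step-size heuristic
    norm_M = np.linalg.norm(M, "fro")
    Y = np.zeros_like(M)                    # Lagrange multipliers
    S = np.zeros_like(M)

    def soft(X, t):                         # entrywise soft-thresholding
        return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

    for _ in range(max_iter):
        # L-update: singular-value thresholding of (M - S + Y/mu)
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft(sig, 1.0 / mu)) @ Vt
        # S-update: soft-threshold the residual toward sparsity
        S = soft(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y += mu * R                         # dual ascent on the constraint M = L + S
        if np.linalg.norm(R, "fro") <= tol * norm_M:
            break
    return L, S
```

The column of `L` corresponding to the input face then gives the reconstruction used in the following steps.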


#### **4.1.2 Occlusion detection using saliency detection**

To find the occluded regions of the face, we adopt a saliency detection method. First, the original face image is converted to gray level and normalized to $I(x,y)$ using histogram equalization. Then, $I(x,y)$ is reconstructed to $R(x,y)$ using RPCA, and the residual image $D(x,y)$ between the reconstructed image $R(x,y)$ and $I(x,y)$ is obtained by:

$$D(x,y) = \left| R(x,y) - I(x,y) \right| \tag{3}$$

The residual image $D(x,y)$ is then passed to a saliency detector to find local regions of high complexity, which are hypothesized to be occlusions on the face. The measure of local saliency is defined as:

$$H\_{D,R\_x} = -\sum\_{i} P\_{D,R\_x}\left( d\_i \right) \log\_2 P\_{D,R\_x}\left( d\_i \right) \tag{4}$$

where $P\_{D,R\_x}(d\_i)$ is the probability of the descriptor (here, the residual image) $D$ taking the value $d\_i$ in the local region $R\_x$. We apply saliency detection to the residual image over a wide range of scales and set a threshold on $H\_{D,R\_x}$. The region with the largest $H\_{D,R\_x}$ value above the threshold is taken as the occlusion region; if no region exceeds the threshold, we presume that no occlusion exists. Note that we select only one occlusion region per run of the saliency detection, even if multiple regions have saliency values above the threshold.
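A minimal sketch of this entropy measure follows, assuming the residual image is normalized to [0, 1] and scanning a single patch scale (a full implementation would scan several scales, as the text describes). The patch size, stride, histogram bin count and threshold are assumed values, not the chapter's settings.

```python
import numpy as np

def local_entropy(patch, bins=32):
    """Shannon entropy of pixel values in a residual-image patch, Eq. (4)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]                       # 0 * log(0) terms contribute nothing
    return float(-(p * np.log2(p)).sum())

def detect_occlusion(D, patch=16, stride=8, threshold=3.5):
    """Return (row, col, size) of the most salient patch of residual image D,
    or None when no patch exceeds the entropy threshold."""
    best, best_h = None, threshold
    rows, cols = D.shape
    for r in range(0, rows - patch + 1, stride):
        for c in range(0, cols - patch + 1, stride):
            h = local_entropy(D[r:r + patch, c:c + patch])
            if h > best_h:             # keep only the single most salient region
                best, best_h = (r, c, patch), h
    return best
```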

#### **4.1.3 Occlusion region reconstruction**

Detailed information is crucial for facial expression recognition. To avoid introducing spurious information through face reconstruction in non-occluded regions, we reconstruct only the occlusion region rather than the whole face. To obtain the new face image $P(x,y)$, the pixel values of the detected occlusion region are replaced by those of the face reconstructed with RPCA. Thus, the wrong information in the occlusion region is shielded while the other regions of the face remain unchanged. To further decrease the impact of occlusion on facial expression recognition, we repeat the occlusion region reconstruction until the difference between the reconstructed faces of two successive iterations falls below a threshold. The new face image $P\_t(x,y)$ in iteration $t$ is obtained by:

$$P\_t\left(x,y\right) = \begin{cases} I\left(x,y\right) & \left(x,y\right) \notin R\_{\text{occlusion}}\\ R\_t\left(x,y\right) & \left(x,y\right) \in R\_{\text{occlusion}} \end{cases} \tag{5}$$

where $I(x,y)$ is the normalized image, $R\_t(x,y)$ is the image reconstructed by RPCA in iteration $t$, and $R\_{\text{occlusion}}$ is the occlusion region. Note that

$$R\_t\left(x,y\right) = \begin{cases} RPCA\left(I\right) & t=1\\ RPCA\left(P\_{t-1}\right) & t>1 \end{cases} \tag{6}$$

where $RPCA(\cdot)$ designates the RPCA reconstruction procedure and $t$ is the iteration index.
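Putting Equations (5) and (6) together, the whole occlusion removal loop might look like the sketch below. Here `rpca_reconstruct` and `detect_occlusion` stand for the procedures of Sections 4.1.1 and 4.1.2, with `detect_occlusion` assumed to return a boolean mask derived from the detected patch; the stopping threshold and iteration cap are our assumptions.

```python
import numpy as np

def remove_occlusions(I, rpca_reconstruct, detect_occlusion,
                      diff_threshold=1e-3, max_iter=10):
    """Iterative occlusion removal following Eqs. (5)-(6).

    I                -- normalized gray-level face image (values in [0, 1])
    rpca_reconstruct -- RPCA low-rank reconstruction of a face (Section 4.1.1)
    detect_occlusion -- boolean occlusion mask from the residual, or None
                        when no region exceeds the saliency threshold (4.1.2)
    """
    P = I.copy()
    occluded = np.zeros_like(I, dtype=bool)     # union of detected regions
    for t in range(max_iter):
        R = rpca_reconstruct(P)                 # Eq. (6): RPCA(I), then RPCA(P_{t-1})
        mask = detect_occlusion(np.abs(R - P))  # saliency on the residual, Eqs. (3)-(4)
        if mask is None:                        # no occlusion left: stop
            break
        occluded |= mask
        P_next = np.where(occluded, R, I)       # Eq. (5): replace occluded pixels only
        if np.abs(P_next - P).mean() < diff_threshold:
            P = P_next                          # converged between two iterations
            break
        P = P_next
    return P
```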

#### **4.1.4 AdaBoost classification**

We employ Haar-like features for feature extraction and implement multiple one-against-rest two-class AdaBoost classifiers for robust facial expression recognition. In this algorithm, multiple two-class classifiers are constructed from weak features selected to discriminate one class from the others. This sidesteps a problem of the traditional multi-class AdaBoost algorithm, in which weak features that discriminate among multiple classes are hard to select. The proposed algorithms were trained and tested on our Facial Expression Database, which consists of 57 university students aged 19 to 27 and includes videos with hand and glasses occlusion recorded while various facial expressions were displayed. We also randomly added occlusions to faces to generate occluded samples. The experiment results are listed in Table 1.

| Emotion      | anger | happiness | sadness | disgust | surprise | average |
|--------------|-------|-----------|---------|---------|----------|---------|
| Accuracy (%) | 83.5  | 85.3      | 70.6    | 75.0    | 73.3     | 77.5    |

Table 1. Facial expression recognition results.
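As an illustration of the one-against-rest scheme, the sketch below trains one binary AdaBoost per expression and predicts by the largest margin. It uses scikit-learn's stock `AdaBoostClassifier` over precomputed feature vectors as a stand-in; the chapter's Haar-like feature extraction and weak-feature selection are not reproduced, and the class name and parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

EXPRESSIONS = ["anger", "happiness", "sadness", "disgust", "surprise"]

class OneVsRestAdaBoost:
    """One binary AdaBoost per expression; prediction takes the highest margin."""

    def __init__(self, n_estimators=100):
        self.models = {e: AdaBoostClassifier(n_estimators=n_estimators)
                       for e in EXPRESSIONS}

    def fit(self, X, y):
        # X: (n_samples, n_features) feature vectors; y: expression labels
        for e, model in self.models.items():
            model.fit(X, (np.asarray(y) == e).astype(int))   # one-vs-rest targets
        return self

    def predict(self, X):
        # decision_function gives the signed margin of the "is this expression" class
        margins = np.column_stack([m.decision_function(X)
                                   for m in self.models.values()])
        return [EXPRESSIONS[i] for i in margins.argmax(axis=1)]
```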

#### **4.2 Speech emotion**

In student-tutor interaction, human tutors respond both to what a student says and to how the student says it. However, most tutorial dialogue systems cannot detect the student emotions and attitudes underlying an utterance. In this paper, we introduce speech emotion recognition into the ITS.

#### **4.2.1 Feature extraction and relative feature calculation**

Studies on the emotion of speech indicate that pitch, energy, duration, formants, Mel-frequency cepstrum coefficients (MFCC) and linear prediction cepstrum coefficients (LPCC) are effective absolute features for distinguishing certain emotions. In this paper, six basic features are extracted for each frame: pitch, amplitude energy, box-dimension, zero-crossing ratio, energy-frequency-value and first formant frequency, together with their first and second derivatives. In addition, 10-order LPCC and 12-order MFCC features are extracted. Although the absolute features of speech expressing the same emotion differ considerably across speakers, the feature changes induced by emotion stimulation differ relatively little. Hence, relative features, which reflect feature change, are more reliable than absolute features for emotion recognition. The relative features used in this paper capture alterations of pitch, energy and other features; they are obtained by computing the rate of change relative to neutral speech. Features of this kind are robust to speaker variation because their calculation involves normalization by the features of neutral speech. To compute the relative features, the reference features of the neutral version of each text and each speaker are obtained by calculating statistics of frame-based parameters. The statistics used here are the means of the dynamic features: pitch, amplitude energy, energy-frequency-value, box-dimension, zero-crossing ratio and first formant frequency, as well as their first and second derivatives. These statistic features are then used to normalize the corresponding dynamic features of each emotional speech sample, for both training and test samples. Assuming $Mf\_i$, $i = 1, 2, \ldots, 18$ are the reference features of the neutral version and $f\_i$, $i = 1, 2, \ldots, 18$ are the corresponding dynamic feature vectors, the relative feature vectors $f\_i^R$ are obtained by normalizing each $f\_i$ by the corresponding $Mf\_i$.
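This normalization can be written out as in the sketch below, which uses the relative-change form $(f\_i - Mf\_i)/Mf\_i$; this concrete form is one plausible reading of "computing the rate of change relative to neutral speech" and is our assumption, not necessarily the chapter's exact formula.

```python
import numpy as np

def relative_features(f, Mf, eps=1e-8):
    """Relative feature vectors f_i^R from dynamic features and neutral references.

    f   -- array of shape (18, n_frames): dynamic features f_i of one utterance
    Mf  -- array of shape (18,): per-speaker neutral reference means Mf_i
    eps -- guard against division by zero (our addition)

    Uses the relative-change form (f - Mf) / Mf as an assumed normalization.
    """
    Mf = Mf[:, None]                        # broadcast references over frames
    return (f - Mf) / (np.abs(Mf) + eps)

# Hypothetical usage: 18 feature trajectories over 100 frames
f = np.random.rand(18, 100)
Mf = np.random.rand(18) + 0.5
fR = relative_features(f, Mf)
print(fR.shape)                             # (18, 100)
```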
