**3. EHeBby reasoner**

The chatbot brain has been implemented using an extended version of the ALICE architecture ALICE (2011), one of the most widespread conversational agent technologies.

The ALICE dialogue engine is based on a pattern matching algorithm which looks for a match between the user's sentences and the information stored in the chatbot knowledge base. The ALICE knowledge base is structured with an XML-like language called AIML (Artificial Intelligence Mark-up Language). Standard AIML tags make it possible for the chatbot to understand user questions, give a proper answer, save and retrieve the values of variables, and store the context of the conversation. The basic item of knowledge in ALICE is the *category*, which represents a question-answer module composed of a *pattern* section, representing a possible user question, and a *template* section, which identifies the associated chatbot answer. The AIML reasoner has been extended by defining *ad hoc* tags for computational humor and emotional purposes.


The chatbot implements different features by means of specific reasoning areas, shown in figure 1. The areas called *Humor Recognition Area* and *Humor Evocation Area* deal with the recognition and generation of humor during the conversation with the user. A set of AIML files, representing the chatbot KB, is processed during the conversation. Humor recognition and generation features are triggered when the presence of specific AIML tags is detected. The humorous tags are then processed by a *Computational Humor Engine*, which in turn queries other knowledge repositories to analyze or generate humor during the conversation. In particular, the *AIML Computational Humor Engine* exploits both WordNet MultiWordNet (2010) and the pronouncing dictionary of the Carnegie Mellon University (CMU) CMU (2010) in order to recognize humorous features in the conversation, and a semantic space in order to retrieve humorous sentences related to the user utterances. The area called *Emotional Area* deals with the association of a chatbot emotional reaction to the user sentences. In particular, it allows for a binding of the conversation humor level with a set of *ad hoc* created emotional tags, which are processed by the *AIML Emotional Engine* in order to send the necessary information to the Talking Head. In the proposed model we have considered only three possible humor levels, and three corresponding emotional expressions.

#### **3.1 AIML KB**

The AIML knowledge base of our humorous conversational agent is composed of five kinds of AIML categories:

1. the standard set of ALICE categories, which are suited to manage a general conversation with the user;

2. a set of categories suited to generate humorous sentences by means of jokes. The generation of humor is obtained by writing specific funny sentences in the template of the category;

3. a set of categories suited to retrieve humorous or funny sentences through the comparison between the user input and the sentences mapped in a semantic space belonging to the evocative area. The chatbot answers with the sentence which is semantically closest to the user input;

4. a set of categories suited to recognize a humorous intent in the user sentences. This feature is obtained by connecting the chatbot knowledge base to other resources, like the WordNet lexical dictionary MultiWordNet (2010) and the CMU pronouncing dictionary CMU (2010);

5. a set of categories suited to generate emotional expressions in the talking head.
#### **3.2 Humour recognition area**

The humour recognition consists in the identification, inside the user sentences, of particular humorous text features. Following Mihalcea and Strapparava Mihalcea et al. (2006), we focus on three main humorous features: alliteration, antinomy and adult slang. Special tags inserted in the AIML categories allow the chatbot to execute modules aimed at detecting the humorous features.

#### **3.2.1 Alliteration recognition module**

The phonetic effect induced by alliteration, the rhetorical figure consisting in the repetition of a letter, a syllable or a phonetic sound in consecutive words, captures the attention of people listening to it, often producing a funny effect Mihalcea et al. (2006).


This module removes punctuation marks and stopwords (i.e. words that do not carry any meaning) from the sentence, and then analyzes its phonetic transcription, obtained by using the CMU dictionary CMU (2010). This technique is aimed at discovering possible repetitions of the beginning phonemes in subsequent words. In particular, the module searches for the presence of at least three words that have in common the first phoneme, the first two phonemes or the first three phonemes. As an example the module considers the following humorous sentences:

Veni, Vidi, Visa: I came, I saw, I did a little shopping

Infants don't enjoy infancy like adults do adultery

The module detects in the first sentence three words having the first phoneme in common, and in the second sentence two pairs of words having the first three phonemes in common. The words infancy and infants share the initial phonemes *ih1 n f ah0 n*, while the words adultery and adults begin with the phonemes *ah0 d ah1 l t*.
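A minimal Python sketch of this check, assuming the CMU pronouncing dictionary and the English stopword list are accessed through NLTK; function and parameter names are illustrative, not the actual EHeBby code:

```python
import string
from collections import Counter

from nltk.corpus import cmudict, stopwords  # requires the nltk "cmudict" and "stopwords" data

PRON = cmudict.dict()
STOP = set(stopwords.words("english"))

def initial_phonemes(word, n):
    """Return the first n phonemes of the word's first CMU pronunciation, or None if unknown."""
    prons = PRON.get(word.lower())
    return tuple(prons[0][:n]) if prons else None

def has_alliteration(sentence, prefix_len=1, min_words=3):
    """True if at least `min_words` content words share their first `prefix_len` phonemes."""
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w and w.lower() not in STOP]
    prefixes = Counter(p for w in words if (p := initial_phonemes(w, prefix_len)))
    return any(count >= min_words for count in prefixes.values())

# Two pairs of words sharing their first three phonemes (infants/infancy, adults/adultery).
print(has_alliteration("Infants don't enjoy infancy like adults do adultery",
                       prefix_len=3, min_words=2))
```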

#### **3.2.2 Antinomy recognition module**

This module, which detects the presence of antinomies in a sentence, has been developed by exploiting the lexical dictionary WordNet. In particular, the module searches a sentence for:

• a direct antinomy relation among nouns, verbs, adverbs and adjectives;

• an extended antinomy relation, which is an antinomy relation between a word and a synonym of its antonym. The relation is restricted to the adjectives;

• an indirect antinomy relation, which is an antinomy relation between a word and an antonym of its synonym. The relation is restricted to the adjectives.

The following humorous sentences contain an antinomy relation:

A clean desk is a sign of a cluttered desk drawer

Artificial intelligence usually beats real stupidity
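A compact sketch of the direct-antinomy check, assuming WordNet is accessed through NLTK; the extended and indirect relations described above would add one further hop through the synonyms of each word. Names are illustrative:

```python
from itertools import combinations

from nltk.corpus import wordnet as wn  # requires the nltk "wordnet" data

def antonyms(word):
    """Antonym lemma names of a word, collected over all of its WordNet senses."""
    return {
        ant.name().lower()
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
        for ant in lemma.antonyms()
    }

def has_direct_antinomy(sentence):
    """True if two words of the sentence are direct WordNet antonyms of each other."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return any(b in antonyms(a) for a, b in combinations(set(words), 2))

print(has_direct_antinomy("Artificial intelligence usually beats real stupidity"))
```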

#### **3.2.3 Adult slang recognition module**

This module analyzes the presence of adult slang by searching a set of pre-classified words. As an example, the following sentences are reported:

The sex was so good that even the neighbors had a cigarette

Artificial Insemination: procreation without recreation
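Since this module reduces to a lookup in a pre-classified word set, a sketch is short; the lexicon below is a placeholder, as the actual EHeBby word list is not reported here:

```python
# Placeholder lexicon: the real pre-classified adult-slang word set is not published in the text.
ADULT_SLANG = {"sex", "erotic", "nude"}

def contains_adult_slang(sentence):
    """True if any lowercased, punctuation-stripped word of the sentence is in the pre-classified set."""
    words = (w.strip(".,;:!?").lower() for w in sentence.split())
    return any(w in ADULT_SLANG for w in words)

print(contains_adult_slang("The sex was so good that even the neighbors had a cigarette"))
```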

#### **3.3 Humor evocation area**

This area allows the chatbot to evoke funny sentences that are not directly coded as AIML categories, but that are encoded as vectors in a semantic space created by means of Latent Semantic Analysis (LSA) Dumais & Landauer (1997). If none of the features characterizing a humorous phrase is recognized in the sentence through the humor recognition area, the user question is mapped into the semantic space. The humor evocation area then computes the semantic similarity between what is said by the user and the sentences encoded in the semantic space; subsequently it tries to answer the user with a funny expression which is conceptually close to the user input. This procedure makes it possible to go beyond the rigid pattern-matching rules, generating the funny answers which best semantically fit the user query.


#### **3.3.1 Semantic space creation**

A semantic representation of funny sentences has been obtained by mapping them into a semantic space. The semantic space has been built according to a Latent Semantic Analysis (LSA) based approach described in Agostaro (2005); Agostaro (2006). According to this approach, we have created a semantic space by applying the truncated singular value decomposition (TSVD) to an *m* × *n* co-occurrence matrix obtained by analyzing a specific corpus of humorous texts, where the *(i, j)-th* entry of the matrix is the square root of the number of times the *i-th* word appears in the *j-th* document.

After the decomposition we obtain a representation of words and documents in the reduced semantic space. Moreover, we can automatically encode new items in the space, such as sentences inserted into AIML categories, humorous sentences and user utterances: a vectorial representation is obtained by evaluating the sum of the vectors associated to the words composing each sentence.

To evaluate the similarity between two vectors *vi* and *vj* belonging to this space we use, following Agostaro et al. Agostaro (2006), the similarity measure:

$$\mathrm{sim}\left(v_i, v_j\right) = \begin{cases} \cos^2\left(v_i, v_j\right) & \text{if } \cos\left(v_i, v_j\right) \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The closer this value is to 1, the higher is the similarity grade. The geometric similarity measure between two items establishes a semantic relation between them. In particular given a vector **s**, associated to a user sentence s, the set CR(s) of vectors sub-symbolically conceptually related to the sentence s is given by the q vectors of the space whose similarity measure with respect to **s** is higher than an experimentally fixed threshold T.

$$CR(\mathbf{s}) = \left\{\, v_i \mid \mathrm{sim}(\mathbf{s}, v_i) > T \,\right\} \quad \text{with} \quad i = 1 \ldots q \tag{2}$$

To each of these vectors corresponds a funny sentence used to build the space. Specific AIML tags called *relatedSentence* and *randomRelatedSentence* allow the chatbot to query the semantic space to retrieve, respectively, the riddle semantically closest to the user query or one of the most conceptually related riddles. The chatbot can also improve its own AIML KB by mapping into the evocative area new items, like jokes and riddles, introduced by the user during the dialogue.
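The following Python sketch walks through the pipeline on a toy corpus with numpy/scipy; the corpus, the number of retained dimensions *k* and the threshold *T* are illustrative assumptions, not the EHeBby settings:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy corpus of humorous sentences (one document per sentence).
corpus = [
    "veni vidi visa i came i saw i did a little shopping",
    "artificial intelligence usually beats real stupidity",
    "a clean desk is a sign of a cluttered desk drawer",
]
vocab = sorted({w for doc in corpus for w in doc.split()})
w_index = {w: i for i, w in enumerate(vocab)}

# m x n co-occurrence matrix whose (i, j) entry is the square root of the word count.
A = np.zeros((len(vocab), len(corpus)))
for j, doc in enumerate(corpus):
    for w in doc.split():
        A[w_index[w], j] += 1.0
A = np.sqrt(A)

# Truncated SVD: word vectors are the rows of U_k * S_k.
k = 2
U, S, Vt = svds(A, k=k)
word_vectors = U * S

def encode(sentence):
    """Fold a new sentence into the space as the sum of its word vectors."""
    idx = [w_index[w] for w in sentence.lower().split() if w in w_index]
    return word_vectors[idx].sum(axis=0) if idx else np.zeros(k)

def sim(vi, vj):
    """Eq. (1): squared cosine when the cosine is non-negative, zero otherwise."""
    c = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-12)
    return c ** 2 if c >= 0 else 0.0

def related(sentence, T=0.5):
    """Eq. (2): corpus sentences whose similarity with the user sentence exceeds T."""
    s = encode(sentence)
    return [doc for doc in corpus if sim(s, encode(doc)) > T]

print(related("do you think artificial intelligence is real"))
```

In the actual system the space is built once from the humorous corpus, while user utterances and new riddles are folded in at conversation time as described above.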

#### **3.4 Emotional area**

This area is suited to the generation of emotional expressions in the Talking Head. Many possible models of emotions have been proposed in the literature. We can distinguish three different categories of models. The first one includes models describing emotions through collections of different dimensions (intensity, arousal, valence, unpredictability, potency, ...). The second one includes models based on the hypothesis that a human being is able to express only a limited set of primary emotions; the whole range of human emotions should be the result of the combination of the primary ones. The last category includes mixed models, according to which an emotion is generated by a mixture of basic emotions parametrized by a set of dimensions. One of the earliest models of the second category is the model of Plutchik Ekman (1999). He listed the following primary emotions: acceptance, anger, anticipation, disgust, joy, fear, sadness, surprise. These emotions can be combined to produce secondary emotions, and those in turn can be combined to produce ternary emotions. Each emotion can be characterized by an intensity level. After this pioneering model, many other similar models have been developed. An interesting overview can be found in Ortony (1997). Among the models cited in Ortony (1997), the model by Ekman has been chosen as the basis for our work. According to Ekman's model, there are six primary emotions: anger, disgust, fear, joy, sadness, surprise.

We have developed a reduced version of this model, including only three of the listed basic emotions: anger, joy, sadness. We selected them as a basis to express humor. At this moment our agent is able to express one of these three emotions at a time, with a variable intensity level. The emotional state of the agent is represented by a couple of values: the felt emotion and its corresponding intensity. The state is established on the basis of the humor level detected in the conversation. As said above, there are only three possible values for the humor level. These levels have to correspond to a specific emotion in the chatbot, with an intensity level. The correspondence should be defined according to a collection of psychological criteria. At this moment, the talking head has a predefined behavior for its humorist attitude, useful to express these humor levels. Each level is expressed with a specific emotion at a certain intensity level. These emotional patterns represent a default behavior for the agent. The programmer can create a personal version of the emotional behavior by defining different correspondences between humor levels and emotional intensities. Moreover, he can also program specialized behaviors for single steps of the conversation or single witticisms, as exceptions to the default one.


The established emotional state has to be expressed by prosody and facial expressions. Both of them are generated by the *Emotional Area*. This task is launched by *ad hoc* AIML tags.

#### **4. EHeBby talking head**

Our talking head is conceived to be a multi-platform system that is able to speak several languages, so that various implementations have been realized. In what follows the different components of our model are presented: model generation, animation technique, coarticulation, and emotion management.

#### **4.1 Face model generation**

The FaceGen Modeler FaceGen (2010) has been used to generate graphic models of the 3D head. FaceGen is a special tool for the creation of 3D human heads and characters as polygon meshes. The facial expressions are controlled by means of numerical parameters. Once the head is created, it can be exported as a Wavefront Technologies .obj file containing the information about the vertexes, normals and textures of the facial mesh. The .obj format is compliant with the most popular high level graphics libraries such as Java3D and OpenGL. A set of faces with different poses is generated to represent each "viseme", which is related to a phoneme or a group of phonemes. A phoneme is the elementary speech sound, that is the smallest phonetic unit in a language; indeed, spoken language can be thought of as a sequence of phonemes. The term "viseme" appeared in the literature for the first time in Fischer (1968) and is the facial-gesture counterpart of the phoneme: the viseme is the facial pose obtained by the articulatory movements during the phoneme emission. Emotional expressions can also be generated by FaceGen. In our work we have implemented just 4 of the Ekman basic emotions Ekman & Friesen (1969): joy, surprise, anger, sadness. The intensity of each emotion can be controlled by a parameter, and emotions can be mixed with each other, so that a variety of facial expressions can be obtained. Such "emotional visemes" will be used during the animation task. Some optimizations can be performed to decrease the amount of memory necessary to store such a set of visemes. Just the head geometry can be loaded from the .obj file, while lights and virtual camera parameters are set within the programming code. A part of the head mesh can be loaded as a background mesh, and afterwards the 3 sub-meshes referring to face, tongue and teeth are loaded.


Indeed, these 3 parts of the head are the ones really involved in the animation. The amount of vertexes can be reduced with a post-processing task, with a related decrease of quality which is not severe if this process involves the back and top sides of the head. Moreover, for each polygon mesh a texture should be loaded, but all the meshes can use the same image file as texture to save memory. A basic viseme can provide both the image texture and the texture coordinates to allow the correct positioning of the common texture for the other ones.

#### **4.2 Animation**

The facial movements are performed by morphing. Morphing starts from a sequence of geometry objects called "keyframes": each keyframe's vertex translates from its position to the position of the corresponding vertex in the subsequent keyframe. For this reason we have to generate a set of visemes instead of modifying a single head geometric model. Such an approach is less efficient than an animation engine able to modify the shape according to facial parameters (tongue position, labial protrusion and so on), but it strongly simplifies the programming effort: the whole mesh is considered in the morphing process, and efficient morphing engines are widely available in computer graphics libraries. Various parameters have to be set to control each morphing step between two keyframes, i.e. the translation time. In our animation scheme, the keyframes are the visemes related to the phrase to be pronounced, but they cannot be inserted in the sequence without considering the facial coarticulation, which is needed to obtain realistic facial movements. Coarticulation is the natural modification of the facial muscles that generates a succession of fundamental facial movements during phonation. The Löfqvist gestural model described in Löfqvist (1990) controls the audio-visual synthesis; such a model defines the "dominant visemes", which influence both the preceding and the subsequent ones. Each keyframe must be blended dynamically with the adjacent ones. The next section is devoted to this task, showing a mathematical model for the coarticulation.

#### **4.2.1 Cohen-Massaro model**

The Cohen-Massaro model Cohen & Massaro (1993) computes the weights used to control the keyframe animation. Such weights determine the vertex positions of an intermediate mesh between two keyframes. The model is based on coarticulation, which is the influence of the adjacent speech sounds on the actual one during phonation. Such a phenomenon can also be considered for the interpolation of a frame, taking into account the adjacent frames so that the facial movements appear more natural. Indeed, the Cohen-Massaro model moves from the work by Löfqvist, where a speech segment shows a stronger influence on the articulation organs of the face than the adjacent segments do. Dominance is the name given to such an influence, and it can be mathematically defined as a time dependent function; in particular, an exponential function is adopted as the dominance function. The dominance function proposed in our approach is simplified with respect to the original one: it is symmetric. The profile of the dominance function for a given speech segment *s* and facial parameter *p* is expressed by the following equation:

$$D_{sp} = \alpha \cdot \exp\left(-\theta \, |\tau|^{c}\right) \tag{3}$$

where *α* is the peak for *τ* = 0, *θ* and *c* control the function slope, and *τ* is the time variable referred to the mid point of the speech segment duration. In our implementation we set *c* = 1 to reduce the number of parameters to be tuned. The dominance function reaches its maximum value (*α*) at the mid point of the speech segment duration, where *τ* = 0. In the present approach, we assume that the time interval of each viseme is the same as the duration of the respective phoneme. The coarticulation can be thought of as composed of two sub-phenomena: the pre- and the post-articulation.


The former consists in the influence of the present viseme on the facial parameters used to interpolate the preceding keyframe towards the present one (*τ*<0). The latter regards the dominance of the next viseme on the parameters used to morph the present keyframe towards the next one (*τ*>0). Our implementation does not make use of an animation engine to control the facial parameters (labial opening, labial protrusion and so on); instead, the interpolation process acts on the translation of all the vertexes in the mesh. The prosodic sequence S of time intervals [*ti*−1, *ti*[ associated to each phoneme can be expressed as follows:

$$S = \left\{\, f_1 \in [0, t_1[\;;\; f_2 \in [t_1, t_2[\;;\; \dots;\; f_n \in [t_{n-1}, t_n[ \,\right\} \tag{4}$$

A viseme is defined "active" when *t* falls into the corresponding time interval. The preceding and the following visemes are defined as "adjacent visemes". Due to the negative exponential nature of the dominance function, just the adjacent visemes are considered for computing weights. For each time instant, 3 weights must be computed on the basis of the respective dominance functions of 3 visemes at a time. The weights are computed as follows:

$$w_i(t) = D_i(t) = \alpha_i \cdot \exp\left(-\theta_i \cdot |t - \tau_i|\right) \tag{5}$$

where *τi* is the mid point of the *i*-th time interval. The weights *wi* must be normalized:

$$w'_i(t) = \frac{w_i(t)}{\sum_{j=-1}^{+1} w_{i-j}(t)} \tag{6}$$

so that for each time instant the coordinates of the interpolating viseme vertexes *v(l)int(t)* ∈ {*Vint(t)*} are computed as follows:

$$v^{(l)}_{int}(t) = \sum_{k=i-1}^{i+1} w'_{k}(t) \cdot v^{(l)}_{k}(t) \tag{7}$$

where the index *l* indicates corresponding vertexes in all the involved keyframes.

Our implementation also simplifies this computation: it is sufficient to determine the result of the coarticulation just for the keyframes, because the interpolation is obtained using directly the morphing engine with a linear control function. Once the dominance functions are determined, each coarticulated keyframe is computed, and its duration is the same as that of the corresponding phoneme.
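A small numpy sketch of equations (5)-(7), assuming each viseme is available as an array of vertex coordinates and that the parameters *α* and *θ* have already been tuned per viseme; all names and values are illustrative:

```python
import numpy as np

def dominance(t, tau, alpha, theta):
    """Eq. (5): exponential dominance of a viseme whose time interval is centred at tau."""
    return alpha * np.exp(-theta * abs(t - tau))

def blend_visemes(t, visemes, taus, alphas, thetas, i):
    """Blend the active viseme i with its two adjacent visemes at time t (eqs. 5-7).

    visemes: list of (n_vertices, 3) arrays; taus: mid points of the phoneme intervals.
    """
    ks = [i - 1, i, i + 1]
    w = np.array([dominance(t, taus[k], alphas[k], thetas[k]) for k in ks])
    w = w / w.sum()                                      # eq. (6): normalization
    return sum(wk * visemes[k] for wk, k in zip(w, ks))  # eq. (7): weighted vertex blend

# Toy example: three visemes of a 4-vertex mesh over three 100 ms phonemes.
visemes = [np.random.rand(4, 3) for _ in range(3)]
taus = [0.05, 0.15, 0.25]        # mid points of [0, 0.1[, [0.1, 0.2[, [0.2, 0.3[
alphas = [1.0, 1.0, 1.0]
thetas = [40.0, 20.0, 40.0]      # a steeper theta makes a viseme's dominance decay faster
frame = blend_visemes(0.12, visemes, taus, alphas, thetas, i=1)
print(frame.shape)               # (4, 3): one blended keyframe of the mesh
```

In the actual system only the coarticulated keyframes are computed this way; the in-between frames are then produced by the morphing engine with a linear control function, as noted above.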

#### **4.2.2 Diphthongs and dominant visemes**

A sequence of two adjacent vowels is called a diphthong; the word "euro" contains one diphthong. The vowels in a diphthong must be visually distinct as two separate entities, so the visemes belonging to the vowels of a diphthong must not influence each other; otherwise the two vowel visemes would not be distinguishable, due to their fusion. In order to avoid this problem, the slope of the dominance function belonging to each vocal viseme in a diphthong must be very steep (see Fig. 2). On the contrary, the sequence vowel-consonant requires a different profile of the dominance function. Indeed, the consonant is heavily influenced by the preceding vowel: a vowel must be dominant with respect to the adjacent consonants, but not with respect to other vowels. As shown in Fig. 3, the dominance of a vowel with respect to a consonant is obtained with a less steep curve than the consonant one.

Fig. 2. The dominance functions (a) and the weights diagram (b) for the diphthong case.

Fig. 3. The same as Fig. 2 for the vowel-consonant case.

#### **4.3 The emotional talking head**

Emotions can be considered as particular visemes, called emotional visemes. They must be "mixed" with the phonetic visemes to express an emotion during the facial animation. Such a process can be performed in two different ways. FaceGen can also generate the facial modifications that express an emotion, so a phonetic viseme can be modified using FaceGen to include an emotion. As a result, different sets of modified phonetic visemes can be produced, each of them differing in the type and intensity of a given emotion. Such a solution is very accurate, but it requires an adequate amount of memory and time to create a large emotional/phonetic visemes database. The second approach considers a single emotional viseme whose mesh vertex coordinates are blended with a viseme to produce a new keyframe. Even though such a solution is less accurate than the previous one, it is less expensive on the computational side, and it allows emotional and phonetic visemes to be included and mixed "on the fly" at run-time.

#### **4.4 Audio streaming synchronization**

Prosody contains all the information about the intonation and duration to be assigned to each phoneme in a sentence. In our talking head model, the prosody is provided by Espeak espeak (2010), a multilanguage and multiplatform tool that is able to convert text into a .pho prosody file. The Talking Head is intrinsically synchronized with the audio streaming because the facial movements are driven by the .pho file, which determines each phoneme (viseme) and its duration. Espeak provides a variety of options to produce the prosody for the language and speech synthesizer to use. As an example, it can generate a prosody control for the couple Italian/Mbrola, a speech synthesizer based on the concatenation of diphones: it takes as input a list of phonemes, together with prosodic information (duration and intonation), and produces a .wav audio file which is played during the facial animation.

#### **5. Some examples of interaction**

#### **5.1 Example of humorous sentences generation**

The following is an example of a humorous dialogue:

User: What do you think about robots?

EHeBby: Robots will be able to buy happiness, but in condensed chip form!!

The answer is obtained by writing an *ad hoc* AIML category:

<category>
<pattern>WHAT DO YOU THINK ABOUT ROBOTS</pattern>
<template>Robots will be able to buy happiness, but in condensed chip form!!
</template>
</category>

The *pattern* delimits what the user can say. Every time the *pattern* is matched, the corresponding *template* is activated.

#### **5.2 Example of humor recognition**

The recognition of humorous sentences is obtained using specific tags inserted into the template, as shown in the following categories:

<category>
<pattern>CAN I TELL YOU A JOKE</pattern>
<template>Yes you can</template>
</category>

<category>
<pattern>\*</pattern>
<that>YES YOU CAN</that>
<template>
<srai> <humorlevel> <star/> </humorlevel> </srai>
</template>
</category>

The second category is activated if the previous answer of the chatbot was "Yes you can" (according to the *that* tag behavior), and the *humorlevel* tag evaluates the level of humor of the sentence matched by the \* wildcard (i.e. what the user said). The humor level can assume three different values: *low*, *medium* and *high*. Depending on the humor level value, the category will recursively call, by means of the *srai* tag, another category, which will make explicit an emotional tag, including the information needed for the talking head expression, and a *prosody* tag to produce the prosody file. In particular, we have extended the AIML language to include three emotional tags: *joy*, *anger* and *sadness*. Each of them also includes a mandatory *intensity* attribute. The value assigned to the attribute is a measure of how much that emotion combines to produce the overall emotional state of the chatbot. The called tag links the proper
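A minimal Python sketch of the level-to-emotion mapping that such an emotional tag encodes; the text states that a default correspondence exists and that the programmer can override it, but the concrete emotions, intensity values and XML-like rendering below are illustrative assumptions:

```python
# Default mapping from conversation humor level to (emotion, intensity).
# The actual EHeBby defaults are not reported in the text; these values are placeholders.
DEFAULT_BEHAVIOR = {
    "low":    ("sadness", 0.3),
    "medium": ("joy", 0.5),
    "high":   ("joy", 0.9),
}

def emotional_tag(level, behavior=None):
    """Render an emotional tag (joy, anger or sadness) with its mandatory intensity attribute."""
    emotion, intensity = (behavior or DEFAULT_BEHAVIOR)[level]
    return f'<{emotion} intensity="{intensity}"/>'

# A specialized behavior for a single witticism can override the default one.
witticism_override = dict(DEFAULT_BEHAVIOR, high=("joy", 1.0))
print(emotional_tag("high", witticism_override))  # -> <joy intensity="1.0"/>
```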
