**5. Multimodal affect detection**

The database presented in [109] contains annotated video clips taken from movies. Although the emotional expressions are acted out by professional actors, they therefore take place in real-world environments (or at least realistic simulations of them). Because actors strive to mimic realistic behavior, these expressions are likely to be as subtle as naturally occurring ones, so we categorize this database as a natural one.

For the induced and natural databases, the measured sensory information must be labeled with the corresponding emotional information. The labels are usually obtained through subject self-assessment, observer/listener judgment, or FACS coding (manually coded facial expressions). Self-assessment is performed using tools such as the Self-Assessment Manikin (SAM) [110] or Feeltrace [111]. **Table 1** lists publicly accessible multimodal emotional databases. Most of the databases address the visual and audio modalities, while a few recent ones also introduce physiological channels.

| Reference | DB type | # Subjects | Modalities | Affects | Labeling |
|---|---|---|---|---|---|
| GEMEP (2012) [112] | Posed | 10 | Visual and audio | Amusement, pride, joy, relief, interest, pleasure, hot anger, panic fear, despair, irritation, anxiety, sadness, admiration, tenderness, disgust, contempt, and surprise | N/A |
| SAL (2008) [113] | Induced | 24 | Visual and audio | Dimensional and categorical labeling | Feeltrace |
| Belfast (2000) [114] | Natural | 24 | Visual and audio | Dimensional and categorical labeling | Feeltrace |
| HUMAINE (2007) [115] | Induced and natural\* | Multiple databases | Visual, audio, and physiological (ECG, skin conductance and temperature, and respiration) | Varies across databases | Observers' judgment + self-assessment |
| VAM (2008) [116] | Natural | 19 | Visual and audio | Dimensional labeling | SAM |
| SEMAINE (2010) [117] | Induced | 20 | Visual and audio | Dimensional labeling and six basic emotions | Observers' judgment |
| MIT (2005) [83] | Natural | 17 | Physiological (ECG, EMG, skin conductance, and respiration) | Stress (low, medium, and high) | Observers' judgment |
| DEAP (2012) [118] | Induced | 32 | Visual (for 22 subjects) and physiological (EEG, ECG, EMG, and skin conductance) | Dimensional labeling | SAM |
| MAHNOB-HCI (2012) [12] | Induced | 27 | Visual (face + eye gaze), audio, and physiological (EEG, ECG, skin conductance and temperature, and respiration) | Dimensional and categorical labeling | Self-assessment (SAM for arousal and valence) |

\*We concede that it does not perfectly fit in any of the three presented types.

**Table 1.** Publicly accessible multimodal emotional databases.

Humans display emotions through a variety of behaviors that are difficult for a machine to interpret fully. They modulate their facial muscles, eye gaze, body gestures, gait, and speech tone, among other channels of expression, to convey emotions. Understanding these emotional cues therefore requires a multisensory system that is able to track several or all of these channels.

Many multimodal affect-recognition schemes have been proposed. They generally differ in the modalities used, the classification method, the fusion mechanism, and the emotions recognized. In **Table 2**, we survey several representative multimodal affect-recognition studies. Facial-expression analysis features prominently in these studies, followed by speech prosody. However, there seems to be little agreement on the nature and number of the features that should be extracted for each modality.

All of the reviewed works consider only a subset of the features that could be extracted from the dataset. Effective feature selection is therefore required to simplify the classification models and to reduce training time and overfitting. Diverse automated techniques are employed for this purpose, such as the wrapper method [28], an analysis of variance (ANOVA)-based approach [12], sequential backward selection [7], minimum redundancy maximum relevance [121], and correlation-based feature selection [104]. Some works instead rely on expert knowledge [27, 106] for feature selection. Furthermore, several works reduce the dimensionality of the feature space using PCA [7, 10, 106].
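To make this step concrete, the following minimal sketch combines a univariate ANOVA F-test with PCA in a scikit-learn pipeline, in the spirit of the ANOVA-based selection [12] and PCA reduction [7, 10, 106] mentioned above. The feature matrix, the labels, and the choice of an SVM as the downstream classifier are illustrative assumptions, not details taken from any of the cited studies.

```python
# Hedged sketch: ANOVA-based feature selection followed by PCA, on synthetic data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 377))   # placeholder for 377 extracted multimodal features
y = rng.integers(0, 4, size=200)  # four hypothetical emotion classes

# The ANOVA F-test keeps the most class-discriminative features,
# then PCA compresses them before classification.
model = make_pipeline(
    SelectKBest(f_classif, k=60),
    PCA(n_components=20),
    SVC(kernel="rbf"),
)
print(cross_val_score(model, X, y, cv=5).mean())
```

The pipeline form matters: selection and projection are fitted inside each cross-validation fold, so the reported score does not leak information from the test partitions.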



| Reference | Modalities | Classifier\*\* | Features | Affects | DB type | Overall recognition rate\* |
|---|---|---|---|---|---|---|
| Castellano et al. [28] | Visual (face, body) and audio | BN | **Face:** statistical values from FAPs and their derivatives. **Body:** quantity of motion and contraction index of the body, velocity, acceleration, and fluidity of the hand's barycenter. **Speech:** intensity, pitch, MFCC, Bark spectral bands, voiced segment characteristics, and pause length (377 features in total) | Anger, despair, interest, pleasure, sadness, irritation, joy and pride | Posed | FLF: 78.3%; DLF: 74.6% |
| Panning et al. [10] | Visual (face and body) and audio | PCA+MLP | **Face:** eye blinks per minute, mouth deformations, eyebrow actions. **Body:** touch hand to face. **Speech:** 36 features (12 MFCCs, their deltas and accelerations, and the zero-mean coefficient) | Frustration (binary) | Natural | FLF: 40–90% |
| Busso et al. [7] | Visual (face) and audio | SVM | **Face:** four-dimensional feature vectors. **Speech:** mean, standard deviation, range, maximum, minimum, and median of pitch and intensity | Anger, sadness, happiness, neutral | Posed | FLF: 89.1%; DLF: 89.0% |
| Kapoor et al. [123] | Visual (face, posture) and physiological | GP | **Face:** nods and shakes, eye blinks, mouth activities, shape of eyes and eyebrows. **Posture:** pressure matrices (on chair while seated). **Physiological:** skin conductance. **Behavioral:** pressure on mouse | Frustration | Natural | FLF: 79% |
| Soleymani et al. [12] | Physiological + eye gaze | SVM (RBF Kernel) | **Physiological:** 20 GSR, 63 ECG, 14 respiration, 4 skin temperature, and 216 EEG features. **Eye gaze:** pupil diameter, gaze distance, gaze coordinates | Arousal and valence | Induced | DLF: 72% |

\*FLF: Feature-Level Fusion, DLF: Decision-Level Fusion, HF: Hybrid Fusion. \*\*HMM: Hidden Markov Model, C-HMM: Coupled HMM, SC-HMM: Semi-Coupled HMM, EWSC-HMM: Error-Weighted SC-HMM, SVR: Support Vector Regression, LDF: Linear Discrimination Function, NN: Neural Networks, GP: Gaussian Process, MGP: Mixture of Gaussian Processes, MLP: Multilayer Perceptron, BN: Bayesian Network, NB: Naïve Bayes, ELM: Extreme Learning Machine.

**Table 2.** Representative multimodal affect-recognition studies.

Three classes of modality-fusion techniques are commonly employed: feature-level, decision-level, and hybrid fusion. The literature reports somewhat conflicting results concerning which class is most effective. For instance, Kapoor and Picard [9] obtain better results with feature-level fusion, whereas Busso et al. [7] find no discernible difference between feature-level and decision-level fusion. Beyond these two approaches, Lin et al. [27] propose three hybrid approaches that use coupled HMM, semi-coupled HMM, and error-weighted semi-coupled HMM based on a Bayesian classifier-weighting method. Their results show improvements over feature- and decision-level fusion for posed and induced emotional databases. However, Kim et al. [104] were not able to improve over decision-level fusion with their proposed hybrid approach. The presence of confounding variables, such as the modalities, emotions, classification technique, feature-selection and feature-reduction approaches, and datasets used, limits the value of comparing fusion results across studies. Consequently, Lingenfelser et al. [95] conducted a systematic study of several feature-level, decision-level, and hybrid-fusion techniques for multimodal affect detection. They were not able to find clear advantages of one technique over another.
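As an illustration of the difference between the first two classes, the sketch below contrasts feature-level fusion (concatenating modality features before a single classifier) with decision-level fusion (training one classifier per modality and averaging their posterior probabilities). The synthetic face and speech features, the SVM classifiers, and the averaging rule are assumptions chosen for brevity; none of the surveyed systems is reproduced here.

```python
# Hedged sketch of feature-level vs. decision-level fusion on synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
X_face = rng.normal(size=(n, 30))    # placeholder facial features
X_speech = rng.normal(size=(n, 12))  # placeholder prosodic features
y = rng.integers(0, 2, size=n)       # binary affect label (e.g., frustration or not)

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Feature-level fusion: a single classifier over the concatenated feature vector.
X_all = np.hstack([X_face, X_speech])
flf = SVC(probability=True).fit(X_all[idx_train], y[idx_train])
flf_acc = flf.score(X_all[idx_test], y[idx_test])

# Decision-level fusion: one classifier per modality, posteriors averaged.
clf_face = SVC(probability=True).fit(X_face[idx_train], y[idx_train])
clf_speech = SVC(probability=True).fit(X_speech[idx_train], y[idx_train])
proba = (clf_face.predict_proba(X_face[idx_test]) +
         clf_speech.predict_proba(X_speech[idx_test])) / 2
dlf_acc = np.mean(proba.argmax(axis=1) == y[idx_test])

print(f"FLF accuracy: {flf_acc:.2f}, DLF accuracy: {dlf_acc:.2f}")
```

Hybrid schemes such as those of Lin et al. [27] interleave these two ideas at the model level (e.g., coupled HMMs with classifier weighting), which requires machinery beyond this simple sketch.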

Various affect-classification methods are employed. For dynamic classification, where the evolving nature of an observed phenomenon is classified, HMM is the prevalent choice of classifier [27]. For static classification, researchers use a variety of classifiers, and we were not able to discern any clear advantage of one over another. However, an empirical study of unimodal affect recognition from physiological features found an advantage for SVM over *k*-nearest neighbor, regression tree, and Bayesian network classifiers [122]. A systematic investigation of the effectiveness of classifiers for multimodal affect recognition is still needed to settle the issue.
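The kind of empirical comparison reported in [122] can be emulated with a simple cross-validation loop over several static classifiers, as in the hypothetical sketch below. The random feature matrix and the particular classifier settings are placeholders; Gaussian naïve Bayes stands in loosely for a Bayesian classifier and a decision tree for a regression tree, so no conclusion about real affective data should be drawn from the output.

```python
# Hedged sketch: cross-validated comparison of static classifiers on synthetic features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 40))    # placeholder physiological feature vectors
y = rng.integers(0, 3, size=150)  # three hypothetical affect classes

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:>13}: {scores.mean():.2f} +/- {scores.std():.2f}")
```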

The database type seems to have an effect on the overall affect-recognition rate. We notice that studies using posed databases generally achieve higher accuracy than those using other types (e.g., [7, 27]). In fact, Lin et al. [27] compare recognition rates obtained with the same methods on two database types, posed and induced, and achieve significantly better results with the posed database. Natural databases typically yield lower recognition rates (e.g., [10, 101, 106, 121]), with the exception of studies [9, 123] that classify a single affect.
