(hats, glasses, etc.). Great examples are the MMI [45] and Multi-PIE [72] databases, which were some of the first well-known ones to use multiple view angles. To increase the accuracy of human expression analysis models, databases like FABO [22] have expanded the frame from a portrait to the entire upper body.

Static databases are the oldest and most common type. It is therefore understandable that they were created with the most diverse of goals, varying from expression perception [29] to neuropsychological research [73], and have a wide range of data-gathering styles, including self-photography through a semi-reflective mirror [74] and occlusion and light-angle variation [75]. Static databases usually have the largest number of participants and a bigger sample size. While it is relatively easy to find a database suited for the task at hand, the categories of emotions are quite limited, as static databases only focus on the six primary emotions or smile/neutral detection. In the future, it would be convenient to have databases with more emotions, especially spontaneous or induced ones, because, as can be seen in Table 2, almost all static databases to date are posed.

4.2. Video databases

The most convenient format for capturing induced and spontaneous emotions is video, since non-posed emotions lack clear start and end points [93]. In the case of RGB video, the subtle emotional changes known as micro-expressions have also been recorded in the hope of detecting concealed emotions, as in the USF-HD [94], YorkDDT [95], SMIC [34], CASME [96] and Polikovsky's [55] databases, the newest and most extensive among those being CASME.

The posed video databases in Table 3 tend to be quite small in the number of participants, usually around 10, and often professional actors have been used. Unlike with still images, researchers have tried to benefit from voice, speech or other utterances for emotion recognition. Many databases have also tried to capture micro-expressions, as these do not show up in still images or are harder to catch. Posed video databases have mainly focused on the six primary emotions and a neutral expression.

Media-induced databases, as in Table 4, have a larger number of participants, and the emotions are usually induced by audiovisual media, such as Super Bowl ads [107]. Because the emotions in these databases are induced via external means, this format is well suited for gathering fake [108] or hidden [34] emotions.

Interaction-induced video databases, listed in Table 5, use more unique ways of gathering data, such as child-robot interaction [23] or reviewing past memories [36]. This type of database takes significantly longer to create [113], but this does not seem to affect the sample size. Almost all of the spontaneous databases consist of video from other media sources, purely because of how difficult spontaneous emotions are to collect. Spontaneous databases are also among the rarest compared to other elicitation methods. This is reflected in Table 6, which has the lowest number of databases among the different elicitation methods.


Table 2. Posed static databases.

| Database | Year | Participants | Additional information |
|---|---|---|---|
| JACFEE [76] | 1988 | 4 | Eight images of each emotion |
| POFA (or PFA) [73] | 1993 | 14 | Cross-cultural studies and neuropsychological research (perception, attention, emotion, memory and backward masking) |
| AT-T Database for Faces (formerly ORL) [77, 78] | 1994 | 40 | Dark homogeneous background, frontal face |
| Yale [75] | 1997 | 15 | Frontal face, different light angles, occlusions |
| FERET [79] | 1998 | 1199 | Standard for face recognition algorithms |
| KDEF [80] | 1998 | 70 | Psychological and medical research |
| The AR Face Database [81] | 1998 | 126 | Frontal face, different light angles, occlusions |
| The Japanese Female Facial Expression Database [74] | 1998 | 10 | Subjects photographed themselves through a semi-reflective mirror |
| MSFDE [82] | 2000 | 12 | FACS coding, ethnical diversity |
| CAFE Database [83] | 2001 | 24 | FACS coding, ethnical diversity |
| CMU PIE [84] | 2002 | 68 | Illumination variation, varying poses |
| Indian Face Database [85] | 2002 | 40 | Indian participants from seven view angles |
| NimStim Face Stimulus Set [86] | 2002 | 70 | Facial expressions were supervised |
| KFDB [87] | 2003 | 1920 | Includes ground truth for facial landmarks |
| PAL Face Database [88] | 2004 | 576 | Wide age range |
| UT DALLAS [33] | 2005 | 284 | Head and face detection, emotions induced using audiovisual media |
| TFEID [89] | 2007 | 40 | Taiwanese actors, two simultaneous angles |
| CAS-PEAL [90] | 2008 | 1040 | Chinese face detection |
| Multi-PIE [72] | 2008 | 337 | Multiple view angles, illumination variation |
| PUT [91] | 2008 | 100 | High-resolution head-pose database |
| Radboud Faces Database [92] | 2008 | 67 | Supervised by FACS specialists |
| FACES database [29] | 2010 | 154 | Expression perception, wide age range, evaluated by participants |
| iCV-MEFED [52] | 2017 | 115 | Psychologists picked the best from 5 |

Review on Emotion Recognition Databases http://dx.doi.org/10.5772/intechopen.72748 47

4.3. Miscellaneous databases

Apart from the formats mentioned above, 3D scanned and even thermal databases of different emotions have also been constructed. The most well-known 3D datasets are the BU-3DFE [15], BU-4DFE [16], Bosphorus [14] and BP4D [17]. BU-3DFE and BU-4DFE both contain posed datasets with six expressions, the latter having a higher resolution. Bosphorus tries to address the issue of having a wider selection of facial expressions, and BP4D is the only one among the four using induced expressions instead of posed ones. A sample of models from a 3D database can be seen in Figure 8.







46 Human-Robot Interaction - Theory and Application



Table 3. Posed video databases.

| Database | Year | Participants | Additional information |
|---|---|---|---|
| University of Maryland DB [97] | 1997 | 40 | 1–3 expressions per clip |
| CK [27] | 2000 | 97 | One of the first FE databases made public |
| Chen-Huang [28] | 2000 | 100 | Facial expressions and speech |
| DaFEx [98] | 2004 | 8 | Italian actors mimicked emotions while uttering different sentences |
| Mind Reading [99] | 2004 | 6 | Teaching tool for children with behavioural disabilities |
| GEMEP [31] | 2006 | 10 | Professional actors, supervised |
| AONE [100] | 2007 | 75 | Asian adults |
| FABO [22] | 2007 | 4 | Face and upper-body |
| IEMOCAP [101] | 2008 | 10 | Markers on face, head, hands |
| RML [54] | 2008 | 8 | Suppressed emotions |
| Polikovsky's database [55] | 2009 | 10 | Low intensity micro-expressions |
| SAVEE [102] | 2009 | 4 | Blue markers, three images per emotion |
| STOIC [103] | 2009 | 10 | Face recognition, discerning gender, contains still images |
| YorkDDT [95] | 2009 | 9 | Micro-expressions |
| ADFES [104] | 2011 | 22 | Frontal and turned facial expressions |
| USF-HD [94] | 2011 | 16 | Micro-expressions, mimicked shown expressions |
| CASME [96] | 2013 | 35 | Micro-expressions, suppressed emotions |

With RGB-D databases, however, it is important to note that the data is unique to each sensor, with outputs varying in density and error, so algorithms trained on databases like the IIIT-D RGB-D [115], VAP RGB-D [116] and KinectFaceDB [117] would be very susceptible to hardware changes. For comparison with the 3D databases, an RGB-D sample is provided in Figure 9. One of the newer databases, the iCV SASE [118] database, is an RGB-D dataset solely dedicated to head pose with free facial expressions.


Table 4. Media induced video databases.

| Database | Year | Participants | Elicitation | Additional information |
|---|---|---|---|---|
| IAPS [105] | 1997 | 497–1483 | Visual media | Pleasure and arousal reaction images, subset for children |
| SD [32] | 2004 | 28 | AVM<sup>1</sup> | One of the first international induced emotion data-sets |
| eNTERFACE'05 [46] | 2006 | 42 | Auditory media | |
| CK+ [44] | 2010 | 220 | Posed and AVM | Updated version of CK |
| SMIC [34] | 2011 | 6 | AVM | Suppressed emotions |
| Face Place [106] | 2012 | 235 | AVM | Different ethnicities |
| AM-FED [107] | 2013 | 81–240 | AVM | Reactions to Super Bowl ads |
| MAHNOB [51] | 2013 | 22 | Posed and AVM | Laughter recognition research |
| SASE-FE [108] | 2017 | 54 | AVM | Fake emotions |

<sup>1</sup>Audiovisual media.


Even though depth-based databases, like the ones in Table 7, are relatively new compared to other types and there are very few of them, they still manage to cover a wide range of different emotions. With the release of commercial depth cameras like the Microsoft Kinect [120], they will only continue to become more popular in the future.



As their applications are more specific, thermal facial expression datasets are very scarce. Some of the first and better-known ones are IRIS [123] and Equinox [121, 122], which consist of RGB and thermal image pairs labelled with three emotions [124], as can be seen in Figure 10. Thermal databases are usually posed or induced by audiovisual media. The ones in Table 8 mostly focus on positive, negative, neutral and the six primary emotions. The average number of participants is quite high relative to other types of databases.

4.3.1. Audio databases


There are mainly two types of emotion databases that contain audio content: stand-alone audio databases and video databases that include spoken words or utterances. The information extracted from audio is called context and can be divided into several categories, of which the three most important for emotion recognition databases are the semantic, structural and temporal ones.

Semantic context is where the emotion can be isolated through specific emotionally marked words, while structural context depends on the stress patterns and syntactic structure of longer phrases. Temporal context is the longer-lasting variant of structural context, as it involves the change of emotion in speech over time, like emotional build-up [42].



Figure 8. 3D facial expression samples from the BU-3DFE database [15].

Figure 9. RGB-D facial expression samples from the KinectFaceDB database [117].


Table 5. Interaction induced video databases.

| Database | Year | Participants | Elicitation | Additional information |
|---|---|---|---|---|
| ISL meeting corpus [35] | 2002 | 90 | Human-human interaction | Collected in a meeting fashion |
| AIBO database [23] | 2004 | 30 | Child-robot interaction | Robot instructed by children |
| AAI [36] | 2004 | 60 | Human-human interaction | Induced via past memories |
| RU-FACS [109] | 2005 | 90 | Human-human interaction | Honesty research |
| SAL [11] | 2005 | 24 | Human-computer interaction | Operator was thoroughly familiar with SAL script |
| CSC corpus [37] | 2005 | 32 | Human-human interaction | Subjects were all university students |
| MMI [45] | 2006 | 61/29 | Posed/child-comedian interaction, adult audiovisual media | Profile views along with portrait images |
| TUM AVIC [53] | 2007 | 21 | Human-human interaction | Commercial presentation |
| SEMAINE [110, 111] | 2010/2012 | 150 | Human-human interaction | Conversations held with a simulated "chat-bot" system |
| AVDLC [12] | 2013 | 292 | Human-computer interaction | Mood disorder and unipolar depression research |
| RECOLA [112] | 2013 | 46 | Human-human interaction | Collaborative tasks; audio, video, ECG and EDA were recorded |

Table 6. Spontaneous video databases.

| Database | Year | Participants | Additional information |
|---|---|---|---|
| Belfast Naturalistic Emotional Database [114] | 2003 | 125 | Studio recordings and television program clips |
| Belfast natural database [42] | 2003 | 125 | Video clips from television and interviews |
| VAM [43] | 2008 | 47 | Video clips from a talk-show |
| AFEW [39, 40] | 2011/2012 | 330 | Video clips from movies |
| Spanish Multimodal Opinion [41] | 2013 | 105 | Spanish video clips from YouTube |

In the case of multimodal data, the audio component can provide a semantic context, which can have a larger bearing on the emotion than the facial expressions themselves [11, 23]. However, in the case of audio-only data, like the Bank and Stock Service [126] and ACC [127] databases, the context of the speech plays a quintessential role in emotion recognition [128, 129].

The audio databases in Table 9 are very scarce and tailored to specific needs, like the Banse-Scherer [26], which has only four participants and was gathered to see whether judges can deduce emotions from vocal cues. The easiest way to gather a larger amount of audio data is from call centres, where the emotions are elicited either by another person or a computer program.

Even with all of the readily available databases out there, there is still a need for creating self-collected databases for emotion recognition, as the existing ones do not always fulfil all of the criteria [130–133].


Table 7. 3D and RGB-D databases.

| Database | Year | Participants | Format | Additional information |
|---|---|---|---|---|
| BU-3DFE [15] | 2006 | 100 | 3D images | Ethnically diverse, two angled views |
| BU-4DFE [16] | 2008 | 101 | 3D videos | Newer version of BU-3DFE |
| Bosphorus [14] | 2008 | 105 | 3D images | Occlusions, less ethnic diversity than BU-3DFE |
| VAP RGB-D [116] | 2012 | 31 | RGB-D videos | 17 different recorded states repeated 3 times for each person |
| PICS [119] | 2013 | — | Images, videos, 3D images | Includes several different datasets and is still ongoing |
| BP4D [17] | 2014 | 41 | 3D videos | Human-human interaction |
| IIIT-D RGB-D [115] | 2014 | 106 | RGB-D images | Captured with Kinect |
| KinectFaceDB [117] | 2014 | 52 | RGB-D images, videos | Captured with Kinect, varying occlusions |

Table 8. Thermal databases.

| Database | Year | Participants | Elicitation | Additional information |
|---|---|---|---|---|
| Equinox [121, 122] | 2002 | 340 | Posed | Captured in SWIR, MWIR and LWIR |
| IRIS [123] | 2007 | 4228 | Posed | Some of the first thermal FE data-sets |
| NVIE [47] | 2010 | 215 | Posed and AVM<sup>1</sup> | Spontaneous expressions are not present for every subject |
| KTFE [125] | 2014 | 26 | Posed and AVM | |

<sup>1</sup>Audiovisual media.

Figure 10. Thermal images taken from the Equinox database [121, 122].

Table 9. Audio databases.

| Database | Year | Participants | Elicitation | Additional information |
|---|---|---|---|---|
| Banse-Scherer [26] | 1996 | 4 | Posed | Vocally expressed emotions |
| Bank and Stock Service [126] | 2004 | 350 | Human-human interaction | Collected from a call center and Capital Bank Service Center |
| ACC [127] | 2005 | 1187 | Human-computer interaction | Collected from automated call center applications |

5. Conclusion


With the rapid increase of computing power and the size of data, it has become ever more feasible to distinguish emotions, identify people, and verify honesty based on video, audio or image input, taking a large step forward not only in human-computer interaction, but also in mental illness detection, medical research, security and so forth. In this paper, an overview of existing face databases in varying categories has been given. They have been organised into tables to give the reader an easy way to find the necessary data. This paper should be a good starting point for anyone who considers training a model for emotion recognition.


