
48 Emotion and Attention Recognition Based on Biological Signals and Images

**1. Introduction**

Theoretical models based on electrophysiological studies have indicated early and late neurophysiological markers that index the online perception of vocal emotion expressions in speech, as well as other higher‐order socioemotive expressions (e.g., confidence, sarcasm, sincerity), which roughly correspond to each hypothesized processing stage [1, 2]. Studies with event‐related potentials (ERPs), which focus on the analysis of the averaged electrophysiological response to a particular vocal or speech event, have illuminated these neurocognitive processes at a fine‐grained temporal scale. The early fronto‐central auditory N1 is associated with a wide range of auditory stimulus types and serves as a measure of sensory‐perceptual processing. In vocal emotion processing, the N1 has been linked to the extraction of acoustic cues that differentiate types of vocal signals, such as frequency and intensity parameters [3, 4], and is unaffected by differences in emotional meaning. The fronto‐central P200 has been associated with early attentional allocation or relevance evaluation of vocal signals [2, 5], ensuring preferential processing of emotional stimuli. Differentiation of P200 amplitude can be found between basic emotions [6] or between emotional and neutral speech [3, 7], suggesting that this component may reflect an early function of "tagging" emotionally or motivationally relevant stimuli. The P200 tends to be associated with a higher mean and range of f0, a larger mean and range of speech amplitude, and a slower speech rate [6], implying that early P200 modulation is partially explained by early meaning encoding as well as continued sensory processing [8]. A late centro‐parietal positivity (also named the LPC) evoked by vocal emotion expressions has been defined as a positive‐going wave starting about 500 ms after onset of the vocal stimulus and possibly sustained until 1200 ms, depending on stimulus features.

The LPC is considered to reflect a continued, second‐pass evaluative process of the meaning of vocal emotional signals [2, 5]. The LPC was larger for emotional vocal stimuli, with larger differences in LPC amplitude among basic emotion types [6], suggesting more elaborative processing of vocal information at this stage. In addition to these ERP effects, a more delayed sustained positivity may reflect a listener's attempt to infer the goal of a speaker, especially when an expected way of speaking is mismatched in an utterance context [9]. These ERP components provide a useful tool for examining the temporal neural dynamics of emotional decoding in voice and speech.
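To make the component measures above concrete, the following is a minimal sketch, in plain Python, of how an ERP is obtained by averaging single-trial epochs and how mean amplitude is then measured in N1, P200, and LPC time windows. The sampling rate, baseline length, and exact window bounds are illustrative assumptions, not values taken from the studies cited above.

```python
# Hypothetical sketch of ERP component measurement: average single-trial
# epochs into an ERP, then take the mean voltage in canonical time windows.
# FS, BASELINE, and the window bounds are illustrative assumptions.

FS = 500        # sampling rate in Hz (assumed)
BASELINE = 0.2  # pre-stimulus baseline per epoch, in seconds (assumed)

def to_sample(latency_s):
    """Map a post-stimulus latency in seconds to an epoch sample index."""
    return round((BASELINE + latency_s) * FS)

def erp_component_means(epochs):
    """epochs: list of equal-length voltage lists, one per trial.
    Returns the mean ERP amplitude in the N1, P200, and LPC windows."""
    n = len(epochs)
    # Grand average across trials, sample by sample.
    erp = [sum(trial[i] for trial in epochs) / n for i in range(len(epochs[0]))]
    windows = {"N1": (0.08, 0.12), "P200": (0.15, 0.25), "LPC": (0.50, 1.20)}
    out = {}
    for name, (a, b) in windows.items():
        seg = erp[to_sample(a):to_sample(b)]
        out[name] = sum(seg) / len(seg)
    return out
```

In practice such measurements are done with dedicated toolboxes (e.g., averaging and windowed amplitude extraction in EEG analysis software), but the underlying computation is the simple trial average shown here.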

**2. Neurophysiological studies on basic vocal emotion in speech and voice**

Vocal emotion has been investigated mainly in vocalization and speech. A study compared the ERP responses to the perception of three basic emotions (happiness, sadness, and anger) in vocalization vs. pseudo‐speech (the same as real speech except that the lexical‐semantic contents were replaced by meaningless syllables [10, 11]) in a task in which listeners were presented with emotional vocal expressions followed by emotional and neutral faces and were asked to judge the emotionality of the face. Pell et al. [11] showed that vocalization and speech can be differentiated very early, at about 100 ms. Vocalization elicited a larger, earlier, and more differentiated P200 between emotions, and a stronger and earlier late‐positivity effect. These findings support a preferential decoding in the neurophysiological system of vocalization over speech‐embedded emotions in the human voice. They also demonstrated that the angry voice elicited a stronger P200 than the other expressions. In another study, in which angry, happy, and neutral vocalizations were compared, anger elicited a stronger positivity within the first 50 ms, while both anger and happiness elicited a reduced N100 and an increased P200 compared with neutral vocalization [7]. Taken together, these findings suggest an early sensory registration of emotional information, which is assigned increased relevance or motivational significance in the decoding of human vocalization.

Earlier ERP work has focused on how the brain responds to emotional transitions in the voice, and to simultaneous transitions in both voice and lexico‐semantics [13]. Using a cross‐splicing technique, the leading phrase of a sentence was cross‐spliced with the main stem of a sentence either congruent or incongruent in prosody with the leading phrase. The onset of the cross‐splicing point in the main sentence elicited a larger negativity (350–550 ms) for a mismatch in both voice and lexico‐semantics, and a larger, more right‐hemispherically distributed positivity (600–950 ms) for a mismatch in voice only (pseudo‐utterances: [3]; utterances with no emotional lexical items: [1]). The negativity suggests an effort to integrate the emotional information in both the vocal and semantic channels with the context; the late positivity suggests detection of acoustic variation in the vocal expression.
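The cross‐splicing manipulation described above amounts to joining the leading phrase of one recording to the main stem of another at a chosen splice point. The following is a minimal sketch in plain Python; the function name, the short linear crossfade (used to avoid an audible click at the splice), and all parameters are illustrative assumptions rather than the procedure of the cited studies.

```python
# Hypothetical sketch of cross-splicing two utterance waveforms
# (represented here as plain lists of samples) at a splice point,
# with a short linear crossfade across the boundary.

def cross_splice(lead_utterance, stem_utterance, splice_idx, fade=8):
    """Return lead_utterance[:splice_idx] followed by
    stem_utterance[splice_idx:], crossfaded over `fade` samples."""
    out = list(lead_utterance[:splice_idx])
    for k in range(fade):
        w = (k + 1) / (fade + 1)  # ramp weight from lead to stem
        i = splice_idx + k
        out.append((1 - w) * lead_utterance[i] + w * stem_utterance[i])
    out.extend(stem_utterance[splice_idx + fade:])
    return out
```

The ERP of interest is then time‐locked to `splice_idx`, the moment at which the prosodic (and possibly lexico‐semantic) context changes.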

Some evidence has further delineated the role of specific acoustic features in the ERP responses during vocal emotion decoding. For example, one EEG study compared the ERPs for mismatching emotional prosody (a statement in a neutral voice that was disrupted by an angry voice) with those for matching prosody, revealing an increased N2/P3 complex for the mismatch [12]. The amplitude of the N2/P3 complex was reduced, and its latency delayed, when the intensity of the disrupting prosody was weakened. This finding suggests that the emotional significance of the voice can be enhanced by increased sound intensity. The role of specific acoustic profiles, such as the loudness of a sound, needs to be specified in vocal‐emotion studies.
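The intensity manipulation discussed above, weakening or strengthening the loudness of a vocal stimulus, is typically implemented as a uniform gain change expressed in decibels. The following is a minimal sketch in plain Python; the function names and the dB convention (20·log10 for amplitude) are standard, but their use here is an illustrative assumption, not the procedure of the cited study.

```python
import math

# Hypothetical sketch of an intensity manipulation: scale a vocal
# signal (a list of samples) by a gain expressed in decibels,
# e.g. a negative delta_db to weaken the loudness of an angry prosody.

def rms(signal):
    """Root-mean-square amplitude of the signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def change_intensity(signal, delta_db):
    """Scale the signal by delta_db decibels (negative = attenuate)."""
    gain = 10 ** (delta_db / 20.0)
    return [x * gain for x in signal]
```

A -6 dB change, for instance, roughly halves the amplitude of the signal while leaving its spectral and temporal structure untouched, which is what makes intensity a convenient acoustic parameter to manipulate in isolation.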
