**1. Introduction**

Turn-taking is an indispensable part of spontaneous and authentic human communication. Despite its significance, it is not always obvious and straightforward; rather, it is often conveyed by elusive and subtle cues. These cues can be verbal or non-verbal in nature, but, in successful communication, all of them can be picked up by a human observer. To facilitate effective natural communication between machines and humans, significant effort must be put into understanding and recognizing the inter-dynamics and intent of non-verbal communication, of which turn-taking is also a part.

The theory of dialog acts offers one possible way to gain insight into the functionality of verbal and non-verbal expressions of communication. Dialog act (hereinafter DA) theory has its origins in speech act theory [1, 2]. But despite its name, DA theory is not merely a theoretical concept. As Bunt [3] emphasizes, its goal is to provide a computational model of language in actual use. According to Searle [2], a DA represents the meaning of an utterance at the level of illocutionary force, and hence, it constitutes the basic unit of linguistic communication.

There are numerous DA annotation schemes, some of which are more purpose-specific, such as the Verbmobil scheme, which is based on business appointment-scheduling dialogs [4], the TRAINS scheme, annotating dialogs about train freight management [5], or the Coconut annotation scheme, with dialogs about buying dining or living room furniture [6], while the ISO 24617-2, the DIT++, the DAMSL and the Switchboard annotation schemes, for example, cover various topics and apply to a wider range of material. The Switchboard scheme was created for a corpus of authentic, spontaneous telephone calls in the United States and defined 42 types of DAs [7]. The DAMSL scheme, moreover, filled the need for applying multiple tags to a single segment [8] and was the first multidimensional scheme [3]. The concept of dimensions is best described by the ISO 24617-2 annotation scheme, where a dimension is defined as a "class of DAs with the same type of semantic content" ([9]: 2). One-dimensional schemes, in comparison, also use several tags, but these are mutually exclusive. Multidimensional schemes are, therefore, more appropriate for the annotation of naturally occurring dialogs. Another example of a multidimensional scheme is the DIT++ annotation scheme, which is partly based on the DAMSL scheme. It distinguishes between general-purpose and dimension-specific functions, which together form a set of ten dimensions: Task/Activity, Auto-Feedback, Allo-Feedback, Turn Management, Time Management, Contact Management, Own Communication Management, Partner Communication Management, Discourse Structuring, and Social Obligations Management [3]. Furthermore, the DIT++ is not limited to verbal communication; it also considers non-verbal communication, such as head gestures and prosody. The ISO 24617-2 annotation scheme is partially based on the DIT++ taxonomy.
As Bunt [10] elaborates, it was created as a consolidation of selected taxonomies with the aim of avoiding confusion among the several existing annotation schemes and their inconsistent terminology [9]. Moreover, in addition to its multidimensionality, the ISO scheme strives to be domain-independent. Regarding dimensions, it contains functionally the same dimensions as the DIT++ with the exception of the Contact Management dimension, which is not included in the ISO 24617-2. Across these nine dimensions, the scheme specifies 57 different functions. Six of these functions pertain to the dimension of Turn Management, namely, the functions of accepting, taking, grabbing, assigning, releasing, and keeping a turn. The functions are relatively self-explanatory as long as we remember that a function is always carried out by the sender, i.e., the "dialogue participant who produces a dialog act" ([9]: 4). The functions of turn management are all dimension-specific, which means that they cannot be assigned to any other dimension. The scheme also acknowledges the need to capture subtle characteristics of utterances such as conditionality, modality, (un)certainty, stance, and sentiment, which Petukhova and Bunt [11] raised in their analysis of existing annotation schemes. As a solution, the ISO 24617-2 proposes function qualifiers that can be applied to a DA function. Following its predecessor, the DIT++, the ISO 24617-2 also considers non-verbal behavior in terms of DA annotation. After all, in its definition of DAs, the ISO 24617-2 does not discriminate between verbal and non-verbal behavior, since it defines DAs as "a semantic unit of communicative behaviour".
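The core notions above (dimension, communicative function, sender, and function qualifiers) can be made concrete with a small sketch. Note that the standard itself specifies DiAML, an XML-based annotation language; the class and value names below are a simplified, illustrative encoding, not the normative vocabulary.

```python
from dataclasses import dataclass, field

# The six dimension-specific Turn Management functions of ISO 24617-2
# (names here are illustrative camel-case labels).
TURN_MANAGEMENT_FUNCTIONS = {
    "turnAccept", "turnTake", "turnGrab",
    "turnAssign", "turnRelease", "turnKeep",
}

@dataclass
class DialogAct:
    sender: str       # the dialogue participant who produces the dialog act
    dimension: str    # e.g., "turnManagement", "task", "autoFeedback"
    function: str     # communicative function, e.g., "turnTake", "inform"
    qualifiers: dict = field(default_factory=dict)  # e.g., {"certainty": "uncertain"}

    def is_turn_management(self) -> bool:
        # Turn management functions are dimension-specific: they cannot
        # be assigned to any other dimension.
        return self.function in TURN_MANAGEMENT_FUNCTIONS

da = DialogAct(sender="A", dimension="turnManagement", function="turnTake")
print(da.is_turn_management())  # prints True
```

The qualifiers dictionary corresponds to the function qualifiers that the standard allows to be attached to a DA function, e.g., marking (un)certainty or sentiment.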


*Can Turn-Taking Highlight the Nature of Non-Verbal Behavior: A Case Study*


*DOI: http://dx.doi.org/10.5772/intechopen.95516*


*Types of Nonverbal Communication*


Hence, the ISO 24617-2 is well-suited for the annotation of multimodal material and has been employed in research on non-verbal behavior. Yoshino et al. [12] utilized the scheme to annotate information navigation and attentive listening dialogs to improve natural conversation modeling for caretakers who communicate with the elderly. Navarretta and Paggio [13] explore the non-verbal behavior that occurs when providing feedback among persons who have just met, i.e., in highly spontaneous settings. Using one-hour recordings annotated with the tool Anvil, they specifically analyze what kind of head movements or facial expressions accompany a certain subtype of the feedback dimension. Their classification of non-verbal behavior is based on the MUMIN scheme. Petukhova and Bunt [14] utilize an almost hour-long recording from the AMI corpus, which consists of project meetings. They analyze DAs according to the DIT++ and the ISO 24617-2 schemes together with co-occurring non-verbal behavior, which is classified according to the CoGest scheme. In their previous work, Petukhova and Bunt [15] annotate recordings from the AMI corpus according to the DIT scheme. Both the annotation of DAs and the annotation of non-verbal behavior are carried out with the DIT scheme, since, as they emphasize, non-verbal behavior helps us understand the true function of a DA. The pragmatic annotation of the multimodal corpus HuComTech [16], however, is not based on the ISO scheme, yet its main annotation units are very similar. They are referred to as communicative acts, which denote the function or purpose of an utterance (e.g., agreement, turn management, information). The annotation of non-verbal behavior, including facial expressions, eyebrow movement, head movement, touch motions, posture, and emotions, was performed manually and partially automatically with the tool Qannot.

Although there seems to be strong evidence to support the multimodal and multi-signal nature of human-human interaction, for decades spoken language understanding has first and foremost focused on speech [17]. The classification of non-verbal behavior by Mlakar et al. [18] draws upon several strands of work: McNeill's [19] growth point theory, according to which speech and gestures both stem from a common growth point of a concept and mutually influence one another; Peirce's [20] semiotics, which provides an analysis of non-linguistic signs and symbols as carriers of the meaning of non-verbal behavior; Ekman and Friesen's [21] categories and coding of non-verbal behavior; and Birdwhistell's [22] insights into the importance of kinesics. Moreover, the classification by [18] utilizes communication management theory [23, 24] and, therefore, also encompasses discourse functions to some extent. Mlakar et al. [18] refer to 'gestures', i.e., behavior generated by moving body parts (head, hands/arms, face, and posture) that serves a communicative purpose and thus contains a discourse function, as non-verbal communication intent (hereinafter NCI). These non-verbal expressions represent the basis of cognitive capabilities and understanding [25]. Namely, although not bound by grammar, non-verbal expressions co-align with language structures and compensate for less articulated verbal expression, thus providing a certain degree of clarity to the discourse [26]. Non-verbal behavior retains the semantics and, at the same time, provides suggestive influences and serves interactive purposes, such as expressing one's mental state, attitude, and social functions. The classification proposed by Mlakar et al. [18] sorts the role/intent of non-verbal concepts into five main NCI classes: regulators or adapters, deictics or pointers, illustrators, symbols or emblems, and batons.

Cooperrider's [27] classification of gestures, on the other hand, concerns itself with the question of whether a gesture "communicates a critical part of a message" ([27]: 179) or not. He divides gestures into foreground and background gestures. Foreground gestures are those we are aware of when we perform them, such as a thumbs-up, whereas background gestures occur unconsciously and automatically, such as nodding during a telephone call. Foreground gestures are, therefore, also in the foreground of the interaction. Among their characteristics, he lists co-occurrence with demonstratives, absence of speech, and significant effort in their production, i.e., such gestures are bigger and more precise. Background gestures, by contrast, are smaller in both size and precision and occur while the sender is speaking. Despite this clear division, Cooperrider [27] emphasizes that the line between foreground and background gestures is anything but straightforward, as some gestures can break the foreground-background barrier. He demonstrates this with pointing gestures, which are generally in the foreground but occur in the background when one points to oneself. Furthermore, even symbolic gestures can recede into the background if performed automatically and voided of their communicative message. Beats, on the other hand, occur only as background gestures. One can, therefore, roughly consider illustrators, symbols, and partially deictics as NCI occurring in the foreground, and regulators, beats, and partially deictics as NCI occurring in the background, while bearing in mind that the dividing line can always be crossed.

Hence, Cooperrider [27] differentiates between gestures with semantic or propositional content, i.e., a message that provides some kind of information, and those that are void of it. The same distinction can be made for DAs. There are DAs that primarily convey information indispensable for communication, such as those in the task dimension, and DAs that carry little propositional content (rather, they carry metadiscursive content) yet are vital for successful natural communication, such as those in the turn and time management dimensions. Nevertheless, we must apply the same caveat as in the foreground-background distinction for gestures, as some DAs can occur either in the foreground or in the background. For example, the dimension of managing social obligations can generally be considered part of the foreground, such as greeting someone upon a first encounter. Still, if a social convention is performed routinely and unconsciously and is deprived of its semantic content, such as thanking someone for the floor, such a DA can be considered as occurring in the background. The nine DA dimensions can, therefore, roughly be divided into those occurring in the foreground, such as the task and the social obligations management dimensions, and those occurring in the background, such as the feedback dimensions, the time and turn management dimensions, the discourse structuring dimension, and the own- and partner communication management dimensions.
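This rough two-way division can be made explicit. The following is a hypothetical encoding, assuming the NCI classes of [18] and the ISO 24617-2 dimensions; all labels are illustrative, and "partial" marks classes that can cross the dividing line:

```python
# Rough foreground/background natures of the five NCI classes.
NCI_NATURE = {
    "illustrator": "foreground",
    "symbol": "foreground",
    "deictic": "partial",      # e.g., pointing to oneself recedes into the background
    "regulator": "background",
    "baton": "background",     # beats occur only as background gestures
}

# Rough foreground/background natures of the nine ISO 24617-2 dimensions.
DA_DIMENSION_NATURE = {
    "task": "foreground",
    "socialObligationsManagement": "foreground",  # can recede if routinized
    "autoFeedback": "background",
    "alloFeedback": "background",
    "timeManagement": "background",
    "turnManagement": "background",
    "discourseStructuring": "background",
    "ownCommunicationManagement": "background",
    "partnerCommunicationManagement": "background",
}

def natures_cohere(dimension: str, nci_class: str) -> bool:
    """A 'partial' NCI class coheres with either nature; otherwise they must match."""
    da, nci = DA_DIMENSION_NATURE[dimension], NCI_NATURE[nci_class]
    return nci == "partial" or da == nci

print(natures_cohere("turnManagement", "regulator"))  # prints True
```

Such a lookup only captures the default tendency; as emphasized above, individual gestures and DAs can cross the foreground-background line in context.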

For communication to be successful, the message must be as clear as possible; an utterance with a mismatched underlying nature is potentially confusing. For example, to take a turn, which is a typical background DA, one sometimes begins one's utterance with "look". The NCI accompanying this "look" is usually a subtle hand gesture (e.g., a referential deictic), completely void of meaning and therefore a background gesture. When one uses "look" in the propositional sense, by contrast, one uses a pointing gesture; both the DA and the NCI are, in this case, of a foreground nature. Using a pointing (foreground) gesture with the turn-taking (background) DA in the "look" example would therefore be confusing, steering the collocutor to search for an object in sight that does not exist. Therefore, to ensure cohesion and make communication more effective, it seems plausible that a non-propositional episode should involve both a background DA and a background NCI.

In light of this foreground-background link between DAs and the NCI of gestures, we set out to explore whether the theory of DAs can help predict the nature of the NCI of the corresponding unit. Specifically, we hypothesize that turn management DAs correlate with background gestures and propose the following hypothesis:

*Turn management DAs, as background expressions, will tend to co-occur with NCI of background nature. In particular, turn management DAs will co-occur primarily with communication regulators.*

**2. Data and methodology**

In order to investigate authentic non-verbal behavior during turn-taking, we utilized a 57-minute-long video recording from the Corpus EVA [18]. Our annotation scheme, adapted from Mlakar et al. [28] and outlined in **Figure 1**, was applied to the dataset to perform conversational analysis. For this research, dialog acts were added as a linguistic branch.

**Figure 1.**
*The topology of annotation in the EVA Corpus: The levels of annotation describing verbal and non-verbal contexts of conversational episodes.*

The main objective of the scheme is to identify the inferred meanings of co-verbal expressions as a function of linguistic, paralinguistic, and social signals (e.g., where and when to gesture) on a symbolic level, and to identify the physical nature (e.g., the articulation of body language) and use of the available "imaginary forms" (e.g., how to gesture, how to vocalize), i.e., the level of the interpretation of non-verbal forms. The first layer in **Figure 1**, the symbolic interpretation, is the focus of this research. It is used to analyze the interplay between various conversational signals, verbal and non-verbal (i.e., DAs, gestures, syntax, discourse markers), at a symbolic level. The second layer, the interpretation of form, is concerned with how information is expressed beyond language, through prosody and embodied expressions, as an abstract concept of a non-verbal conversational expression with a specific communicative intent, i.e., how it is physically realized, for example, the 'form' of a gesture or the 'accentuation' of speech. Its primary goal is to provide a description as close as possible to the physical reality and to the entity that will realize it (e.g., an embodied conversational agent). As already mentioned, in this chapter we focus on the first layer, which aims to find patterns and tendencies in how people communicate through the joint use of language, prosody, gaze, gesture, facial expressions, and other articulation of the
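Operationally, testing this hypothesis amounts to tallying which NCI classes co-occur with turn management DAs across the annotated segments. A minimal sketch, using entirely made-up segment data in place of the actual annotations:

```python
from collections import Counter

# Made-up (DA dimension, NCI class) pairs standing in for annotated segments.
segments = [
    ("turnManagement", "regulator"),
    ("turnManagement", "regulator"),
    ("task", "illustrator"),
    ("turnManagement", "deictic"),
    ("autoFeedback", "regulator"),
]

# Count the NCI classes that co-occur with turn management DAs.
cooccurrence = Counter(nci for dim, nci in segments if dim == "turnManagement")
print(cooccurrence.most_common())  # prints [('regulator', 2), ('deictic', 1)]
```

If the hypothesis holds, communication regulators should dominate this tally in the real annotated data.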
