*Types of Nonverbal Communication*

• **Regulators/adaptors (R)** define embodiments that are primarily used to model the flow of information exchange. Adaptors are regarded as part of background behavior and can be produced even without speech. They exist without a specific speech reference and do not link with a specific speech structure. Regulators are further classified into the self-adaptors (RS), communication regulators (RC), affect regulators (RA), manipulators (RM), and social function and obligation regulators (RO) subclasses. Self-adaptors relate to how a speaker continuously manages the planning and execution of his or her own communication. Communication regulators refer to managing interactions with other interlocutors through systems of turn-taking, feedback, and sequencing, e.g., interactive communication management (ICM). Affect regulators are either self- or person-addressed and are used to further emphasize or express attitude or emotion regarding a topic, object, or person. Manipulators convey relief or release of emotional tension, or outline states of the body or mind, such as anxiety, uncertainty, or nervousness. Finally, social function and obligation regulation primarily deals with embodied behavior used in social settings, such as greetings, goodbyes, and introductions.

• **Deictics (D)** include entities that can actually be present in the real environment of the gesturer (e.g., indicating objects, persons, or places) or that are present only ideally or abstractly in the discourse content (e.g., pointing upwards, or pointing backward to indicate the past). If deictic expressions are actual word referents with a semantic interlink, they are regarded as part of the foreground. If the semantic link does not exist or is weak, deictic expressions are instead recognized as part of the background. We further distinguish between pointers (DP), indexes/referential pointers (DR), and enumerators (DE).

• **Symbols/emblems (S)** tend to establish a strong semantic link with verbal counterparts. They are regarded as foreground and include all symbolic gestures and symbolic grammars. Their specific meaning is often culture-specific, as the same emblem can have different meanings in different cultures. Nevertheless, there are cross-cultural hand emblems, which are easily recognizable because, despite their arbitrary link with the speech they refer to, they have a direct verbal translation, which usually consists of one or two words or a whole sentence (often a traditional expression shared in a specific culture).

• **Batons (B)** are staccato strikes that create emphasis and grab attention, such as a single short baton that marks an important point in a conversation, whereas repeated batons can "hammer" a critical concept. Batons are equivalent to beats; however, beats may appear as a more random movement (e.g., outlining rhythm). Batons, on the other hand, may also set the rhythm and signal importance but, more importantly, they also outline the structure of verbal counterparts, e.g., tag a set of words that should be processed together (e.g., to produce a summary of the meaning of an utterance).

In terms of the background-foreground distribution of observed NCIs, the material contains predominantly non-verbal behavior "functioning" in the background. Overall, we observed roughly 1,684 non-verbal expressions, out of which 1,274 (75.65 percent) belonged to regulators and 136 (8.08 percent) to illustrators and symbols. The rest, 275 (16.33 percent), belonged to deictic expressions. The majority of NCIs is, therefore, of a background nature.

**2.4 Annotation agreement**

In total, five annotators were involved in this phase of annotation: two with a linguistic background and three with a technical background in machine interaction. Annotations were performed in separate sessions, each session dedicated to a specific signal. Annotation was performed in small groups, i.e., two or three annotators annotated the same signal. After the annotation, consensus was reached by observing and commenting on the values where there was little or no agreement among the annotators (including those not involved in the annotation of that signal). The final corpus was generated after all disagreements were resolved. Finally, procedures for checking inconsistencies were applied by an expert annotator.
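The consensus step described above — collecting the items on which annotators diverge and putting them up for group discussion — can be sketched as a simple adjudication filter. This is an illustrative sketch only; the annotator names, segment identifiers, and labels below are hypothetical and not taken from the corpus:

```python
from collections import defaultdict

# Hypothetical per-segment labels from three annotators of one signal.
annotations = {
    "ann1": {"seg01": "RS", "seg02": "RC", "seg03": "B"},
    "ann2": {"seg01": "RS", "seg02": "RA", "seg03": "B"},
    "ann3": {"seg01": "RS", "seg02": "RC", "seg03": "D"},
}

def items_for_adjudication(annotations):
    """Return the segments whose assigned labels differ across annotators."""
    labels_by_segment = defaultdict(set)
    for labels in annotations.values():
        for segment, label in labels.items():
            labels_by_segment[segment].add(label)
    # A segment needs adjudication if more than one distinct label was used.
    return sorted(s for s, ls in labels_by_segment.items() if len(ls) > 1)

print(items_for_adjudication(annotations))  # → ['seg02', 'seg03']
```

Segments on which all annotators agree pass through unchanged; the flagged ones are then discussed until consensus is reached.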

Before each session, the annotators were given an introductory presentation defining the nature of the signal they were to observe and the exact meaning of the finite set of values they could use. An experiment measuring agreement was also performed; it included an introductory annotation session in which preliminary inconsistencies were resolved. Overall, given the complexity of the task and the fact that the values in **Table 2** also cover cases with a possible duality of meaning, the level of agreement is acceptable and comparable to other multimodal corpus annotation tasks [29].

For the less complex signals, influenced primarily by a single modality (e.g., pitch, gesture unit, gesture phrase, body-part/modality, sentence type), the annotators' agreement measured in terms of Cohen's kappa [30] was high, namely between 0.75 and 0.90. Signals such as Part-of-Speech, Syntax, and Word Segmentation were annotated (semi-)automatically; the two expert annotators (linguists) oversaw the process and corrected the tags manually, and agreement was measured over the corrections made. Pitch was annotated completely automatically, so no agreement was measured. The only exceptions among the less complex, unimodal signals were Gesture phrases (0.53) and Prosodic phrases (0.71). These disagreements were expected, since in some cases it is quite ambiguous to identify where one phrase ends and the next starts. Moreover, in many cases the retraction phase of a gesture can be recognized as the stroke phase of the next gesture phrase.
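Cohen's kappa, used for the agreement scores above, corrects the raw observed agreement between two annotators by the agreement expected from their label distributions alone. A minimal self-contained sketch of the statistic (the two label sequences below are invented for illustration, reusing the NCI class codes B, D, and S):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    pe = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (po - pe) / (1 - pe)

# Toy example: two annotators labelling ten gesture phrases.
a = ["B", "B", "D", "S", "B", "D", "D", "S", "B", "B"]
b = ["B", "B", "D", "S", "D", "D", "D", "S", "B", "D"]
print(round(cohens_kappa(a, b), 3))  # → 0.697
```

A kappa of 1.0 means perfect agreement, 0.0 means agreement no better than chance; values in the 0.75–0.90 range reported above are conventionally read as substantial to almost perfect agreement.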

As summarized in Table 3, for the more complex signals that involve multiple modalities for their comprehension (including speech, gestures, and text), the disagreements in interpretation were, as expected, higher.


**Table 1.**

*A coarse-grained classification of the underlying nature of NCI classes and DA dimensions.*


**Table 2.**

*Results of the preliminary inter-coder agreement experiment.*
