**Part 2**

**AR in Biological, Medical and Human Modeling and Applications** 


**5**

**Augmented Reality Talking Heads as a Support for Speech Perception and Production**

Olov Engwall

*Centre for Speech Technology, School of Computer Science and Communication, KTH (Royal Institute of Technology), Stockholm, Sweden*

**1. Introduction**

Visual face gestures, such as lip, head and eyebrow movements, are important in all human speech communication as a support to the acoustic signal. This is true even if the speaker's face is computer-animated. The visual information about the phonemes, i.e. speech sounds, results in better speech perception (Benoît et al., 1994; Massaro, 1998), and the benefit is all the greater if the acoustic signal is degraded by noise (Benoît & LeGoff, 1998; Sumby & Pollack, 1954) or by a hearing impairment (Agelfors et al., 1998; Summerfield, 1979).

Many phonemes are, however, impossible to identify by seeing the speaker's face alone, because they are visually identical to other phonemes. Examples are sounds that differ only in voicing, such as [b] *vs.* [p], or sounds for which the difference in articulation is too far back in the mouth to be seen from the outside, such as [k] *vs.* [ŋ] or [h]. A good speech reader can determine to which viseme, i.e. which group of visually identical phonemes, a speech sound belongs, but must guess within this group. A growing community of hearing-impaired persons with residual hearing therefore relies on cued speech (Cornett & Daisey, 1992) to identify the phoneme within each viseme group. With cued speech, the speaker conveys additional phonetic information with hand sign gestures. These hand sign gestures are, however, arbitrary and must be learned by both the speaker and the listener. Moreover, cued speech can only be used when the speaker and listener can see each other.
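To make the viseme grouping concrete, a minimal sketch follows; the two groups shown are illustrative examples taken from the sounds mentioned above, not a complete viseme classification:

```python
# Illustrative phoneme-to-viseme groups (simplified; IPA symbols as strings).
# Phonemes within one group look alike on the face, so a speech reader can
# narrow a sound down to its group but must guess the member within it.
VISEME_GROUPS = {
    "bilabial": ["p", "b", "m"],             # [b] vs. [p]: differ only in voicing
    "back/invisible": ["k", "g", "ŋ", "h"],  # articulated too far back to be seen
}

def viseme_of(phoneme):
    """Return the name of the viseme group a phoneme belongs to, if any."""
    for group, members in VISEME_GROUPS.items():
        if phoneme in members:
            return group
    return None

print(viseme_of("b"))  # -> "bilabial" (visually identical to [p] and [m])
print(viseme_of("k"))  # -> "back/invisible"
```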

An alternative to cued speech would therefore be to make the differences between the phonemes directly visible in an augmented reality display of the speaker's face. The basic idea is the following: speech recognition is performed on the speaker's utterances, resulting in a continuous transcription of phonemes. These phonemes are used in real time as input to a computer-animated talking head, to generate an animation in which the talking head produces the same articulatory movements as the speaker just did. By delaying the acoustic signal from the speaker slightly (about 200 ms), the original speech can be presented together with the computer animation, thus giving the listener the possibility to use audiovisual information for speech perception. An automatic lip-reading support of this type already exists in the SYNFACE extension (Beskow et al., 2004) to the internet telephony application Skype. Using the same technology, but adding augmented reality, the speech perception support can be extended to display not only facial movements, but face and tongue movements together, in displays similar to the ones shown in Fig. 1.
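As a rough illustration of this pipeline (a minimal sketch; the component names and the 20 ms frame size are assumptions, not part of SYNFACE or this chapter), recognized phonemes could drive the talking head immediately while the captured audio is buffered for about 200 ms, so that sound and animation reach the listener together:

```python
from collections import deque

AUDIO_DELAY_S = 0.2   # ~200 ms delay of the acoustic signal, as described above
FRAME_S = 0.02        # assumed 20 ms audio frames
DELAY_FRAMES = int(AUDIO_DELAY_S / FRAME_S)

def run_pipeline(mic, recognizer, talking_head, speaker_out):
    """Hypothetical real-time loop: animate now, play the audio later.

    Assumed interfaces (not from the chapter):
      mic.read_frame()          -> one 20 ms chunk of audio samples
      recognizer.step(frame)    -> phoneme label for the chunk, or None
      talking_head.animate(ph)  -> update face and tongue articulation
      speaker_out.play(frame)   -> output one audio chunk
    """
    delay_buffer = deque()                # holds audio during the delay
    while True:
        frame = mic.read_frame()
        phoneme = recognizer.step(frame)  # continuous phoneme transcription
        if phoneme is not None:
            talking_head.animate(phoneme)  # articulate without delay
        delay_buffer.append(frame)
        if len(delay_buffer) > DELAY_FRAMES:
            # Audio leaves ~200 ms after capture, in sync with the animation.
            speaker_out.play(delay_buffer.popleft())
```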

This type of speech
