**2.1 Overview of human-machine interaction systems**

Both machines and humans measure their environment through sense or input interfaces and modify their environment through expression or output interfaces. The most popular mode of human-computer or human-intelligent machine interaction is still based on keyboards and mice. These devices are familiar, but they lack naturalness and do not support remote control or telerobotic interfaces. In recent years researchers have therefore devoted considerable effort to finding attractive and natural user interface devices. The term natural user interface is not a precise one, but it usually denotes an interface that is as simple, easy to use and seamless as possible. Multimodal user interfaces are a strong candidate for building natural user interfaces. In a multimodal approach the user can combine a simple keyboard and mouse with advanced perception techniques such as speech recognition and computer vision (gestures, gaze, etc.) as human-machine interface tools.

Weimer et al. [Weimer, 1989] described a multimodal environment that uses gesture and speech input to control a CAD system. They used a 'Dataglove' to track hand gestures and presented objects in three dimensions using polarizing glasses. Yang et al. [Yang, 1998] implemented a camera-based tracker for the face and facial features (eyes, lips and nostrils). Their system can also estimate the user's gaze direction and head pose. They implemented two multimodal applications: a lip-reading system and a panoramic image viewer. The lip-reading system improves speech recognition accuracy by using visual input to disambiguate acoustically confusing speech elements. The panoramic image viewer uses gaze to control panning and tilting, and speech to control zooming. Perzanowski et al. [Perzanowski, 2001] proposed a multimodal human-robot interface for mobile robots. They incorporated both natural language understanding and gesture recognition as communication modes and implemented their method on a team of 'Nomad 200' and 'RWI ATRV-Jr' robots. These robots understand speech, hand gestures and input from a handheld Palm Pilot or other Personal Digital Assistant (PDA).

To use gestures in HCI or HRI, the gestures must be interpreted by the computer or robot. Interpreting human gestures requires static or dynamic modelling of the human hand, arm, face and other parts of the body in a form that is measurable by computers or intelligent machines. The first attempts to measure gesture features (hand pose and/or arm joint angles and spatial positions) were made with so-called glove-based devices [Sturman, 1994]. The problems associated with gloves and other contact interface devices can be avoided by using vision-based, non-contact and nonverbal communication techniques, and a number of approaches have been applied to the visual interpretation of gestures for human-machine interaction [Pavlovic, 1997]. Torrance [Torrance, 1994] proposed a natural language-based interface for teaching mobile robots the names of places in an indoor environment, but due to the lack of speech recognition his interface still relied on keyboard-based text input. Kortenkamp et al. [Kortenkamp, 1996] developed a gesture-based human-mobile robot interface that uses static arm poses as gestures. Waldherr et al. proposed a gesture-based interface for human and service robot interaction [Waldherr, 2000]; they combined a template-based approach and a neural network-based approach for tracking a person and recognizing gestures involving arm motion, and they proposed illumination adaptation methods but did not consider user or hand pose adaptation. Bhuiyan et al. detected and tracked the face and eyes for human-robot interaction [Bhuiyan, 2004], but only the largest skin-like region is taken as the probable face, which may not be correct when two hands are also present in the image. All of these papers focus primarily on visual processing and neither maintain knowledge of different users nor consider how to deal with them.
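
A minimal sketch of the skin-colour segmentation step mentioned for [Bhuiyan, 2004] is given below. It is an illustration only, not the cited implementation: it assumes OpenCV is available, the HSV thresholds are rough placeholder values, and it keeps the few largest skin-coloured regions rather than only the largest one, so that a face and two hands can all be retained.

```python
import cv2
import numpy as np

def skin_regions(bgr_frame, min_area=500, keep=3):
    """Return bounding boxes of the largest skin-coloured regions.

    Keeping several regions (instead of only the largest) avoids losing
    the hands when the face and both hands appear in the same image.
    """
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    # Rough skin-colour range in HSV; the thresholds are illustrative only.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 180, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    blobs = [c for c in contours if cv2.contourArea(c) >= min_area]
    blobs.sort(key=cv2.contourArea, reverse=True)
    return [cv2.boundingRect(c) for c in blobs[:keep]]
```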

**2.2 Face detection and recognition**

The eigenface algorithm uses principal component analysis (PCA) for dimensionality reduction and to find the vectors that best account for the distribution of face images within the entire face image space. Turk and Pentland [Turk, 1991] first successfully used eigenfaces for face detection and for person identification or face recognition. In this method a training dataset is prepared from the known face images. The face space is defined by the "eigenfaces", which are eigenvectors generated from the training face images. Face images are projected onto this feature space (the eigenfaces), which best encodes the variation among the known face images. Recognition is performed by projecting a test image onto the "face space" (spanned by the m eigenfaces) and then classifying the face by comparing its position (Euclidean distance) in face space with the positions of known individuals.
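
A minimal numpy sketch of this eigenface procedure is shown below. It is an illustration rather than the implementation of [Turk, 1991]; the training matrix, the label list and the number of eigenfaces m are placeholders supplied by the caller.

```python
import numpy as np

def train_eigenfaces(train_images, m):
    """train_images: (N, d) array, one flattened face image per row."""
    mean_face = train_images.mean(axis=0)
    A = train_images - mean_face                     # centre the training faces
    # Right singular vectors of the centred data are the eigenvectors of the
    # covariance matrix, i.e. the eigenfaces, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    eigenfaces = Vt[:m]                              # (m, d) face-space basis
    weights = A @ eigenfaces.T                       # (N, m) known-face positions
    return mean_face, eigenfaces, weights

def recognise(test_image, mean_face, eigenfaces, weights, labels):
    """Project a test face into face space and return the closest known label."""
    w = (test_image - mean_face) @ eigenfaces.T      # position in face space
    distances = np.linalg.norm(weights - w, axis=1)  # Euclidean distances
    return labels[int(np.argmin(distances))]
```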

Independent component analysis (ICA) is similar to PCA except that the distributions of the components are designed to be non-Gaussian; ICA separates the higher-order moments of the input in addition to the second-order moments utilized in PCA [Bartlett, 1998]. Face recognition systems using Linear Discriminant Analysis (LDA), or Fisher Linear Discriminant Analysis (FDA), have also been very successful [Belhumeur, 1997]. In feature-based matching methods, facial features such as the eyes, lips, nose and mouth are extracted first, and their locations and local statistics (geometric shape or appearance) are fed into a structural classifier [Kanade, 1977]. One of the most successful of these methods is the Elastic Bunch Graph Matching (EBGM) system [Wiskott, 1997]. Other well-known methods are based on Hidden Markov Models (HMM) and convolutional neural networks [Rowley, 1997].
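
For the LDA/FDA family cited above, a Fisherface-style classifier can be sketched with off-the-shelf components. The snippet assumes scikit-learn is available and applies PCA before LDA, the usual arrangement that keeps the within-class scatter matrix from becoming singular; the function name and dimensions are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

def build_fisherface_classifier(train_images, labels, n_components=50):
    """train_images: (N, d) flattened faces; labels: (N,) person identities."""
    # PCA first reduces dimensionality so that LDA's scatter matrices are
    # well conditioned; LDA then separates the known identities.
    model = make_pipeline(PCA(n_components=n_components),
                          LinearDiscriminantAnalysis())
    model.fit(train_images, labels)
    return model

# Example usage (shapes are placeholders):
# model = build_fisherface_classifier(X_train, y_train)
# person = model.predict(x_test.reshape(1, -1))
```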

**2.3 Gesture recognition and gesture based interface**

A gesture is a motion of parts of the body or of the whole body that conveys information [Billinghurst, 2002]. The first step in considering gesture-based interaction with computers or robots is to understand the role of gestures in human-human communication. Gestures vary among individuals and from instance to instance for a given individual, and gesture meanings can follow one-to-many or many-to-one mappings. Two approaches are commonly used to recognize gestures. The first is the glove-based approach, which requires wearing cumbersome contact devices and generally carrying a load of cables that connect the devices to a computer [Sturman, 1994]. The second is the vision-based approach, which does not require wearing any contact device but uses a set of video cameras and computer vision techniques to interpret gestures [Pavlovic, 1997]. Although glove-based approaches provide more accurate results, they are expensive and encumbering; computer vision techniques overcome these limitations. In general, vision-based systems are more natural than glove-based systems and are capable of hand, face and body tracking, but they do not provide the same accuracy in pose determination. For general purposes, however, a high level of accuracy may be less important than a real-time and inexpensive method. In addition, many gestures involve two hands, yet most research efforts in glove-based gesture recognition use only one glove for data acquisition, whereas in vision-based systems we can use two hands and facial gestures at the same time.

Vision-based gesture recognition systems have three major components: image processing or extraction of important cues (hand or face pose and position), tracking of the gesture features (relative position or motion of the face and hand poses), and gesture interpretation. Vision-based gesture recognition systems vary along a number of dimensions, such as the number of cameras used, speed and latency (real-time or not), and the structural environment (restrictions on lighting conditions, and so on).
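
These three components can be arranged as a simple processing loop, sketched below under the assumption of a single camera read with OpenCV. The callables detect_features, update_tracks and interpret are hypothetical placeholders for whichever extraction, tracking and interpretation methods a particular system uses.

```python
import cv2

def run_gesture_interface(detect_features, update_tracks, interpret, camera_id=0):
    """Generic vision-based gesture pipeline: extract cues, track them, interpret."""
    capture = cv2.VideoCapture(camera_id)
    tracks = None
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            cues = detect_features(frame)          # 1. hand/face pose and position
            tracks = update_tracks(tracks, cues)   # 2. follow the cues over time
            command = interpret(tracks)            # 3. map the motion to a gesture
            if command is not None:
                print("recognised gesture:", command)
    finally:
        capture.release()
```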
