Owing to the lack of speech recognition, his interface system still uses keyboard-based text input. Kortenkamp et al. [Kortenkamp, 1996] developed a gesture-based human-mobile robot interface that uses static arm poses as gestures. Waldherr et al. [Waldherr, 2000] proposed a gesture-based interface for human and service robot interaction, combining a template-based approach and a neural-network-based approach to track a person and recognize gestures involving arm motion. They proposed illumination adaptation methods but did not consider user or hand-pose adaptation. Bhuiyan et al. [Bhuiyan, 2004] detected and tracked the face and eyes for human-robot interaction, but considered only the largest skin-like region as the probable face, which may not hold when two hands are also present in the image. However, all of the above papers focus primarily on visual processing and neither maintain knowledge of different users nor consider how to deal with them.

**2.2 Face detection and recognition**

The first step of any face recognition or visual person identification system is to locate the face in the image. After locating the probable face, researchers use facial feature (eyes, nose, nostrils, eyebrows, mouth, lips, etc.) detection methods to detect the face accurately [Yang, 2000]. Face recognition or person identification compares an input face image or image features against a known face or feature database and reports a match, if any. The following two subsections summarize promising past research in the fields of face detection and recognition.

**2.2.1 Face detection**

Face detection from a single image or an image sequence is a difficult task due to variability in pose, size, orientation, color, expression, occlusion and lighting conditions. To build a fully automated system that extracts information from images of human faces, it is essential to develop efficient algorithms to detect human faces. Visual face detection has been studied extensively over the last decade. Researchers have grouped face detection work into four categories: template matching approaches [Sakai, 1996] [Miao, 1999] [Augusteijn, 1993] [Yuille, 1992] [Tsukamoto, 1994]; feature invariant approaches [Sirohey, 1993]; appearance-based approaches [Turk, 1991]; and knowledge-based approaches [Yang, 1994] [Yang, 2002] [Kotropoulous, 1997]. Many researchers have also used skin color as a feature, which yields remarkably good face tracking as long as the lighting conditions do not vary too much [Dai, 1996] [Crowley, 1997] [Bhuiyan, 2003] [Hasanuzzaman, 2004b]. Recently, several researchers have combined multiple features for face localization and detection; such methods are more robust than single-feature approaches. Yang and Ahuja [Yang, 1998] proposed a face detection method based on color, structure and geometry. Saber and Tekalp [Saber, 1998] presented a frontal-view face localization method based on color, shape and symmetry.
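As an illustration of the skin-color cue discussed above, the following sketch segments skin-like pixels in the YCrCb color space and takes the largest connected component as the probable face region. It is a minimal example rather than any of the cited systems; the function name and the Cr/Cb threshold ranges are illustrative rule-of-thumb values that would need tuning for a given camera and illumination.

```python
import cv2
import numpy as np

def detect_face_candidate(bgr_image):
    """Return the bounding box of the largest skin-colored region, or None.

    A minimal skin-color segmentation sketch: threshold the Cr/Cb
    chrominance channels, clean the mask, and keep the largest blob.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    # Rule-of-thumb chrominance bounds for skin; tune per camera/lighting.
    mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    # Morphological opening/closing to suppress noise and fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)  # (x, y, w, h)
```

Note that taking the largest skin-like component fails exactly in the situation raised above for [Bhuiyan, 2004]: a hand closer to the camera than the face, which is one motivation for the multi-feature methods.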

**2.2.2 Face recognition**

During the last few years face recognition has received significant attention from researchers [Zhao, 2003] [Chellappa, 1995]. Zhao et al. [Zhao, 2003] summarized recent research on face recognition methods in three categories: holistic matching methods, feature-based matching methods and hybrid methods. One of the most widely used representations for face recognition is eigenfaces, which are based on principal component analysis (PCA). The eigenface algorithm uses PCA for dimensionality reduction and to find the vectors that best account for the distribution of face images within the entire face image space. Turk and Pentland [Turk, 1991] first successfully used eigenfaces for face detection and person identification or face recognition. In this method, a training dataset is prepared from known face images. The face space is defined by the "eigenfaces", which are eigenvectors generated from the training face images. Face images are projected onto the feature space (or eigenfaces) that best encodes the variation among known face images. Recognition is performed by projecting a test image onto the "face space" (spanned by the m eigenfaces) and then classifying the face by comparing its position (Euclidean distance) in face space with the positions of known individuals.
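The eigenface procedure described above can be stated compactly in NumPy. The sketch below follows the Turk and Pentland recipe (mean subtraction, PCA via SVD, projection, nearest neighbor by Euclidean distance); the function names, variable names and the choice of m are illustrative, not taken from the original paper.

```python
import numpy as np

def train_eigenfaces(faces, m):
    """faces: (n_images, n_pixels) array of flattened, aligned face images.

    Returns the mean face, the top-m eigenfaces, and the training
    projections (positions of known individuals in face space).
    """
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data: rows of vt are the principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = vt[:m]                       # (m, n_pixels)
    projections = centered @ eigenfaces.T     # (n_images, m)
    return mean, eigenfaces, projections

def recognize(test_face, mean, eigenfaces, projections, labels):
    """Project a flattened test image and return the nearest known label."""
    weights = (test_face - mean) @ eigenfaces.T
    distances = np.linalg.norm(projections - weights, axis=1)
    return labels[int(np.argmin(distances))]
```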


Independent component analysis (ICA) is similar to PCA except that the distributions of the components are designed to be non-Gaussian. ICA separates high-order moments of the input in addition to the second-order moments utilized in PCA [Bartlett, 1998]. Face recognition systems using Linear Discriminant Analysis (LDA), or Fisher Linear Discriminant Analysis (FDA), have also been very successful [Belhumeur, 1997]. In feature-based matching methods, facial features such as the eyes, lips, nose and mouth are extracted first, and their locations and local statistics (geometric shape or appearance) are fed into a structural classifier [Kanade, 1977]. One of the most successful of these methods is the Elastic Bunch Graph Matching (EBGM) system [Wiskott, 1997]. Other well-known methods in this family are Hidden Markov Models (HMMs) and convolutional neural networks [Rowley, 1997].
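For comparison with the eigenface sketch, a Fisherface-style classifier in the spirit of [Belhumeur, 1997] can be assembled from standard scikit-learn pieces: PCA first (to keep the within-class scatter matrix non-singular), then LDA, then nearest neighbor. This is a generic sketch under those assumptions, not the authors' implementation; the component counts are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# faces: (n_images, n_pixels) flattened training images; labels: person ids.
# Reducing with PCA first avoids a singular within-class scatter matrix.
fisherfaces = make_pipeline(
    PCA(n_components=100),               # placeholder dimensionality
    LinearDiscriminantAnalysis(),        # at most n_classes - 1 dimensions
    KNeighborsClassifier(n_neighbors=1)  # nearest neighbor in Fisher space
)
# fisherfaces.fit(faces, labels)
# predicted = fisherfaces.predict(test_faces)
```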

**2.3 Gesture recognition and gesture-based interface**

A gesture is a motion of the body parts or the whole body that contains information [Billinghurst, 2002]. The first step in considering gesture-based interaction with computers or robots is to understand the role of gestures in human-human communication. Gestures vary among individuals, and even from instance to instance for a given individual. Gesture meanings may also follow one-to-many or many-to-one mappings. Two approaches are commonly used to recognize gestures. One is the glove-based approach, which requires wearing cumbersome contact devices and generally carrying a load of cables that connect the devices to computers [Sturman, 1994]. The other is the vision-based approach, which does not require wearing any contact devices but uses a set of video cameras and computer vision techniques to interpret gestures [Pavlovic, 1997]. Although glove-based approaches provide more accurate results, they are expensive and encumbering; computer vision techniques overcome these limitations. In general, vision-based systems are more natural than glove-based systems and are capable of hand, face and body tracking, but they do not provide the same accuracy in pose determination. However, for general purposes, achieving higher accuracy may be less important than having a real-time and inexpensive method. In addition, many gestures involve two hands, but most research efforts in glove-based gesture recognition use only one glove for data acquisition. In vision-based systems, two hands and facial gestures can be used at the same time.

Vision-based gesture recognition systems have three major components: image processing, or extracting important clues (hand or face pose and position); tracking the gesture features (relative position or motion of face and hand poses); and gesture interpretation. Vision-based gesture recognition systems vary along a number of dimensions: number of cameras, speed and latency (real-time or not), structured environment (restrictions on lighting conditions and background), primary features (color, edges, regions, moments, etc.), user requirements, and so on. Multiple cameras can be used to overcome occlusion problems in image acquisition, but this adds correspondence and integration problems. The first phase of a vision-based gesture recognition task is to select a model of the gesture. The modelling of gestures depends on the applications intended for the gesture. There are two different approaches to vision-based modelling of gestures: the model-based approach and the appearance-based approach. Model-based techniques try to create a 3D model of the user's hand (parameters: joint angles and palm position) [Rehg, 1994] or a contour model of the hand [Shimada, 1996] [Lin, 2002] and use it for gesture recognition. The 3D models can be classified into two large groups: volumetric models and skeletal models. Volumetric models are meant to describe the 3D visual appearance of the human hands and arms, while skeletal models are related to the human hand skeleton.
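To make the model-based alternative concrete, the sketch below shows one plausible parameterization of a skeletal hand model with joint angles and palm position, in the spirit of the [Rehg, 1994] line of work. The field layout and joint counts are illustrative assumptions, not the actual state vector of any cited system; a fitting loop would search this parameter space so that the rendered model matches the observed image features.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SkeletalHandModel:
    """Illustrative skeletal hand state: global palm pose plus joint angles."""
    palm_position: np.ndarray = field(default_factory=lambda: np.zeros(3))
    palm_orientation: np.ndarray = field(default_factory=lambda: np.zeros(3))
    # Assumed 4 revolute joints per finger x 5 fingers
    # (one abduction angle + three flexion angles each).
    joint_angles: np.ndarray = field(default_factory=lambda: np.zeros((5, 4)))

    def as_vector(self):
        """Flatten to the 26-D parameter vector a model-fitting loop optimizes."""
        return np.concatenate([self.palm_position,
                               self.palm_orientation,
                               self.joint_angles.ravel()])
```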

Once the model is selected, an image analysis stage is used to compute the model parameters from image features that are extracted from single or multiple video input streams. The image analysis phase includes hand localization, hand tracking, and selection of suitable image features for computing the model parameters. Two types of cues are often used for gesture or hand localization: color cues and motion cues. The color cue is useful because the human skin color footprint is distinctive from the color of the background and of human clothes [Kjeldsen, 1996] [Hasanuzzaman, 2004d]. Color-based techniques track objects defined by a set of colored pixels whose saturation and values (or chrominance values) satisfy a range of thresholds. The major drawback of color-based localization methods is that the skin color footprint varies under different lighting conditions, and human body colors vary as well. Infrared cameras have been used to overcome the limitations of skin-color-based segmentation [Oka, 2002].
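Hand tracking, mentioned above, is often implemented by following a color histogram from frame to frame. The sketch below uses OpenCV's CamShift on a hue back-projection, one standard way to track a skin-colored blob once a localization step has found it; the function name, initial window and histogram settings are illustrative assumptions.

```python
import cv2
import numpy as np

def track_hand(video_path, init_window):
    """Follow a skin-colored region across frames with CamShift.

    init_window: (x, y, w, h) of the hand found by a localization step.
    Yields the tracked window for each subsequent frame.
    """
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    x, y, w, h = init_window
    roi = frame[y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    # Hue histogram of the initial hand region (saturation/value masked).
    mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
    hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = init_window
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.CamShift(backproj, window, criteria)
        yield window
    cap.release()
```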

Motion-based segmentation is done simply by subtracting images from the background [Freeman, 1996]; the limitation of this method is that it assumes the background or camera is static. Moving objects in the video stream can also be detected by inter-frame differences and optical flow [Cutler, 1998], but such a system cannot detect a stationary hand or face. To overcome these individual shortcomings, some researchers fuse color and motion cues [Azoz, 1998]. The computation of model parameters is the last step of the gesture analysis phase and is followed by the gesture recognition phase. The type of computation depends on both the model parameters and the features that were selected. In the recognition phase, parameters are classified and interpreted in light of the accepted model or the rules specified for gesture interpretation. Two tasks are commonly associated with the recognition process: optimal partitioning of the parameter space and implementation of the recognition procedure. The task of optimal partitioning is usually addressed through different learning-from-examples training procedures. The key concern in the implementation of the recognition procedure is computational efficiency. A recognition method usually determines confidence scores or probabilities that define how closely the image data fits each model. Gesture recognition methods are divided into two categories: static gestures, or hand postures, and dynamic gestures, or motion gestures.
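A minimal version of the motion cue is frame differencing: threshold the absolute difference between consecutive grayscale frames and treat the changed pixels as moving regions. The sketch below shows the idea (the threshold value is an assumption); as noted above, it inherits the static-camera limitation and misses a hand that stops moving.

```python
import cv2

def moving_regions(prev_gray, curr_gray, threshold=25):
    """Binary mask of pixels that changed between two grayscale frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    # Dilate so fragmented motion pixels merge into coherent blobs.
    mask = cv2.dilate(mask, None, iterations=2)
    return mask
```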

Eigenspace methods, or PCA, are also used for hand pose classification, similarly to their use in face detection and recognition. Moghaddam and Pentland used eigenspaces (eigenhands) and principal component analysis not only to extract features, but also as a method to estimate complete density functions for localization [Moghaddam, 1995]. In our previous research, we used PCA for hand pose classification from the three largest skin-like components segmented from real-time captured images [Hasanuzzaman, 2004d]. Triesch et al. [Triesch, 2002] employed elastic graph matching techniques to classify hand postures against complex backgrounds. They represented hand postures by labeled graphs with an underlying two-dimensional topology; attached to the nodes are jets, a sort of local image description based on Gabor filters. This approach achieves scale-invariant and user-invariant recognition and does not need hand segmentation, but it is not view-independent, because it uses one graph per hand posture. The major disadvantage of this algorithm is its high computational cost.

Appearance-based approaches use template images or features from the training images (images, image geometry parameters, image motion parameters, fingertip positions, etc.) for gesture recognition [Birk, 1997]. Gestures are modeled by relating the appearance of any gesture to the appearance of a set of predefined template gestures. A different group of appearance-based models uses 2D hand image sequences as gesture templates; for each gesture, a number of images with small orientation variations are used [Hasanuzzaman, 2004a]. Appearance-based approaches are generally computationally less expensive than model-based approaches because they do not require translating 2D information into a 3D model. Dynamic gesture recognition is accomplished using Hidden Markov Models (HMMs), Dynamic Time Warping, Bayesian networks or other pattern recognition methods that can recognize sequences over time steps. Nam et al. [Nam, 1996] used HMM methods for the recognition of space-time hand gestures. Darrel et al. [Darrel, 1993] used the Dynamic Time Warping method, a simplification of HMMs, to compare sequences of images against previously trained sequences by adjusting the lengths of the sequences appropriately. Cutler et al. [Cutler, 1998] used a rule-based system for gesture recognition in which image features are extracted by optical flow. Yang [Yang, 2000] recognized hand gestures using motion trajectories: first the two-dimensional motion in an image is extracted, and then motion patterns are learned from the extracted trajectories using a time-delay network.
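As a concrete version of the HMM approach, one common setup trains one HMM per gesture class on feature sequences (e.g., hand-centroid trajectories) and labels a new sequence by the model with the highest log-likelihood. The sketch below uses the hmmlearn package; it is a generic illustration under those assumptions, not the system of [Nam, 1996].

```python
import numpy as np
from hmmlearn import hmm

def train_gesture_models(sequences_by_gesture, n_states=4):
    """Fit one Gaussian HMM per gesture class.

    sequences_by_gesture: dict mapping gesture name to a list of
    (T_i, n_features) arrays, e.g. hand-centroid trajectories.
    """
    models = {}
    for name, seqs in sequences_by_gesture.items():
        X = np.vstack(seqs)               # concatenated observations
        lengths = [len(s) for s in seqs]  # per-sequence lengths
        model = hmm.GaussianHMM(n_components=n_states, n_iter=50)
        model.fit(X, lengths)
        models[name] = model
    return models

def classify(models, sequence):
    """Label a new (T, n_features) sequence by maximum log-likelihood."""
    return max(models, key=lambda name: models[name].score(sequence))
```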

**3. Frame-based knowledge-representation system for gesture-based HRI** 

The 'frame-based approach' is a knowledge-based problem-solving approach based on the so-called 'frame theory' first proposed by Marvin Minsky [Minsky, 1974]. A frame is a data structure for representing a stereotyped unit of human memory, including definitive and procedural knowledge. Attached to each frame are several kinds of information about the particular object or concept it describes, such as a name and a set of attributes called slots. Collections of related frames are linked together into frame systems. The frame-based approach has been used successfully in many robotic applications [Ueno, 2002]. Ueno presented the concepts and methodology of knowledge modeling based on cognitive science for realizing the autonomous humanoid service robot arm and hand system HARIS [Ueno, 2000]. A knowledge-based software platform called SPAK (Software Platform for Agent and Knowledge management) has been developed for intelligent service robots in an internet-based distributed environment [Ampornaramveth, 2004]. SPAK was developed to be a platform on which various software components for different robotic tasks can be integrated over a networked environment; it works as a knowledge and data management system, communication channel, intelligent recognizer, intelligent scheduler, and so on. Zhang et al. [Zhang, 2004b] developed an industrial robot arm control system using SPAK, in which SPAK works as a communication channel and intelligent robot action scheduler. Kiatisevi et al. [Kiatisevi, 2004] has proposed a
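To illustrate the frame idea described above, the following minimal sketch represents frames with named slots and parent links into a frame system. It is a toy data structure for exposition only, not SPAK's actual implementation or API; the example frames and the robot interface are hypothetical.

```python
class Frame:
    """A toy frame: a named unit of knowledge with slots and a parent link."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent          # link into the frame-system hierarchy
        self.slots = dict(slots)      # attribute name -> value or procedure

    def get(self, slot):
        """Look up a slot, inheriting from parent frames if absent here."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)
        raise KeyError(slot)

# A tiny hypothetical frame system for gesture-based HRI.
gesture = Frame("Gesture", meaning=None, response=None)
wave = Frame("WaveHand", parent=gesture, meaning="greeting",
             response=lambda robot: robot.say("Hello!"))
```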

