**Learning Novel Objects for Domestic Service Robots**

Muhammad Attamimi<sup>1</sup>, Tomoaki Nakamura<sup>1</sup>, Takayuki Nagai<sup>1</sup>, Komei Sugiura<sup>2</sup> and Naoto Iwahashi<sup>2</sup>

> <sup>1</sup>*The University of Electro-Communications* <sup>2</sup>*National Institute of Information and Communications Technology, Japan*

#### **1. Introduction**


It is fair to say that robots that can interact with and serve humans, especially in domestic environments, will spread widely in the near future. A fundamental capability required of such domestic service robots is mobile manipulation, and many humanoid robots have been developed with this ability (1–5). Recently, competitions such as RoboCup@Home (6), the Mobile Manipulation Challenge (7), and the Semantic Robot Vision Challenge (8) have been proposed to evaluate such robots.

Since these tasks target domestic service robots, it stands to reason that natural interaction, such as spoken instruction, should be used for mobile manipulation. Here, we focus on mobile manipulation driven by natural speech instructions such as "Bring me X", where X is an out-of-vocabulary (OOV) word. Realizing this task requires the integration of navigation, manipulation, speech recognition, and image recognition.

Image and speech recognition are especially difficult when novel objects enter the system: there are objects specific to each home, and new products are continually brought in. It is impossible to register the names and images of all such objects with the robot in advance. Hence, we propose a method for learning novel objects through a simple procedure.

The robot on which the proposed learning method is implemented is intended for use in a private domestic environment. Therefore, the procedure for teaching objects to the robot must be simple. For example, the user says, "This object is X" (X is the name of the object) and shows the object to the robot (Fig. 1: Left). With this procedure, it is easy for a user to teach a robot many objects. The user can then order the robot to bring him/her something, for example, "Bring me X" (Fig. 1: Right). As mentioned earlier, such extended manipulation tasks are necessary for domestic service robots. However, there are three problems in teaching novel objects to robots. The first problem is speech recognition of an object's name. In conventional methods, the phonemes of names must be registered in an internal dictionary, but it is impossible to register all objects in advance. The second problem is speech synthesis. A robot must utter the name of a recognized object to interact with humans, e.g., "Is it X?" However, conventional robot utterance systems cannot utter a word that is not registered in the dictionary. Even if the phoneme sequence of an OOV word can be recognized, it cannot be used for speech synthesis, since the accuracy of phoneme recognition is less than 90%. The third problem is segmentation of the object region from a scene in the learning phase. When a robot learns an object, it must find the object region in the scene and segment it.

Fig. 1. Left: The user teaches the object to the robot. Right: The robot recognizes and utters the OOV word.


Methods for extracting OOV words from speech have been proposed (9) to solve the first problem. The phonemes of OOV words can be obtained with these methods, but they are not always correct. To solve the second problem, the user can be asked to restate the OOV word repeatedly until the correct phonemes are obtained (10). The robot can then utter the word correctly, but ideally the robot should learn the word from a single utterance. There are also methods for situations in which the correct phonemes are not obtained; with these, the user utters the spelling of the OOV word to correct the phonemes (11). However, learning OOV words by recognizing their spelling takes a long time in Japanese or Chinese.

Considering these problems, we propose the system shown in Fig. 2. We solve the first problem by extracting OOV words from a template sentence. The second problem could be addressed by uttering the phonemes of OOV words with a text-to-speech (TTS) system; however, it is difficult to recognize phonemes correctly. In the proposed method, the OOV part of the user's speech is instead converted to the robot's voice by voice conversion using Eigenvoice Gaussian mixture models (EGMMs) (12).

Fig. 2. Overview of learning novel objects.

There has been research on image segmentation (13–15) relevant to the third problem. The method developed by Rother *et al.* (13) requires a rough hand-drawn region around the object. Shi and Malik's method (14) can segment images automatically, but it cannot determine which segment is the object region. Mishra and Aloimonos's method (15) can segment the object accurately using color, 3D information, and motion; however, an initial point inside the object region must be specified.

In contrast, because the proposed method is designed for a human teaching an object to a robot, the object can be assumed to move with the user (e.g., with hand movements) and can therefore be extracted even from a complicated scene. A color histogram and scale-invariant feature transform (SIFT) features are computed from the extracted object and registered in a database. This information is used for object recognition.

We implemented the proposed method on a robot called "DiGORO". We believe it is important to evaluate the robot in a realistic domestic environment with a realistic task. When the robot moves to an object, it does not always arrive at an ideal position or angle, and the illumination changes with the position. The system needs to work well in such an environment.


In this research, we used the "Supermarket" task of RoboCup@Home (6) as the extended mobile manipulation task. RoboCup@Home is a competition that tests the abilities of robots in a domestic environment, and Supermarket is a standardized task based on the fetch-and-carry operation. There are also other tasks that can be used for evaluation (7; 8). The Semantic Robot Vision Challenge (8) evaluates the ability of a robot to find an object in a real environment; however, only three teams participated in the 2009 competition, and the challenge does not evaluate manipulation. The Mobile Manipulation Challenge (7) was held at the 2010 International Conference on Robotics and Automation; although it evaluates the mobile manipulation ability of robots, only four teams participated. It is difficult to determine which task should be used for evaluating robots, even though several tasks (6–8) exist for this purpose. We used one of the tasks of RoboCup@Home, which we believe is the most standard: RoboCup@Home has the largest number of participants<sup>1</sup> and clearly stated rules that are open to the public and improved every year. For these reasons, such standardized tasks are better than self-defined ones.

This chapter is organized as follows. Section 2 describes a method for finding novel objects in a cluttered scene. Section 3 presents our approach to pronouncing out-of-vocabulary words using voice conversion. Section 4 describes the procedure of the extended mobile manipulation task. Section 5 presents experimental results that validate the proposed system. Section 6 discusses the proposed method, and Section 7 concludes the chapter.

<sup>1</sup> Twenty-four teams participated in the 2010 RoboCup@Home competition (6), whereas only a few teams participated in the Mobile Manipulation Challenge (7) and the Semantic Robot Vision Challenge (8).


#### **2. Finding novel objects in a cluttered scene**

#### **2.1 3D visual sensor**

Figure 3 shows the 3D visual sensor used in this chapter. The sensor acquires color and accurate depth information in real time by calibrating a time-of-flight (TOF) camera against two CCD cameras.

Fig. 3. (a) 3D visual sensor. (b) Color image (1024 × 768). (c) Depth image (176 × 144). (d) Mapped color image (176 × 144).

The TOF camera measures distance based on the time-of-flight principle: the time taken for light to travel from an active illumination source to the objects in the field of view and back to the sensor. A SwissRanger SR4000 TOF camera (23) is used as part of the 3D visual sensor. It emits modulated near-infrared (NIR) light, and its CMOS/CCD imaging sensor measures the phase delay of the returned modulated signal at each pixel. These measurements yield a 176 × 144 pixel depth map.
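For reference (a standard relation for continuous-wave TOF sensors, not stated explicitly in the chapter), the distance at each pixel follows from the measured phase delay and the modulation frequency:

$$d = \frac{c \, \varphi}{4 \pi f\_{\mathrm{mod}}}$$

where $c$ is the speed of light, $\varphi$ the measured phase delay, and $f\_{\mathrm{mod}}$ the modulation frequency. The unambiguous range is $c / (2 f\_{\mathrm{mod}})$; a 30 MHz modulation, for instance, gives about 5 m.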

In geometric camera calibration, the parameters that express camera pose and properties are classified into extrinsic parameters (i.e., rotation and translation) and intrinsic ones (i.e., focal length, lens distortion coefficients, optical center, and pixel size). The extrinsic parameters represent the camera's position and pose in 3D space, while the intrinsic parameters are needed to project a 3D scene onto the 2D image plane. We use Zhang's calibration method in the proposed system, since this technique only requires the camera to observe a checkerboard pattern shown at a few different orientations. For the calibration of the TOF camera, the reflected signal amplitude can be used to observe the checkerboard pattern; it is therefore straightforward to apply the same calibration method. Figure 3 (b), (c), and (d) show images captured from the visual sensor.
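As an illustration, the sketch below runs Zhang's method via OpenCV's standard calibration calls on a set of checkerboard views; for the TOF camera, the amplitude image would simply be fed in place of a CCD frame, as described above. The board geometry and file paths are assumptions, not values from the chapter.

```python
# Sketch of Zhang's checkerboard calibration via OpenCV. For the TOF camera,
# the reflected-amplitude image is used in place of a CCD frame.
# Board size, square pitch, and file paths are illustrative assumptions.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner corners of the checkerboard
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * 0.025  # 25 mm squares

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calib/*.png"):  # views of the board at different orientations
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        term = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
        corners = cv2.cornerSubPix(gray, corners, (5, 5), (-1, -1), term)
        obj_pts.append(objp)
        img_pts.append(corners)
        size = gray.shape[::-1]

assert img_pts, "no checkerboard detections"
# Intrinsics (camera matrix K, distortion) plus per-view extrinsics (R, t).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
print("reprojection RMS [px]:", rms)
```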

#### **2.2 Motion attention based object segmentation**

Fig. 4. Segmentation of the object region using motion attention.

Assuming a user shows a target object to the robot, there may be people, objects, or furniture behind that object; the problem is object segmentation against such a complex background. Because the user holds the object in hand, it can be segmented out by taking the motion cue into account. This motivates our object segmentation based on motion attention, outlined in Fig. 4. A motion detector first extracts the initial object region *M*(*x*, *y*). Then, object information, such as the color (hue) image *H*(*x*, *y*) and the depth image *D*(*x*, *y*), is taken from this region. In particular, a hue histogram *fH*(*h*) and a depth histogram *fD*(*d*) are computed from the region and normalized, where *h* and *d* represent the quantized values of hue and depth, respectively. Since these two histograms can be considered probability density functions of the target object, the object probability maps of each component, *PD*(*x*, *y*) and *PH*(*x*, *y*), are easily obtained at each pixel location:


$$P\_D(x, y) = f\_D(D(x, y)), \tag{1}$$

$$P\_H(x, y) = f\_H(H(x, y)). \tag{2}$$

The weighted sum of these two maps gives the overall object probability map *PO*(*x*, *y*):

$$P\_O(x, y) = \text{LPF}[w\_d P\_D(x, y) + w\_h P\_H(x, y)]. \tag{3}$$

The weights *wd* and *wh* are assigned automatically, inversely proportional to the variance of each histogram: if the variance of a histogram is large, its information is considered inaccurate and its weight decreases. LPF denotes a low-pass filter, for which we use a simple 3 × 3 averaging filter. The map is binarized, and a final object mask is then obtained using connected component analysis. In the learning phase, object images are simply collected, and color histograms and SIFT features are extracted from them. These are used for object detection and recognition.
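A minimal sketch of Eqs. (1)–(3) follows, assuming hue and depth images already quantized to histogram bins. The binarization threshold and the keep-largest-component rule are illustrative assumptions, since the text does not specify them.

```python
# Sketch of Eqs. (1)-(3): back-project the hue/depth histograms of the motion
# region into per-pixel probability maps, blend with inverse-variance weights,
# low-pass filter, binarize, and keep one connected component.
import numpy as np
import cv2

def object_mask(hue, depth, motion_mask, n_bins=32, thresh=0.5):
    """hue, depth: integer images quantized to [0, n_bins); motion_mask: M(x, y)."""
    def pdf(img):
        h = np.bincount(img[motion_mask > 0].ravel(), minlength=n_bins).astype(float)
        return h / max(h.sum(), 1.0)          # normalized histogram ~ object PDF

    f_H, f_D = pdf(hue), pdf(depth)
    P_H, P_D = f_H[hue], f_D[depth]           # Eqs. (1)-(2): histogram back-projection

    def spread(f):                            # variance of the distribution
        b = np.arange(n_bins)
        mu = (b * f).sum()
        return ((b - mu) ** 2 * f).sum() + 1e-9

    w_h, w_d = 1.0 / spread(f_H), 1.0 / spread(f_D)   # inverse-variance weights
    s = w_h + w_d
    w_h, w_d = w_h / s, w_d / s
    # Eq. (3): weighted sum followed by a 3x3 averaging filter as the LPF.
    P_O = cv2.blur((w_d * P_D + w_h * P_H).astype(np.float32), (3, 3))
    # Binarize (threshold is an assumption) and keep the largest component.
    mask = (P_O > thresh * P_O.max()).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return mask
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return (labels == largest).astype(np.uint8)
```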

#### **2.3 Object detection and identification in recognition phase**

When the robot recognizes an object, the target object must first be extracted from the scene. However, the method used in the learning phase is not applicable here, because the object is placed somewhere rather than held by the user. If the objects are on a table, plane detection is useful for finding them; the 3D randomized Hough transform (24) is used for fast and accurate plane detection. The detection procedure is summarized below.

1. 3D information is captured from the scene.
2. The largest plane is detected as the table top using the randomized Hough transform (24).
3. The plane is removed from the 3D information.
4. The remaining points are projected onto the plane.
5. Connected component analysis is performed on the plane, and each object is segmented out.

SIFT descriptors are used for recognition. First, the candidates are narrowed down using color information; then the SIFT descriptors collected during the learning phase are matched. Note that the SIFT descriptors are extracted from multiple images taken from different viewpoints. Moreover, to speed up SIFT matching, the number of stored object images is reduced by matching within-class object images against each other and discarding similar ones. This process is also useful for setting the threshold on the SIFT matching score.
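The sketch below is a minimal illustration of this recognition-phase pipeline under stated assumptions: Open3D's RANSAC plane fitting and DBSCAN clustering stand in for the 3D randomized Hough transform (24) and the projection-plus-connected-components steps, and the two-pass color-then-SIFT identification follows the scheme described above. The database layout and all thresholds are assumptions, not the chapter's implementation.

```python
# Minimal sketch of the recognition-phase pipeline. Stand-ins: Open3D RANSAC
# plane fitting replaces the 3D randomized Hough transform (24), and DBSCAN
# clustering replaces the projection + connected-component step. The database
# entries ({"id", "hist", "desc"}) and thresholds are illustrative assumptions.
import numpy as np
import open3d as o3d
import cv2

def segment_tabletop_objects(points):
    """Steps 1-5: find the dominant plane, drop it, cluster what remains."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Step 2: the largest plane is taken as the table top.
    _, inliers = pcd.segment_plane(distance_threshold=0.01,
                                   ransac_n=3, num_iterations=1000)
    rest = pcd.select_by_index(inliers, invert=True)       # step 3: remove plane
    labels = np.asarray(rest.cluster_dbscan(eps=0.02, min_points=30))
    pts = np.asarray(rest.points)                          # steps 4-5: per-object blobs
    return [pts[labels == k] for k in range(labels.max() + 1)] if labels.size else []

def identify(obj_img, database, n_candidates=5):
    """Two-pass identification: color-histogram pruning, then SIFT matching."""
    hsv = cv2.cvtColor(obj_img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
    cv2.normalize(hist, hist)
    # Pass 1: keep only entries with similar color distributions.
    cand = sorted(database,
                  key=lambda e: -cv2.compareHist(hist, e["hist"], cv2.HISTCMP_CORREL)
                  )[:n_candidates]
    # Pass 2: ratio-test SIFT matching against the surviving candidates.
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(cv2.cvtColor(obj_img, cv2.COLOR_BGR2GRAY), None)
    if desc is None:
        return None, 0
    bf = cv2.BFMatcher()
    best_id, best = None, 0
    for e in cand:
        pairs = bf.knnMatch(desc, e["desc"], k=2)
        good = sum(1 for p in pairs if len(p) == 2 and p[0].distance < 0.7 * p[1].distance)
        if good > best:
            best_id, best = e["id"], good
    return best_id, best
```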



#### **3. Pronouncing out-of-vocabulary words using voice conversion**

Figure 5 shows an overview of the speech processing. The OOV segment of the user's utterance is converted into the robot's voice, since the original sound is in the user's voice and would not concatenate naturally with a synthesized voice. The voice conversion is based on Eigenvoice Gaussian mixture models (EGMMs) (12). The recognized phoneme sequence of the OOV word is not used for synthesis, since phoneme recognition accuracy is less than 90%, and the number of utterances for teaching an OOV word is virtually constrained to one owing to the time constraint of RoboCup@Home.

Fig. 5. Overview of the speech processing.
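As a rough illustration of the conversion step, the sketch below implements the classical joint-density GMM spectral mapping that eigenvoice conversion builds on; the EGMM method (12) additionally adapts the target means from a small amount of speech, which is omitted here. All names and array shapes are illustrative assumptions.

```python
# Sketch of joint-density GMM spectral mapping, the base of eigenvoice
# conversion; EGMMs (12) would additionally adapt mu_y via eigenvoices.
# Assumed shapes: weights (M,), mu_x/mu_y (M, D), cov_xx/cov_yx (M, D, D).
import numpy as np

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """MMSE mapping of one source spectral frame x to the target voice."""
    M = len(weights)
    # Posterior responsibilities P(m | x) under the source marginal GMM.
    log_p = np.empty(M)
    for m in range(M):
        d = x - mu_x[m]
        _, logdet = np.linalg.slogdet(cov_xx[m])
        log_p[m] = np.log(weights[m]) - 0.5 * (logdet + d @ np.linalg.solve(cov_xx[m], d))
    resp = np.exp(log_p - log_p.max())
    resp /= resp.sum()
    # y = sum_m P(m|x) [mu_y_m + Cov_yx_m Cov_xx_m^{-1} (x - mu_x_m)]
    y = np.zeros_like(mu_y[0])
    for m in range(M):
        y += resp[m] * (mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m]))
    return y
```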

#### **4. Procedure of extended mobile manipulation task**

In this section, we describe the procedure of the mobile manipulation task called "Supermarket".

#### **4.1 Robot platform: DiGORO**

Figure 6 shows the robot "DiGORO" that we previously developed (28). It is composed of the following hardware:

• Electric wheelchair
• KAWADA upper-body humanoid robot HIRO (two 6-DOF arms and a 1-DOF waist)
• HOKUYO laser range finder UTM-30LX
• Mesa infrared TOF camera SwissRanger
• Imaging Source CCD camera × 2
• Sanken directional microphone CS-3e
• YAMAHA loudspeaker NX-U10
• Onboard PC (Intel Core2Duo processor) × 5

Fig. 6. The robot platform "DiGORO".


