### **5. Experiments**

In this section, we conduct experiments that evaluate the proposed method and the system built on it through a task called the Supermarket task.

### **5.1 Experiment 1: Evaluation of the proposed method**

#### **5.1.1 Segmentation accuracy**

In this experiment, we evaluated segmentation accuracy. The object region must be segmented from the complex backgrounds that are typical of domestic environments. Because the object is held and moved by the user in the learning phase, we used the motion attention discussed in Section 2.
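As a rough illustration of this idea, the sketch below builds a coarse attention mask from accumulated inter-frame differences. It is a minimal stand-in for the motion attention of Section 2, which also exploits 3D measurements; the function name and threshold are our own assumptions.

```python
import numpy as np

def motion_attention_mask(frames, thresh=25.0):
    """Coarse motion-attention mask from a grayscale frame sequence.

    frames: list of 2D uint8 arrays taken while the user moves the object.
    Returns a boolean mask that is True where motion energy is high.
    """
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # accumulate absolute inter-frame differences (motion energy)
        acc += np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    acc /= len(frames) - 1                 # mean motion energy per pixel
    return acc > thresh                    # the held, moving object stands out
```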

The experiment was carried out in an ordinary living room, shown in Fig.9. We used 120 ordinary objects, shown in Fig.10. A user taught the robot each object by showing it to the robot and telling it the object's name. The robot acquired 40 consecutive frames for each object and extracted the target object region from each image. Figure 11 shows examples of object segmentation.


We extracted 10 of the 40 frames per object to evaluate segmentation accuracy. Detection accuracy was measured using recall and precision, which are widely used for evaluating classification, as shown in Fig.12(a), since the pixels can be regarded as being classified into two classes: object region and non-object region. Here, the "object region" denotes the manually labeled object region.

Fig. 9. Experimental environment.

Fig. 10. The 120 objects used in the experiments.

Fig. 11. Examples of object segmentation.

Fig. 12. (a) Definitions of recall and precision. (b) Results of object detection.

Fig. 13. Examples of object segmentation failure. (a) An object that could not be extracted at all. (b) Left: object; right: segmentation result. (c) Left: object; right: segmentation result.



Figure 12(b) shows a 2D plot of recall vs. precision, where each point is the average over the 10 frames of a single object. Averaged over all objects, recall was 76.2% and precision was 95.8%. The high precision indicates that the inside of the extracted regions belongs to the object almost everywhere; therefore, the segmentation results should not negatively affect object recognition.
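For concreteness, the following sketch computes the pixel-wise recall and precision of Fig.12(a) for one frame; the per-object points in Fig.12(b) are then simply these values averaged over the 10 evaluated frames. The function name is ours.

```python
import numpy as np

def recall_precision(pred, truth):
    """Pixel-wise recall and precision for binary masks.

    pred:  boolean array, True where the system extracted the object.
    truth: boolean array, True in the manually labeled object region.
    """
    tp = np.logical_and(pred, truth).sum()   # correctly extracted pixels
    recall = tp / truth.sum()                # share of the object recovered
    precision = tp / pred.sum()              # share of extraction that is object
    return recall, precision
```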

Figure 13 shows examples of segmentation failure. The object in Fig.13(a) was not segmented at all because its entire surface reflected near-infrared rays, which caused the 3D measurement to fail. Only part of the object in Fig.13(b) was segmented because its black regions absorbed near-infrared rays. Likewise, only part of the object in Fig.13(c) was segmented because it reflected near-infrared rays partially. We can see that black and metallic objects tend to cause low recall.


#### **5.1.2 Object recognition accuracy**

We used the 120 common objects that the robot had learnt, as mentioned in the previous subsection. Three locations with different lighting conditions were selected in the living room, and each object, segmented out using motion attention, was recognized. The results are listed in Table 1. The average recognition rate was about 90%. A major source of error was false recognition between similar kinds of objects, such as cup noodles with different flavors, because those objects have similar texture.

|                  | Place 1 | Place 2 | Place 3 |
|------------------|---------|---------|---------|
| Recognition rate | 91%     | 89%     | 89%     |

Table 1. Object recognition rates.

Next, we evaluated the proposed recognition method on the COIL-100 database (29), which consists of 100 objects with 72 images per object. For each object, 36 images were used for learning and the other 36 for recognition. The recognition rate was 97.6%.
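The split-based evaluation can be sketched as follows. The alternating view split and the nearest-neighbour classifier below are placeholders of ours; the chapter's actual recognizer uses colour and SIFT features.

```python
import numpy as np

def split_views(views):
    """Split an object's 72 COIL-100 views into 36 training / 36 test views
    (here simply by alternating view indices; the actual split may differ)."""
    return views[0::2], views[1::2]

def evaluate(objects, features):
    """objects: {label: stack of 72 views}; features: image -> 1D vector."""
    train, test = [], []
    for label, views in objects.items():
        tr, te = split_views(views)
        train += [(features(v), label) for v in tr]
        test += [(features(v), label) for v in te]
    correct = 0
    for feat, label in test:
        # placeholder recognizer: 1-nearest neighbour in feature space
        dists = [np.linalg.norm(feat - f) for f, _ in train]
        correct += train[int(np.argmin(dists))][1] == label
    return correct / len(test)   # recognition rate, cf. the reported 97.6%
```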

#### **5.1.3 Recognition accuracy of out-of-vocabulary words**

We evaluated the recognition accuracy of OOV words. The experimental procedure was as follows. A teacher taught the robot OOV words with utterances such as "This is X". In a domestic environment, the teacher may not be a single person; family members or friends may also teach the robot. Therefore, we conducted the experiment under the condition that OOV words are taught by several teachers, including the user who asked the robot to bring something. For comparison, we also conducted the experiment under simpler conditions. In each condition, volunteers uttered sentences of the form "Bring me X", covering 120 words, and the robot recognized X. There were eight volunteers, and 960 utterances were recognized in total. The distance between the volunteer and the microphone was 50 cm. The ambient noise level was set to 55 dBA, which simulates the standard noise level in the RoboCup@Home competition when there is no other noise source such as an announcement. If the speech recognition system works in 55 dBA noise, it can also work in a domestic environment. The recognition rate was calculated from these utterances.


Figure 14(a) shows the recognition rate in each condition. The details of each condition are as follows:

**1. Recognition with correct phonemes:** The correct phonemes of the 120 words were manually registered in the dictionary. Each volunteer uttered "Bring me X" (X is an object name), and the robot recognized the object name.

**2. Teacher and user are the same person:** Each volunteer uttered 120 sentences of the form "This is X" (X is an object name), and the robot learned the 120 OOV words. The robot then recognized the 120 OOV words spoken by the same volunteer who had taught them.

**3. Teachers taught the OOV words:** First, the 120 words were randomly assigned to the eight teachers, who taught them to the robot. Then the robot recognized the 120 OOV words spoken by a user who was one of the teachers; therefore, the words were not always taught by that user. Of the 960 utterances, 118 were spoken by the teacher of the word (teacher and user were the same) and 842 were spoken by others (teacher and user were different).

Fig. 14. (a) Recognition results. (Condition 1: recognition with correct phonemes. Condition 2: teacher and user are the same person. Condition 3: teachers taught the OOV words.) (b) Evaluation of voice conversion. The CMOS of VC was 1.45.
The recognition rate was 95.2% in Condition 1, as shown in Fig.14(a). In Condition 2, phoneme accuracy was 69.3% and the recognition rate was 82.4%; a recognition rate above 80% is satisfactory in practical situations. In Condition 3, the recognition rate was 75.2%: 83.4% when the teacher was the same as the user, and 74.1% when the teacher was different. Note that the speech files used for training and those used for testing were different even when the trainer and the tester were the same person. The recognition rate was lower than in Condition 2, but this is not a problem if restating is allowed.
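The overall Condition 3 rate is consistent with the same-teacher/different-teacher breakdown: weighting the two sub-rates by the number of utterances in each group recovers the reported figure.

```latex
\frac{118 \times 0.834 + 842 \times 0.741}{960} \approx 0.752
```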

#### **5.1.4 Quality evaluation of robot's utterances**

The objective of this experiment was to evaluate the quality of the robot's utterances. The experimental procedure is described below.

First, we made a database of 960 utterances containing 120 unique words, each uttered by the eight volunteers. The ambient noise level was 55 dBA, and the distance between the volunteer and the microphone was 50 cm. Next, the robot's utterances were generated using the proposed method and, for comparison, a baseline method. The two methods are summarized as follows:

**Voice Conversion (VC) (proposed):** The utterances in the database are converted to the robot's voice by using EGMMs (12) (the details of the proposed method were explained in Section 3).


**Text-To-Speech (TTS) (baseline):** The phoneme sequences obtained by phoneme recognition are used to generate robot utterances.
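To make the VC side concrete, the sketch below shows the classical GMM-based spectral mapping that eigenvoice conversion builds on: each source frame is mapped by a posterior-weighted sum of per-mixture linear regressions. This is a generic illustration, not the EGMM method of (12); all variable names are ours.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Map one source spectral frame x (D,) into the target speaker's space.

    weights: (M,) mixture weights; mu_x, mu_y: (M, D) source/target means;
    S_xx: (M, D, D) source covariances; S_yx: (M, D, D) cross covariances.
    """
    M = len(weights)
    resp = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(S_xx[m])
        # unnormalised posterior p(m | x) under the source-side marginal
        resp[m] = weights[m] * np.exp(-0.5 * diff @ inv @ diff) \
                  / np.sqrt(np.linalg.det(2 * np.pi * S_xx[m]))
    resp /= resp.sum()
    # posterior-weighted per-mixture linear regression onto the target space
    y = np.zeros_like(mu_y[0], dtype=float)
    for m in range(M):
        y += resp[m] * (mu_y[m] + S_yx[m] @ np.linalg.inv(S_xx[m]) @ (x - mu_x[m]))
    return y
```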

We then formed another group of six volunteers to evaluate the quality of generated utterances. Each volunteer listened to the utterances generated using TTS and VC. These utterances were composed of 120 unique words. The order of words was chosen at random. The order of TTS and VC samples was also chosen at random for each trial.

The comparison mean opinion score (CMOS) was used for evaluation. CMOS is specified by ITU-T recommendation P.800 (30). In the field of speech synthesis, CMOS is used for comparing voices synthesized with two methods. Specifically, the evaluation was conducted using the following questionnaire.

(The volunteer listens to two robot utterances.) Do you think the former is more accurate than the latter in terms of pronunciation?

The evaluation scale and its scores are listed in Table 2.

| Quality         | Score |
|-----------------|-------|
| Much better     | 3     |
| Better          | 2     |
| Slightly better | 1     |
| About the same  | 0     |
| Slightly worse  | -1    |
| Worse           | -2    |
| Much worse      | -3    |

Table 2. CMOS evaluation and scores.

The evaluation results are shown in Fig.14(b). The CMOS of VC was 1.45, which suggests that VC is preferred. We can see that the proposed VC-based method is effective even when the learnt word has been uttered only once.
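As a minimal sketch, the CMOS reported above is simply the mean of the Table 2 scores over all listener judgements (positive values favour VC under this sign convention):

```python
# Table 2 rating scale mapped to signed scores.
SCORES = {"Much better": 3, "Better": 2, "Slightly better": 1,
          "About the same": 0, "Slightly worse": -1,
          "Worse": -2, "Much worse": -3}

def cmos(ratings):
    """Mean signed preference over all trials; e.g. a value of 1.45 means
    VC was judged between 'slightly better' and 'better' on average."""
    return sum(SCORES[r] for r in ratings) / len(ratings)
```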

### **5.2 Experiment 2: Evaluation of the applied system in mobile manipulation**

We implemented the integrated audio-visual processing system on DiGORO and performed an experiment in a living room. The purpose of this experiment was to evaluate the robot with the proposed method applied in mobile manipulation. We chose the task called "Supermarket" from the RoboCup@Home league. RoboCup@Home has several advantages: the competition has a large number of participants, its rules are clearly stated and open to the public, and the rules are improved annually.

#### **5.2.1 Experimental setup**

Figure 15 illustrates a map generated by DiGORO's own on-board SLAM module. The locations of the tables and the shelf are also shown.

Fig. 15. The map and locations of the tables/shelves.

We designed the task module according to the flow in Section 4.3. A volunteer first interacted with the robot at the start position. The robot then navigated to a table/shelf, recognized the specified object, grasped it, and came back to the volunteer. This process was repeated for three objects.
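The control flow can be sketched as a simple sequential loop; the robot API below is hypothetical and only illustrates the order of the phases timed in Fig.17.

```python
def run_supermarket_task(robot, n_objects=3):
    """One trial of the Supermarket task (hypothetical robot API)."""
    for _ in range(n_objects):
        name = robot.listen_for_request()              # volunteer: "Bring me X"
        robot.confirm(f"I will bring {name}. Is this correct?")
        robot.navigate_to(robot.location_of(name))     # go to a table/shelf
        grasp_target = robot.detect_and_recognize(name)
        robot.grasp(grasp_target)
        robot.navigate_to("start position")            # return to the volunteer
        robot.hand_over(name)
```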


We conducted the task under two conditions: one similar to a real competition and one more difficult. The speech recognition dictionary was changed between conditions, because the person who teaches an object to the robot may not be the one who later requests it. The details of the conditions are as follows.

**Condition 1:** In the learning phase, each volunteer taught the robot the objects' names. The same volunteer asked the robot to bring the objects in the execution phase.

**Condition 2:** In the learning phase, the 120 words were randomly assigned to the eight volunteers, who taught them to the robot. Each volunteer then asked the robot to bring objects in the execution phase. Therefore, the names of the objects to bring were not always taught by the same volunteer who commanded the robot in the execution phase.

In the two experimental setups, five volunteers who had no prior knowledge of the robot conducted the task.

Therefore, the robot was supposed to bring 30 objects in total throughout this experiment. In each task, 30 out of the 120 objects were randomly chosen; the training data for these objects had been obtained in Experiment 1.

#### **5.2.2 Experimental results**

We evaluated the results from three viewpoints: the success rate of each process, the elapsed time of each process, and the score as a measure of overall performance.

Figure 16 shows the success rate of each process. Success rates over 90% were obtained for every process except speech recognition, which was 93% in Condition 1 but 80% in Condition 2 because the phoneme sequences in the lexicon were not accurate.

Fig. 16. Success rates. (Condition 1: words are taught by the same person as the requester. Condition 2: words are taught by different volunteers.)

Figure 17 depicts the average elapsed time of each process (per object). The results suggest that a trial can be completed within 10 min (the per-object elapsed time should be tripled, with 60 s added for instructing the robot). The instruction phase took a long time, owing to confirmations from the robot such as "I will bring X. Is this correct?" and to the volunteer restating "Bring me X" when the robot failed to recognize the object name. The instruction phase in Condition 1 was shorter than in Condition 2 because false recognitions were fewer. The figure also shows that the object recognition phase took longer in Condition 2 than in Condition 1: since object locations were chosen randomly in both conditions, finding objects in Condition 2 happened to take longer, depending on their locations.

Fig. 17. Elapsed time of each process. (Condition 1: words are taught by the same person as the requester. Condition 2: words are taught by different volunteers.)
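The ten-minute estimate follows directly from the rule stated above: if $t$ is the average elapsed time per object, the trial time is $3t + 60$ s, so

```latex
3t + 60\ \mathrm{s} \le 600\ \mathrm{s} \;\Longrightarrow\; t \le 180\ \mathrm{s}.
```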

Next, we evaluated the task scores as a reference. Note that comparing scores may be unfair because a laboratory differs from a competition environment. Nevertheless, we used the scores because they are the only available basis for comparing different robots on the same realistic task.

Figure 18 compares our scores with those of teams that participated in the actual 2009 competition. The average score in Condition 1 was 1560; by this score, DiGORO would outperform the best team in the competition. The average score in Condition 2 was 1320, comparable to Team A. Note, however, that we used average scores, whereas a team could perform the task only once in the competition; in that respect, the comparison may be unfair.

Fig. 18. Score comparison. (Condition 1: words are taught by the same person as the requester. Condition 2: words are taught by different volunteers.)

In the actual competition, the three objects that the robot had to bring were selected from ten common objects whose names were listed and given to the teams in advance, so it was possible to manually register the names of all the objects in the dictionary. In our experiment, by contrast, the objects were chosen from 120 objects, and no manual step was included in the learning process. Considering these conditions, our robot obtained promising results even though the environment differed from the competition.

### **6. Discussion**


#### **6.1 Image processing**

Here we discuss the results of the segmentation accuracy evaluation. Precision was 95.8%, which indicates that the inside of the object region was extracted correctly. Recall, at 76.2%, was lower, which indicates that sometimes only part of the object region was segmented out. This happens when the TOF camera cannot capture 3D information because of the object's material: black objects absorb near-infrared rays and metallic objects reflect them, so no 3D information can be captured from them. We believe this can be improved by using stereo vision; DiGORO (Fig.6) has two CCD cameras and can compute stereo disparity with them.

We now discuss the results of object recognition. The object recognition rate was about 90%. We used color and SIFT features, and it is generally difficult to recognize objects that share the same color and/or lack texture. For future work, we plan to use an object recognition method that integrates 3D shape information (31), which can significantly improve recognition performance.
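As a sketch of the stereo fallback suggested above for black and metallic objects, disparity can be computed from DiGORO's two CCD cameras with a standard block matcher. Rectified input is assumed, and the parameter values are illustrative, not the authors':

```python
import cv2

def disparity_map(left_gray, right_gray):
    """Dense disparity from a rectified grayscale stereo pair (OpenCV)."""
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disp = matcher.compute(left_gray, right_gray)  # fixed-point, scaled by 16
    return disp.astype("float32") / 16.0           # disparity in pixels
```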

#### **6.2 Learning and recognition of OOV words**

In this research, the robot learned OOV words from a single user utterance, after which it could both recognize and utter them. The recognition rate was 82.4%, and the utterances were judged better than those of the baseline method, which means a practical system was constructed. Recognition failures occurred because false phonemes were learnt in the learning phase. The recognition rate can be improved by letting the user confirm, after learning, whether the phonemes were learnt correctly. For example, the user utters "This is X" and the robot learns the object. The user then checks whether "X" was registered correctly by asking "Did you memorize X?" If the robot utters "Yes, I memorized X′" and X′ = X, the OOV word was registered correctly. Otherwise, the OOV word may not have been registered correctly, and the user can teach the object name to the robot again.
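A minimal sketch of this confirmation loop, with a hypothetical robot API:

```python
def teach_with_confirmation(robot, max_attempts=3):
    """Teach an object name and let the user verify the learnt phonemes."""
    for _ in range(max_attempts):
        learned = robot.learn_oov_word()        # user: "This is X"
        # user: "Did you memorize X?" -- the robot answers with what it learnt
        robot.say(f"Yes, I memorized {learned}.")
        if robot.user_confirms():               # True iff X' matches the intended X
            return learned                      # phonemes registered correctly
        # otherwise the user teaches the object name again
    return None
```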

#### **6.3 Evaluation in domestic environment**

We evaluated the system in a domestic environment using the Supermarket task, one of the tasks in the RoboCup@Home league. Let us briefly discuss the choice of evaluation task. As mentioned earlier, it is difficult to determine which task should be used for evaluation, and no globally standardized tasks exist for this purpose. This makes it very difficult to evaluate robots developed by different groups on the same realistic task. We cannot compare our robot with others using a self-defined task, since it is almost impossible to rebuild their robots from scratch. Therefore, we think globally standardized tasks are needed.


In this research, we propose to utilize the task format of RoboCup@Home, since we strongly believe that its tasks are the most standard tasks for evaluating robots, for the following reasons:

1. The rules are open to the public.
2. Many teams from around the world participate, i.e., the task has already been performed by many robots.
3. The rules have been improved by many robotics researchers.
Unfortunately, the comparison of the scores in its current form is not entirely fair, so the score should be treated as a reference. Even as a reference, DiGORO outperforms the best team that participated in the competition, which shows that DiGORO can perform well in a domestic environment. Any deduction in points resulted from the robot not recognizing what the user wanted it to bring; this can be improved by user confirmation in the learning phase, as mentioned above.

The learning and recognition of OOV words can be applied to other tasks. For example, the "Who is who?" task in RoboCup@Home involves learning human faces and names: a user utters "My name is X" and the robot learns "X" as his/her name. With this method, we can deal with a vast number of names.

Furthermore, DiGORO has many other abilities and can carry out eight other tasks, for example "Follow Me", in which it follows a human, and "Shopping Mall", in which it learns locations in an unknown place. These advanced features led our team to first place at RoboCup@Home 2010, which suggests that DiGORO can work stably in a domestic environment.

### **7. Conclusion**

We proposed a practical method for learning novel objects. With this method, a robot can learn a word from a single utterance. The robot can utter an OOV word by segmenting the word from a template sentence and applying voice conversion, and the object region is extracted from a complicated scene while the user moves the object. We implemented all of this on a robot as an object learning system and evaluated it by conducting the Supermarket task. The experimental results show that our robot, DiGORO, works stably in a real environment.

### **8. References**


[1] T. Inamura, K. Okada, S. Tokutsu, N. Hatao, M. Inaba, and H. Inoue, "HRP-2W: A humanoid platform for research on support behavior in daily life environments," Robotics and Autonomous Systems, vol.57, no.2, pp.145–154, 2009.

[2] K. Wyrobek, E. Berger, H. Van der Loos, and J. Salisbury, "Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot," IEEE International Conference on Robotics and Automation, pp.2165–2170, 2008.

[3] F. Weisshardt, U. Reiser, C. Parlitz, and A. Verl, "Making High-Tech Service Robot Platforms Available," Proceedings-ISR/ROBOTIK 2010, pp.1115–1120, 2010.

[4] J. Stückler and S. Behnke, "Integrating indoor mobility, object manipulation, and intuitive interaction for domestic service tasks," IEEE-RAS International Conference on Humanoid Robots, pp.506–513, 2009.

[5] D. Holz, J. Paulus, T. Breuer, G. Giorgana, M. Reckhaus, F. Hegger, C. Müller, Z. Jin, R. Hartanto, P. Ploeger, et al., "The b-it-bots RoboCup@Home 2009 team description paper," RoboCup 2009@Home League Team Descriptions, Graz, Austria, 2009.

[6] "RoboCup@Home," http://www.ai.rug.nl/robocupathome/, 2010.

[7] "2010 Mobile Manipulation Challenge," http://www.willowgarage.com/mmc10, 2010.

[8] "Semantic Robot Vision Challenge," http://www.semantic-robot-vision-challenge.org/, 2009.

[9] I. Bazzi and J. Glass, "A multi-class approach for modelling out-of-vocabulary words," Seventh International Conference on Spoken Language Processing, pp.1613–1616, 2002.

[10] M. Nakano, N. Iwahashi, T. Nagai, T. Sumii, X. Zuo, R. Taguchi, T. Nose, A. Mizutani, T. Nakamura, M. Attamim, et al., "Grounding New Words on the Physical World in Multi-Domain Human-Robot Dialogues," 2010 AAAI Fall Symposium Series, pp.74–79, 2010.

[11] H. Holzapfel, D. Neubig, and A. Waibel, "A dialogue approach to learning object descriptions and semantic categories," Robotics and Autonomous Systems, vol.56, no.11, pp.1004–1013, 2008.

[12] T. Toda, Y. Ohtani, and K. Shikano, "One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices," IEEE International Conference on Acoustics, Speech and Signal Processing, vol.4, pp.1249–1252, 2007.

[13] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol.23, no.3, pp.309–314, 2004.

[14] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.22, no.8, pp.888–905, 2000.

[15] A.K. Mishra and Y. Aloimonos, "Active Segmentation," International Journal of Humanoid Robotics, vol.6, pp.361–386, 2009.

[16] S. Hasler, H. Wersing, S. Kirstein, and E. Körner, "Large-Scale Real-Time Object Identification Based on Analytic Features," Artificial Neural Networks–ICANN 2009, pp.663–672, 2009.

[17] H. Kim, E. Murphy-Chutorian, and J. Triesch, "Semi-autonomous learning of objects," Computer Vision and Pattern Recognition Workshop, p.145, 2006.

[18] H. Wersing, S. Kirstein, M. Gotting, H. Brandl, M. Dunn, I. Mikhailova, C. Goerick, J. Steil, H. Ritter, and E. Korner, "Online learning of objects in a biologically motivated visual architecture," International Journal of Neural Systems, vol.17, no.4, pp.219–230, 2007.

[19] N. Iwahashi, "Robots that learn language: Developmental approach to human-machine conversations," Symbol Grounding and Beyond, pp.143–167, 2006.

[20] D. Roy, "Grounding words in perception and action: computational insights," Trends in Cognitive Sciences, vol.9, no.8, pp.389–396, 2005.

[21] M. Fujita, R. Hasegawa, T. Takagi, J. Yokono, and H. Shimomura, "An autonomous robot that eats information via interaction with humans and environments," IEEE International Workshop on Robot and Human Interactive Communication, pp.383–389, 2002.

[22] M. Johnson-Roberson, G. Skantze, J. Bohg, J. Gustafson, R. Carlson, and D. Kragic, "Enhanced Visual Scene Understanding through Human-Robot Dialog," 2010 AAAI Fall Symposium on Dialog with Robots, pp.143–144, 2010.
