**3. Pronouncing out-of-vocabulary words using voice conversion**

Figure 5 shows a schematic of the method's speech processing, which uses an automatic speech recognition (ASR) system called ATRASR (25). ATRASR is a hidden Markov model (HMM)-based speech recognizer that serves as both a front-end and a word/phoneme decoder; the phoneme decoder is used to obtain the phoneme sequences of OOV words. Both word- and phoneme-level speech recognition are therefore possible.
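ATRASR's actual decoder interface is not shown in this chapter; purely as an illustration of the two-level idea, a toy decoder with a hypothetical lexicon might fall back to the phoneme sequence when the word decoder has no entry:

```python
# Illustrative sketch only: a word decoder backed by a toy lexicon, with the
# phoneme decoder as fallback for out-of-vocabulary (OOV) input. The lexicon
# entries and data layout are hypothetical, not ATRASR's real interface.

LEXICON = {"bring": "b r ih ng", "cup": "k ah p"}  # word -> phoneme string

def decode(tokens):
    """Return (word, phonemes) pairs; OOV tokens keep only their phonemes."""
    result = []
    for t in tokens:
        if t["word"] in LEXICON:                  # in-vocabulary: word level
            result.append((t["word"], LEXICON[t["word"]]))
        else:                                     # OOV: phoneme level only
            result.append((None, t["phonemes"]))
    return result

hyp = decode([{"word": "bring",  "phonemes": "b r ih ng"},
              {"word": "digoro", "phonemes": "d ih g ow r ow"}])
# hyp[0] is a known word; hyp[1] carries only a phoneme sequence
```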

To suppress noise, a particle filter is first applied for online estimation of non-stationary noise, and minimum mean square error (MMSE) estimation is then used for noise reduction (26). Voice activity detection is performed by endpoint detection (EPD) based on frame energy. This noise reduction stage is critical in RoboCup@Home tasks, where noise conditions are severe.
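Energy-based EPD can be sketched in a few lines; the frame length and threshold below are illustrative values, not the ones used in the actual system:

```python
import math

def frame_energy(samples, frame_len=160):
    """Log energy (dB) per non-overlapping frame; frame_len is illustrative."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [10 * math.log10(sum(x * x for x in f) / len(f) + 1e-12) for f in frames]

def endpoints(energies, threshold_db):
    """First and last frame whose energy exceeds the threshold, or None."""
    voiced = [i for i, e in enumerate(energies) if e > threshold_db]
    return (voiced[0], voiced[-1]) if voiced else None

# toy signal: silence, a tone burst, silence
sig = [0.0] * 320 + [math.sin(0.3 * i) for i in range(480)] + [0.0] * 320
e = frame_energy(sig)
print(endpoints(e, -40))  # → (2, 4)
```

In practice the threshold would be set relative to the estimated noise floor from the MMSE stage rather than fixed.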

Acoustic models (AMs) for the speech recognizer consist of "clean AMs" (male and female voices), trained using only clean speech, and "noisy AMs" (male and female voices), trained using clean speech mixed with noise. This makes the speech recognition system robust in noisy environments.
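Training data for the noisy AMs can be produced by mixing clean speech with noise at a chosen signal-to-noise ratio; the chapter does not state the SNRs actually used, so the value below is only an example:

```python
import math, random

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR (in dB)."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [random.gauss(0, 1) for _ in range(1000)]
noisy = mix_at_snr(clean, noise, 10)  # one 10 dB training copy (example SNR)
```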

We use template-based segmentation of words. To teach the robot an OOV word, the user says a template sentence such as "This is X". For practical use, a standard template sentence is reasonable, since it makes clear to users how to teach the robot a word. The pair of segmented voice and phoneme sequence is registered in a database; the phoneme sequence is used for utterance recognition of the OOV word.
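Extracting the OOV slot from a template sentence can be sketched as simple pattern matching; the template set, phoneme string, and file name below are hypothetical:

```python
import re

# Hypothetical slot extraction: match a teaching template such as
# "This is X" and register the OOV slot. The template list is illustrative,
# not the set used by the actual system.
TEMPLATES = [re.compile(r"^this is (?P<oov>.+)$"),
             re.compile(r"^my name is (?P<oov>.+)$")]

def extract_oov(transcript):
    """Return the OOV slot of a matching template, or None."""
    for pat in TEMPLATES:
        m = pat.match(transcript.lower())
        if m:
            return m.group("oov")
    return None

database = {}
word = extract_oov("This is Digoro")
if word:
    # phoneme sequence and audio segment name are placeholders
    database[word] = {"phonemes": "d ih g ow r ow", "audio": "seg_0001.wav"}
```

In the real system the segmentation happens on the audio side as well: the speech aligned with the "X" slot is cut out and stored alongside the phoneme sequence.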

For generating an utterance containing an OOV word, the proposed method first converts the segmented voice recorded when the OOV word was learnt. The rest of the utterance is synthesized using XIMERA (27), a text-to-speech (TTS) system. The OOV word part is converted into the robot's voice, since the original sound is the user's voice and would not concatenate naturally with a synthesized voice. The voice conversion is based on eigenvoice Gaussian mixture models (EGMMs) (12). The recognized phoneme sequence of the OOV word is not used for synthesis, since phoneme recognition accuracy is below 90%, and the number of utterances for teaching an OOV word is practically constrained to one owing to the time limits of RoboCup@Home.

Fig. 5. Overview of the speech processing.

Fig. 6. The robot platform "DiGORO".

5. Connected components analysis is performed on the plane and each object is segmented out.

SIFT descriptors are used for recognition. First, the candidates are narrowed down using color information, followed by matching of the SIFT descriptors collected during the learning phase. Note that the SIFT descriptors are extracted from multiple images taken from different viewpoints. Moreover, the number of object images is reduced to speed up SIFT matching by matching among within-class object images and discarding similar ones. This process is also useful for deciding the threshold on the SIFT matching score.
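The matching-score threshold mentioned above can be sketched as a nearest-neighbour distance test over descriptors; the 2-D vectors and threshold below are toy values (real SIFT descriptors are 128-dimensional, and the actual threshold would come from the within-class matching step):

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_count(query, database, threshold):
    """Count query descriptors whose best database match is under threshold."""
    return sum(1 for q in query
               if min(l2(q, d) for d in database) < threshold)

db = [(0.0, 1.0), (1.0, 0.0)]   # toy database descriptors
q  = [(0.1, 0.9), (5.0, 5.0)]   # toy query descriptors
print(match_count(q, db, 0.5))  # → 1
```

An object hypothesis would then be accepted when the match count over its candidate images exceeds a decision threshold.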
