#### **3.3 Speech synthesis: SSML 1.0 and 1.1**

Another effort defined how to control a speech synthesis, or text-to-speech (TTS), engine, in order to help the engine render a textual prompt as accurately as possible. The XML markup language for this purpose is the Speech Synthesis Markup Language Version 1.0 (SSML 1.0 [20]), which became a W3C Recommendation in September 2004.

**Figure 4** shows the five major processing steps present in all TTS engines. For each of them, the engine has a default behavior, called "non-markup behavior" in the figure. SSML markup allows the author to improve on this default rendering by means of elements of the language. Each element relates to one specific processing step and is interpreted as a request by the author to perform some specific processing; it is then up to the processor to determine whether and in what way to realize the request.

The SSML example in **Figure 5** shows a prompt for a flight information system structured into a single paragraph (<p> element) and two sentences (<s> elements). Acronyms are replaced by their expanded versions (<sub>), pauses are added (<break>), and a time expression is explicitly labeled (<say-as>) to select the correct way of reading it. Other elements can change additional features, such as the prosody of the speech, for example its pitch and rate (<prosody>), or the speaking voice (<voice>).
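As a rough illustration of the elements just listed, the following Python sketch embeds a small SSML 1.0 document and lists the markup requests it contains. It is not the document of Figure 5: the prompt text, the interpret-as and format values of <say-as>, and the helper names `local_name` and `list_requests` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical prompt for a flight information system (not the text of Figure 5).
# The say-as attribute values are assumptions; SSML leaves their set to the processor.
SSML_DOC = """<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>Flight <sub alias="Alitalia">AZ</sub> 610 to Rome is now boarding.</s>
    <break time="500ms"/>
    <s>Departure is at <say-as interpret-as="time" format="hms24">14:35</say-as>.</s>
  </p>
</speak>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_requests(ssml_text):
    """Return (element, attributes, text) for every element below <speak>."""
    root = ET.fromstring(ssml_text)
    found = []
    for elem in root.iter():  # document order, root first
        if elem is root:
            continue
        text = (elem.text or "").strip()
        found.append((local_name(elem.tag), dict(elem.attrib), text))
    return found


requests = list_requests(SSML_DOC)
for name, attrs, text in requests:
    print(name, attrs, repr(text))
```

Each tuple corresponds to one request to the synthesis processor. A real engine would map these requests onto its processing steps and remains free to decide how to honor them.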

Standardization work on SSML 1.0 [21] then continued, with the aim of extending SSML to more of the world's languages, in particular Asian and Indian languages. Three workshops were held to encourage local companies and universities to propose features to be added to the language:

• Nov 2005 at Beijing (China)

• May 2006 at Crete (Greece)

• Jun 2007 at Hyderabad

A new standard, SSML 1.1 [24], was released in September 2010; see Appendix F of [24] for details on the changes. Among them, a <token> element was introduced for languages in which whitespace does not behave as a simple word delimiter, such as Mandarin, Japanese, Thai, Vietnamese, and Urdu.

*Speech Standards: Lessons Learnt*

*DOI: http://dx.doi.org/10.5772/intechopen.93134*

*Human 4.0 - From Biology to Cybernetic*

#### **Figure 4.** *SSML support for stages of speech synthesis.*


#### **Figure 5.** *A simple SSML document.*

#### **3.4 Pronunciation lexicon: PLS 1.0**

Both speech grammars and synthesized prompts can require customizing the pronunciation of words in a specific application domain. This is often done by adding a user lexicon. The Pronunciation Lexicon Specification (PLS 1.0 [25]) was created to support the definition of a standard lexicon fully interoperable with SRGS 1.0 and SSML 1.0/1.1. PLS 1.0 became a W3C Recommendation in October 2008.

A PLS document is a container of entries (<lexeme> elements). Each entry has a textual part, given by <grapheme> elements, and either textual replacements, given by <alias> elements, or phonetic transcriptions, given by <phoneme> elements. An entry can hold multiple pronunciations to accommodate different ways of speaking a word or token, and multiple graphemes to cover different spellings with the same pronunciation.

A simple PLS 1.0 document example for a flight application is shown in **Figure 6**. For "Alitalia" and "Lufthansa," the pronunciations inside the <phoneme> element are given in IPA (International Phonetic Alphabet) [34], a standard way to express the pronunciations of all spoken human languages. Moreover, each of the two lexemes has two pronunciations: the first is the usual English one, while the second is closer to the original language (Italian and German, respectively) as spoken by a native speaker of that language. The prefer attribute indicates which pronunciation has to be selected for TTS rendering. For ASR, all the pronunciations will be taken into account.
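The different treatment of pronunciations by TTS and ASR can be sketched in Python as follows. This is a minimal illustration, not the document of Figure 6: the IPA strings are rough approximations chosen for this example, the choice of which pronunciation carries prefer is arbitrary, and the helper names `load_lexicon`, `tts_pronunciation`, and `asr_pronunciations` are hypothetical. The fallback to the first listed pronunciation when none is marked as preferred is how this sketch reads the PLS rule.

```python
import xml.etree.ElementTree as ET

PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Illustrative lexicon; the IPA strings are approximations, not those of Figure 6.
PLS_DOC = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Alitalia</grapheme>
    <phoneme>ˌælɪˈtæliə</phoneme>
    <phoneme prefer="true">aliˈtalja</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Lufthansa</grapheme>
    <phoneme>ˈlʊftˌhænzə</phoneme>
    <phoneme>ˈlʊfthanza</phoneme>
  </lexeme>
</lexicon>
"""


def load_lexicon(pls_text):
    """Map each grapheme to the (pronunciation, preferred) pairs of its lexeme."""
    root = ET.fromstring(pls_text)
    lexicon = {}
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        prons = [(p.text, p.get("prefer") == "true")
                 for p in lexeme.findall("pls:phoneme", PLS_NS)]
        for grapheme in lexeme.findall("pls:grapheme", PLS_NS):
            lexicon[grapheme.text] = prons
    return lexicon


def tts_pronunciation(lexicon, word):
    """TTS renders one pronunciation: the first preferred, else the first listed."""
    prons = lexicon[word]
    for ipa, preferred in prons:
        if preferred:
            return ipa
    return prons[0][0]


def asr_pronunciations(lexicon, word):
    """ASR takes every listed pronunciation into account."""
    return [ipa for ipa, _ in lexicon[word]]


lexicon = load_lexicon(PLS_DOC)
for word in ("Alitalia", "Lufthansa"):
    print(word, "TTS:", tts_pronunciation(lexicon, word),
          "ASR:", asr_pronunciations(lexicon, word))
```

In this sketch "Alitalia" is spoken with its second, preferred pronunciation, while "Lufthansa", which has no preferred entry, falls back to the first one. A recognizer would accept both variants of either name.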

#### **3.5 Call control—CCXML 1.0**

Another language defined by the W3C VBWG targets programming the call control of a voice browser in an innovative way. An XML markup language was developed to define handlers for telephony events generated by a telephone connection or a VoIP SIP interaction. The Voice Browser Call Control (CCXML 1.0) [26] language was designed to allow a very efficient implementation completely based upon events and handlers to avoid creating any latency that might impact the underlying signaling.

#### **Figure 6.** *PLS 1.0 document for flight applications.*

A CCXML engine is also able to send and receive events through an HTTP/HTTPS connector, which allows for the generation of outbound calls from a Web application and for monitoring calls and conferences via a Web interface.

CCXML 1.0 addresses both simple call-handling tasks (see **Figure 7**) and complex ones, such as conditional call handling, conferencing, and coaching. Each CCXML document describes transitions that handle specific events. In **Figure 7**, an incoming call, signaled by a "connection.alerting" event, is accepted by the underlying telephony or VoIP layer; a VoiceXML dialog is started when the "connection.connected" event is received; and the CCXML processor then waits until either the caller disconnects ("connection.disconnected") or the VoiceXML dialog exits ("dialog.exit"). These are simple actions performed during telephony calls, both TDM and VoIP.
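The event-and-handler style of this flow can be sketched in Python. The embedded CCXML document is a hypothetical reconstruction of the kind of flow described for Figure 7, not the figure itself. The dialog URI and the helper names `load_handlers` and `run_session` are assumptions, the event names are those this sketch takes from CCXML 1.0 (including connection.disconnected), and the dispatcher is a toy: it matches event names exactly and only records which action would run.

```python
import xml.etree.ElementTree as ET

CCXML_NS = {"cc": "http://www.w3.org/2002/09/ccxml"}

# Hypothetical document; 'flight_info.vxml' is an assumed dialog URI.
CCXML_DOC = """<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'flight_info.vxml'"/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
    <transition event="dialog.exit">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
"""


def load_handlers(ccxml_text):
    """Map each event name to the names of the actions in its transition."""
    root = ET.fromstring(ccxml_text)
    handlers = {}
    for transition in root.findall("cc:eventprocessor/cc:transition", CCXML_NS):
        actions = [child.tag.rsplit("}", 1)[-1] for child in transition]
        handlers[transition.get("event")] = actions
    return handlers


def run_session(handlers, events):
    """Feed events in order and stop at the first <exit>; return the actions taken."""
    log = []
    for event in events:
        for action in handlers.get(event, []):
            log.append((event, action))
            if action == "exit":
                return log
    return log


handlers = load_handlers(CCXML_DOC)
session = run_session(handlers, ["connection.alerting",
                                 "connection.connected",
                                 "dialog.exit"])
for event, action in session:
    print(event, "->", action)
```

Because every step is a short handler triggered by an event, nothing in the session blocks while waiting for the telephony layer, which is the property the design is meant to preserve.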

While working on the definition of CCXML 1.0, which became a W3C Recommendation in July 2011, the W3C VBWG decided to start another effort to define a state-chart language to generalize the ideas behind CCXML 1.0. This new specification is State Chart XML (SCXML): State Machine Notation for Control Abstraction (SCXML 1.0 [26]), and it can be used as the key component to control a generalized interaction in a multimodal interface. SCXML 1.0 is an XML markup language that provides a generic state-machine-based execution environment inspired by Harel state charts [35].
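To give a feel for the state-machine view, the following Python sketch interprets a very small SCXML document. It is only an approximation of the idea: it handles flat states with event-triggered transitions and ignores hierarchy, parallel states, executable content, and the data model. The state ids, event names, and helper names `load_chart` and `run_chart` are all hypothetical.

```python
import xml.etree.ElementTree as ET

SCXML_NS = {"sc": "http://www.w3.org/2005/07/scxml"}

# Hypothetical chart: state ids and event names are invented for this example.
SCXML_DOC = """<scxml version="1.0" xmlns="http://www.w3.org/2005/07/scxml"
       initial="idle">
  <state id="idle">
    <transition event="call.incoming" target="in_dialog"/>
  </state>
  <state id="in_dialog">
    <transition event="dialog.done" target="finished"/>
    <transition event="caller.hangup" target="finished"/>
  </state>
  <final id="finished"/>
</scxml>
"""


def load_chart(scxml_text):
    """Return (initial state, transition table, set of final states)."""
    root = ET.fromstring(scxml_text)
    table = {}
    for state in root.findall("sc:state", SCXML_NS):
        table[state.get("id")] = {
            t.get("event"): t.get("target")
            for t in state.findall("sc:transition", SCXML_NS)
        }
    finals = {f.get("id") for f in root.findall("sc:final", SCXML_NS)}
    return root.get("initial"), table, finals


def run_chart(scxml_text, events):
    """Process events in order; unmatched events leave the state unchanged."""
    state, table, finals = load_chart(scxml_text)
    for event in events:
        if state in finals:
            break
        state = table.get(state, {}).get(event, state)
    return state


print(run_chart(SCXML_DOC, ["call.incoming", "dialog.done"]))
```

The same interpreter loop works for any chart of this restricted shape, which hints at why a generic state-machine notation can serve as a controller beyond call handling.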

#### **Figure 7.** *Basic handling of incoming calls with CCXML.*

#### **3.6 IETF protocols: MRCPv1 and v2**

The implementation of voice browsing relies on other standards and protocols: the web architecture, with XML documents, namespaces, and caching policies to start with, and obviously the HTTP/HTTPS protocols. All of these are at the core of the W3C VBWG standards. In addition, the Internet Engineering Task Force (IETF) [36] was working on other protocols that were needed.

The Media Resource Control Protocol (MRCP), whose initial draft was proposed by Cisco, SpeechWorks, and Nuance, defines the requests, responses, and events used to control speech engine resources, such as ASR, TTS, and even speaker verification, to enable a distributed and scalable architecture. The initial draft was published by the IETF as MRCPv1 (RFC 4463 [37]), and it was widely implemented by the industry. The protocol was based on the Real-Time Transport Protocol (RTP) for media transport and on the Real Time Streaming Protocol (RTSP) for controlling speech resources.

In the meantime, standardization continued toward MRCPv2, which was instead based on the Session Initiation Protocol (SIP) for signaling and the Session Description Protocol (SDP) for negotiating and exchanging capabilities. The standardization was completed in November 2012 (RFC 6787 [38]), and it enabled the control of new resources for recording, speaker verification, and identification.

For a complete description of MRCPv2 and its relationship with W3C VBWG standards, see [39].

**4. W3C MMIWG standards**

The companion working group, the Multi-Modal Interaction Working Group (MMIWG), led by Deborah Dahl, was attended by almost the same companies as the VBWG. The goal of the MMIWG was to extend the scope of standardization beyond voice or typed input to embrace a much larger set of modalities, such as touch, gesture, emotions, and haptics, both as input to and as output from a system. The major achievements of the W3C MMIWG were the following standards:

• Ink Markup Language, InkML [40], is designed to represent the input of handwriting by a stylus or a finger. In addition to representing traces, InkML offers a rich set of metadata that preserve the appearance of the original input (e.g., color, width, orientation, and timing).

