**3.2 Speech recognition: SRGS 1.0 and SISR 1.0**

#### **Figure 2.**

*A simplified VoiceXML document.*

<sup>1</sup> The original quote is from Alan Kay.

Two standards were created by the W3C VBWG to define the knowledge resources for the ASR engine: speech grammars and semantic interpretation. The first is the formal definition of a speech grammar described in the W3C Recommendation "Speech Recognition Grammar Specification Version 1.0" (SRGS 1.0) [20]. Speech grammars and statistical language models (SLMs) are the two common ways to constrain the speech recognition process. A grammar is a formal definition of all the sentences that can be spoken; it drives the ASR engine to find the closest match with the acoustic signal. A grammar is a strong constraint for the ASR and is relatively simple to implement. SLMs, typically used in speech-to-text systems where the user is not specifically prompted, are in contrast weaker constraints, characterized by the probability of a word being spoken in the context of the preceding words (known as n-gram probabilities). The W3C VBWG standardization effort focused on speech grammars only, because they were useful for simpler recognition tasks, but also because other formats, driven by research, were already commonly used for n-grams<sup>2</sup>. A proposal for an SLM standard in the VBWG is described in [32].
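To make the notion of n-gram probabilities concrete, the following minimal Python sketch estimates bigram probabilities (n = 2) by simple relative frequency. The toy corpus, the `<s>` sentence-start marker, and the function name `bigram_probabilities` are assumptions made for illustration; real SLMs are trained on far larger corpora and use smoothing, which is omitted here.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(word | previous word) by relative frequency (no smoothing)."""
    context_counts = Counter()  # how often each word occurs as a left context
    bigram_counts = Counter()   # how often each (previous, word) pair occurs
    for sentence in sentences:
        # "<s>" is an assumed marker for the start of a sentence.
        tokens = ["<s>"] + sentence.lower().split()
        for previous, word in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


# Hypothetical three-sentence training corpus.
corpus = ["from Rome to Paris", "from Paris to Rome", "to Paris"]
probabilities = bigram_probabilities(corpus)
print(probabilities[("to", "paris")])   # "paris" follows "to" in 2 of 3 cases
print(probabilities[("from", "rome")])  # "rome" follows "from" in 1 of 2 cases
```

Unlike a grammar, which either accepts or rejects a word sequence, such a model only ranks sequences as more or less likely, which is why it acts as a weaker constraint.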

SRGS supports the definition of grammars for speech as well as for DTMF input. A grammar can be specified in two equivalent formats: an XML format, called GrXML, and a more traditional textual format, called ABNF (augmented Backus-Naur form, a notation commonly used to describe the syntax of programming languages). The W3C SRGS 1.0 Recommendation clearly defines both formats and offers a large number of examples (see [20]).


<sup>2</sup> For instance, the well-known MIT ARPA LM format, see http://www.seas.ucla.edu/spapl/weichu/htkbook/node243\_mn.html

#### *Speech Standards: Lessons Learnt DOI: http://dx.doi.org/10.5772/intechopen.93134*

*Human 4.0 - From Biology to Cybernetic*


The SRGS 1.0 specification was immediately adopted by all speech recognition engines, allowing them to interoperate within a VoiceXML platform. Of the two formats, GrXML became the predominant one, but it is very easy to transform a grammar from one format to the other.

**Figure 3** shows an excerpt of an SRGS 1.0 grammar in GrXML format, designed to recognize utterances such as "from Rome to Paris"; the list of cities could be extended as needed.
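As an illustration of how such a grammar defines the complete set of sentences that can be spoken, the Python sketch below embeds a small GrXML grammar and enumerates every sentence it accepts. The grammar text is a hypothetical example of the same shape, written for this sketch rather than copied from Figure 3, and the rule name `trip` is an assumption. The expander handles only `<rule>`, `<ruleref>`, `<one-of>`, and `<item>`; repeats, weights, special rules, and `<tag>` elements are deliberately left out.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical GrXML grammar, written for this sketch (not copied from Figure 3).
GRXML = """
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="trip">
  <rule id="trip" scope="public">
    from <ruleref uri="#city"/> to <ruleref uri="#city"/>
  </rule>
  <rule id="city">
    <one-of>
      <item>Rome</item>
      <item>Paris</item>
    </one-of>
  </rule>
</grammar>
"""

NS = "{http://www.w3.org/2001/06/grammar}"


def expand(element, rules):
    """Return every word sequence (a list of words) that `element` can match."""
    tag = element.tag.replace(NS, "")
    if tag == "ruleref":
        # A local reference such as uri="#city" points to another rule.
        return expand(rules[element.get("uri").lstrip("#")], rules)
    if tag == "one-of":
        # Alternatives: the union of the expansions of each <item>.
        return [seq for item in element for seq in expand(item, rules)]
    # <rule> and <item>: literal words interleaved with child expansions.
    parts = [[(element.text or "").split()]]
    for child in element:
        parts.append(expand(child, rules))
        parts.append([(child.tail or "").split()])
    # Concatenate one choice from each part, in every possible combination.
    return [sum(combo, []) for combo in itertools.product(*parts)]


def sentences(grxml_text):
    """List all sentences accepted by the root rule of a GrXML grammar."""
    root = ET.fromstring(grxml_text)
    rules = {rule.get("id"): rule for rule in root.findall(NS + "rule")}
    return [" ".join(words) for words in expand(rules[root.get("root")], rules)]


for sentence in sentences(GRXML):
    print(sentence)
```

Adding a city to the `<one-of>` list enlarges the set of accepted sentences without any change to the `trip` rule, which is the sense in which the list of cities can be extended.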

The part of the grammar devoted to generating a meaning representation, or semantic interpretation, is shown in blue. This is the domain of the second speech grammar standard produced by the W3C VBWG, "Semantic Interpretation for Speech Recognition Specification Version 1.0" (SISR 1.0) [23].

Semantic results are encapsulated in each rule by means of <tag> elements, which contain snippets of the programming language ECMAScript [33], widely known in its Web variety as JavaScript. The W3C SISR 1.0 Recommendation prescribes the use of the ECMAScript Compact Profile (ECMA-327), a constrained version of ECMAScript. The goal was computational efficiency, so that semantic processing could run compactly inside the speech recognition engine.

In SISR 1.0, each SRGS 1.0 rule, like "city" in **Figure 3**, has a predefined variable called "out" whose properties are assigned within the <tag> elements. The content of the "out" variable of the outermost rule, called the "root" rule, is returned from the recognition engine to the application environment.

For the input utterance "from Rome to Paris," for example, the SRGS grammar in **Figure 3** will return the ECMAScript object:

**{fromcity: "FCO", tocity: "CDG"}**
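The way the "out" variable travels from an inner rule to the root rule can be emulated in a few lines of Python. This is only a sketch of the mechanism: in a real grammar the assignments are ECMAScript snippets inside <tag> elements, and the tag contents quoted in the comments are a plausible form assumed for illustration, not the exact text of Figure 3. The function names and the city-to-code table are likewise hypothetical, chosen to reproduce the result shown above.

```python
# Hypothetical lookup table standing in for the per-item tags of the "city" rule.
CITY_CODES = {"Rome": "FCO", "Paris": "CDG"}


def city_rule(word):
    """Emulate the "city" rule: match one city and set its own `out`."""
    if word not in CITY_CODES:
        return None  # no match
    # Assumed tag form: <item>Rome <tag>out = "FCO";</tag></item>
    out = CITY_CODES[word]
    return out


def root_rule(utterance):
    """Emulate the root rule for utterances of the form "from X to Y"."""
    words = utterance.split()
    if len(words) != 4 or words[0] != "from" or words[2] != "to":
        return None  # utterance is outside the grammar
    from_city = city_rule(words[1])
    to_city = city_rule(words[3])
    if from_city is None or to_city is None:
        return None
    out = {}
    # Assumed tag form: <tag>out.fromcity = rules.city;</tag>
    out["fromcity"] = from_city
    # Assumed tag form: <tag>out.tocity = rules.city;</tag>
    out["tocity"] = to_city
    return out  # the root rule's `out` is what the application receives


print(root_rule("from Rome to Paris"))
```

Each rule fills in only its own "out"; the root rule then picks up the value produced by the referenced rule, so the application sees a single structured object rather than the raw word string.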

This is the case for simple and focused grammars where the result is just one or a few values. However, SISR also supports conditional logic and algorithms. This is useful, for instance, to validate a checksum in a complex numeric string (e.g., a credit card number) or alphanumeric string (such as the personal tax ID in Italy). It allows the recognizer to validate and possibly reject a wrong result before returning it to the application and, at the same time, to increase the confidence of an alternative, and possibly correct, recognition result.
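The checksum idea can be sketched with the Luhn algorithm, the check-digit scheme commonly used for credit card numbers. The sketch is written in Python to show the logic only; inside a grammar the equivalent code would be ECMAScript in a <tag> element. The function names and the n-best list are hypothetical, and the promotion of the next hypothesis is a simplification of how an engine would actually rescore its alternatives.

```python
def luhn_valid(digits):
    """Return True if a string of decimal digits passes the Luhn check."""
    if not digits.isdecimal():
        return False  # also rejects the empty string
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value
    return total % 10 == 0


def first_valid_hypothesis(n_best):
    """Return the best-ranked hypothesis that passes the checksum, if any."""
    for hypothesis in n_best:
        if luhn_valid(hypothesis):
            return hypothesis
    return None


# Hypothetical n-best list: the top hypothesis has one misrecognized digit.
n_best = ["4111111111111112", "4111111111111111"]
print(first_valid_hypothesis(n_best))
```

Here the top-ranked hypothesis fails the check and is discarded, so the second one is returned instead, which mirrors the rejection and promotion behaviour described above.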

#### **Figure 3.** *Simple SRGS grammar with SISR script.*
