**4.1 A brief history of speech synthesis techniques and how to control expressiveness**

The goal of a speech synthesis system is to generate an audio speech signal corresponding to any input text.

A sentence is made up of characters, and a human knows how these characters should be pronounced. If we want a machine to generate a speech signal from text, we have to teach it, or program it, to do the same.

Such systems have been developed for decades and many different approaches have been used. Here, we summarize them in three categories: Concatenation, Parametric Speech Synthesis, and Statistical Parametric Speech Synthesis. The state of the art, however, is more diverse and complex, containing many variants and hybrid approaches between them.

#### *4.1.1 Concatenation*

This approach is based on the concatenation of pieces of audio signal corresponding to different phonemes. The method proceeds in several steps. First, the characters have to be converted into the corresponding phones to be pronounced; a simplistic approach is to assume, for example, that one letter corresponds to one phoneme. Then the computer must know which signal corresponds to each phoneme. One possibility is to record a database containing all the phonemes existing in a given language.
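To make these steps concrete, here is a minimal sketch under strong assumptions: a hypothetical one-letter-to-one-phoneme table and a hypothetical directory of per-phoneme recordings (`db/<phoneme>.wav`), all sharing one sample rate. Real systems need full text processing and grapheme-to-phoneme conversion.

```python
# Naive phoneme concatenation (hypothetical data layout and mapping).
import numpy as np
import soundfile as sf  # pip install soundfile

# Hypothetical letter-to-phoneme table; real languages need a full G2P model.
LETTER_TO_PHONEME = {"h": "h", "e": "eh", "l": "l", "o": "ow"}

def synthesize(text: str, db_dir: str = "db") -> np.ndarray:
    """Concatenate pre-recorded phoneme waveforms, one per letter."""
    chunks = []
    for letter in text.lower():
        phoneme = LETTER_TO_PHONEME.get(letter)
        if phoneme is None:
            continue  # skip characters with no mapping (spaces, punctuation)
        audio, _sr = sf.read(f"{db_dir}/{phoneme}.wav")
        chunks.append(audio)
    return np.concatenate(chunks)

# signal = synthesize("hello")
```

Even when all the recordings exist, the abrupt joins between chunks are exactly the unnatural transitions discussed next.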


However, concatenating phones one after another leads to very unnatural transitions between them. In the literature, this problem was tackled by recording successions of two phonemes, called diphones, instead of single phones. All combinations of diphones are recorded in a dataset, and speech is then generated by concatenating these diphones. Even so, many assumptions of this approach are not met in practice.

First, text processing has to be performed: text contains punctuation, numbers, abbreviations, etc. Moreover, the letter-to-sound relationship is not one-to-one in English and in many other languages, and the pronunciation of words often depends on the context. Also, concatenating phones leads to a choppy signal, and the prosody of the generated signal is unnatural. To control expressiveness with diphone concatenation techniques, it is possible to change *F*0 and duration with signal processing techniques, at the cost of some distortion of the signal. Other parameters cannot be controlled without altering the signal, leading to unnatural speech.
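As an illustration of such signal-level modifications, the sketch below uses generic phase-vocoder tools from librosa; the file name is hypothetical, and real diphone systems use purpose-built methods such as PSOLA.

```python
# Modifying duration and F0 of a recorded unit with generic tools.
import librosa

y, sr = librosa.load("diphone.wav", sr=None)  # hypothetical recording

# Time-stretch: change duration without changing pitch (rate < 1 = slower).
y_slow = librosa.effects.time_stretch(y, rate=0.8)

# Pitch-shift: change F0 without changing duration (in semitones).
y_high = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)

# Both operations introduce the kind of distortion mentioned above,
# which is why heavy modification sounds unnatural.
```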

Another approach, also based on the concatenation of pieces of signal, is Unit Selection. Instead of concatenating phones (or diphones), larger parts of words are concatenated. An algorithm has to select the best units according to criteria such as few discontinuities in the generated speech signal and a consistent prosody.
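A common way to formalize this selection, in the spirit of classical unit-selection systems, is a Viterbi-style dynamic program over a target cost (mismatch with the desired specification) and a join cost (discontinuity between consecutive units). The sketch below assumes hypothetical `target_cost` and `join_cost` functions and candidate lists.

```python
# Viterbi search for the cheapest sequence of units (minimal sketch).
import numpy as np

def select_units(targets, candidates, target_cost, join_cost):
    """targets: desired specs; candidates[i]: available units for target i.
    Returns the index of the chosen candidate at each position."""
    n = len(targets)
    # best[i][j]: lowest total cost of a path ending in candidate j at step i
    best = [np.array([target_cost(targets[0], c) for c in candidates[0]])]
    back = []
    for i in range(1, n):
        tc = np.array([target_cost(targets[i], c) for c in candidates[i]])
        # join cost between every previous unit and every current unit
        jc = np.array([[join_cost(p, c) for c in candidates[i]]
                       for p in candidates[i - 1]])
        total = best[-1][:, None] + jc + tc[None, :]
        back.append(total.argmin(axis=0))
        best.append(total.min(axis=0))
    # backtrack the cheapest path
    path = [int(best[-1].argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy usage: match a target "pitch" while keeping consecutive pitches close.
# select_units([100, 120, 110], [[95, 130], [118, 90], [105, 140]],
#              target_cost=lambda t, u: abs(t - u),
#              join_cost=lambda a, b: 0.5 * abs(a - b))
```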

For this purpose, a much larger dataset must be recorded, containing a large variety of different combinations of phone series. The machine must know which part of the signal corresponds to which phoneme, which means the data has to be annotated accurately by hand. This annotation process is time-consuming. Today, there exist tools to do this task automatically, and this automation can in fact be done at the same time as synthesis, as we will see later.

The advantage of this method is that the signal is less altered and most of the transitions between phones are natural, because they come as is from the dataset.

With this method, a possibility to synthesize emotional speech is to record a dataset with separate categories of emotion. During synthesis, only units coming from the desired category are used [8]. The drawback is that this is limited to discrete categories, without any continuous control.

#### *4.1.2 Parametric speech synthesis*

Parametric Speech Synthesis is based on modeling how the signal is generated, which makes the process interpretable. In general, however, simplistic assumptions have to be made to model speech.

**Figure 4.**
*Diagram describing voice production mechanism and source-filter model.*

Anatomically, the speech signal originates from an excitation signal produced in the larynx. This excitation signal is transformed by resonance through the vocal tract, which acts as a filter constituted by the guttural, oral, and nasal cavities. If this excitation signal is generated by glottal pulses, then a voiced sound is obtained. Glottal pulses are generated by a series of openings and closures of the vocal cords, or vocal folds. The vibration of the vocal cords has a fundamental frequency.

As opposed to voiced sounds, when the excitation signal is a simple flow of exhaled air, an unvoiced sound is obtained.

The source-filter model is a way to represent speech production that separates the excitation from the resonance phenomenon in the vocal tract. It assumes that these two phenomena are completely decoupled: the source corresponds to the glottal excitation and the filter corresponds to the vocal tract. This principle is illustrated in **Figure 4**<sup>1</sup>.

An example of parametric speech modeling is the linear prediction (LP) model. The LP model uses this theory, assuming that speech is the output signal of a recursive digital filter receiving an excitation at its input. In other words, it is assumed that each sample can be predicted by a linear combination of the previous *p* samples. Linear predictive coding works by estimating the coefficients of this digital filter representing the vocal tract. The number of coefficients has to be chosen: the more coefficients we take, the better the vocal tract is represented, but the more complex the analysis becomes. The excitation signal can then be computed by applying the inverse filter to the speech signal.
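A minimal sketch of this analysis, assuming a hypothetical `speech.wav` file and an illustrative order of *p* = 16:

```python
# LP analysis and inverse filtering with librosa and scipy.
import librosa
import scipy.signal

y, sr = librosa.load("speech.wav", sr=None)
p = 16  # number of LP coefficients: more = finer vocal-tract model

# a = [1, a_1, ..., a_p]: coefficients of the prediction-error filter A(z)
a = librosa.lpc(y, order=p)

# Inverse filtering the speech with A(z) yields the excitation (residual).
excitation = scipy.signal.lfilter(a, [1.0], y)

# Passing the excitation back through 1/A(z) recovers the speech signal.
y_hat = scipy.signal.lfilter([1.0], a, excitation)
```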

In synthesis, this excitation signal is modeled by a train of impulses. In reality, the mechanics of the vocal folds are more complex, making this assumption too simplistic.

The vocal tract is a variable filter: depending on the shape given to the vocal tract, different sounds are produced. A filter is considered constant over a short period of time, and a different filter has to be computed for each such period.
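Combining the impulse-train excitation with frame-wise filters, a crude LP synthesizer can be sketched as below; the frame size, *F*0, and LP order are illustrative assumptions, and a real system would interpolate filters and vary *F*0 over time.

```python
# Frame-wise LP synthesis driven by an impulse-train excitation (sketch).
import numpy as np
import librosa
import scipy.signal

y, sr = librosa.load("speech.wav", sr=None)  # hypothetical recording
frame = 1024           # samples per frame (filter assumed constant within)
f0 = 120.0             # assumed constant fundamental frequency, in Hz
period = int(sr / f0)  # samples between glottal impulses

out = []
for start in range(0, len(y) - frame, frame):
    a = librosa.lpc(y[start:start + frame], order=16)  # one filter per frame
    excitation = np.zeros(frame)
    excitation[::period] = 1.0  # impulse train: a crude voiced source
    out.append(scipy.signal.lfilter([1.0], a, excitation))

synth = np.concatenate(out)  # intelligible but robotic, as noted below
```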

This approach has been successful at synthesizing intelligible speech, but not natural, human-sounding speech.

For expressive speech synthesis, this technique has the advantage of giving access to many parameters of speech, allowing fine control.

The approach used in [9] to discover how to control a set of parameters to obtain a desired emotion relied on perception tests. A set of sentences was synthesized with different values of these parameters. These sentences were then used in listening tests in which participants were asked to answer questions about the emotion they perceived. Based on these results, values of the different parameters were associated with the emotion expressions.

<sup>1</sup> Vocal tract image from: https://en.wikipedia.org/wiki/User:Tavin#/media/File:VocalTract.svg - Tavin/CC-BY-3.0

To find a good loss function, it is necessary to understand the statistics of the data we want to predict and how to compare them. For this, concepts from information theory are used.
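As a small illustration of such an information-theoretic comparison (not taken from the chapter), the cross-entropy between a target distribution and a predicted one can be computed as follows, with made-up probabilities:

```python
# Cross-entropy H(p, q) = -sum p(x) log q(x): measures how well the
# predicted probabilities q match the target distribution p.
import numpy as np

p = np.array([1.0, 0.0, 0.0])  # target: a one-hot "true" class
q = np.array([0.7, 0.2, 0.1])  # model prediction (illustrative values)

cross_entropy = -np.sum(p * np.log(q + 1e-12))  # epsilon avoids log(0)
print(cross_entropy)  # ~0.357 nats; lower means a better prediction
```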

#### *4.2.1 Different operations and architectures*

The mathematical function used to process the signal can be composed of many different operations. Some of these operations were found to be very effective in different fields and are widely used. In this section, we describe some operations relevant for speech synthesis. In Deep Learning, the ensemble of operations applied to a signal to obtain a prediction is called the *Architecture*. There is an important research interest in designing architectures for different tasks and data to process. This research reports empirical results comparing the performance of different combinations. The progress of this field is directly related to the computation power available on the market.

Historically, the root of Deep Learning is a model called the Neural Network. This model was inspired by the role of neurons in the brain, which communicate with electrical impulses and process information. Since then, more recent models have moved away from this analogy and evolve depending on their actual performance.

Fully connected neural networks are successions of linear projections followed by non-linearities (sigmoid, hyperbolic tangent, etc.) called layers:

$$h = f_h(W_h x + b_h) \tag{4}$$

where $x$ is the input vector, $h$ the hidden layer vector, $f_h$ the activation function, and $W_h$ and $b_h$ the parameter matrix and vector.
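A minimal NumPy sketch of Eq. 4, stacking two such layers with arbitrary dimensions and random parameters:

```python
# Fully connected layers per Eq. 4: a linear projection W_h x + b_h
# followed by a non-linearity f_h (here, tanh).
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    return np.tanh(W @ x + b)  # h = f_h(W_h x + b_h)

x = rng.standard_normal(10)                           # input vector
W1, b1 = rng.standard_normal((32, 10)), np.zeros(32)  # layer 1 parameters
W2, b2 = rng.standard_normal((4, 32)), np.zeros(4)    # layer 2 parameters

h = layer(x, W1, b1)  # hidden layer vector
y = layer(h, W2, b2)  # stacking layers = more intermediate representations
```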

More layers imply more parameters and thus a more complex model. It also means more intermediate representations and transformation steps. It was shown that deeper Neural Networks (more layers) performed better than shallow ones (fewer layers). This observation led to the names Deep Neural Networks (DNNs) and Deep Learning. A complex model is capable of modeling a complex task, but it is also more costly to optimize in terms of computation power and data.

The Merlin toolkit [16] has been an important tool to investigate the use of DNNs for speech synthesis. The first models developed within Merlin were based only on fully connected neural networks: one DNN was used to predict acoustic parameters and another one to predict phone durations. It was a first successful attempt that outperformed other statistical approaches at the time.

However, time dependencies are not well modeled, and this approach ignores the autoregressive nature of the speech signal. In reality, it relies heavily on data and does not use enough knowledge.

Convolutional Neural Networks (CNNs) refer to the operation of convolution and are reminiscent of the convolution filters of signal processing (Eq. 5). A convolution layer can thus be seen as a convolutional filter whose coefficients were obtained by training the Deep Learning architecture:

$$g(x, y) = (\omega * f)(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} \omega(s, t)\, f(x - s, y - t) \tag{5}$$
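A direct, unoptimized rendering of Eq. 5 in NumPy (array sizes are arbitrary, and odd kernel dimensions are assumed; in a CNN the kernel ω is learned rather than hand-designed, and libraries compute this far more efficiently):

```python
# 2D convolution exactly as written in Eq. 5, over the valid region only.
import numpy as np

def conv2d(f, w):
    """g(x, y) = sum_s sum_t w(s, t) * f(x - s, y - t)."""
    a, b = w.shape[0] // 2, w.shape[1] // 2
    H, W = f.shape
    g = np.zeros((H - 2 * a, W - 2 * b))
    for x in range(a, H - a):
        for y in range(b, W - b):
            total = 0.0
            for s in range(-a, a + 1):
                for t in range(-b, b + 1):
                    total += w[s + a, t + b] * f[x - s, y - t]
            g[x - a, y - b] = total
    return g

f = np.random.rand(8, 8)  # input, e.g. a spectrogram excerpt
w = np.random.rand(3, 3)  # kernel: in a CNN, found by training
g = conv2d(f, w)          # output feature map, shape (6, 6)
```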
