Convolutional filters were studied in the field of image processing. We know what filters to apply to detect edges, to blur an image, etc. A convolutional layer computes an output matrix *g* by applying convolutional filter weights *ω* to an input matrix *f*.

In practice, the operation implemented is often correlation, which is the same operation except that the filter is not flipped. Given that the parameters of the filters are optimized during training, the flipping is unnecessary: the filter optimized with a correlation implementation is simply the flipped version of the one that would have been obtained if convolution had been implemented. For speech synthesis, convolutional layers have been used to extract a representation of linguistic features and to predict spectral speech features.

For a temporal signal such as speech, one-dimensional convolution along the time axis makes it possible to model time dependencies. As layers are stacked, the receptive field increases proportionally. In speech, there are long-term dependencies in the signal, for example, in the intonation and the emphasis of some words. To model these long-term dependencies, dilated convolution was proposed: it increases the receptive field exponentially instead of proportionally with the number of layers.

A Recurrent Neural Network involves a recursive behavior, that is, an information feedback from the output to the input. This is analogous to recursive filters. Recursive filters are designed for temporal signals because they are able to model causal dependencies: at a given time *t*, the value depends on the past values of the signal.

$$h\_t = f^h\left(W\_h x\_t + U\_h h\_{t-1} + b\_h\right) \tag{6}$$

$$y\_t = f^y\left(W\_y h\_t + b\_y\right) \tag{7}$$

where *x<sub>t</sub>* is the input vector, *h<sub>t</sub>* the hidden layer vector, *y<sub>t</sub>* the output vector, *W*, *U* and *b* the parameter matrices and bias vector, and *f<sup>h</sup>* and *f<sup>y</sup>* the activation functions.
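As a minimal sketch of Eqs. (6) and (7), the following Python/NumPy snippet implements one recurrent step; the toy dimensions and the choice of tanh and identity activations are assumptions for illustration, not details given in the chapter.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One step of a vanilla RNN, following Eqs. (6)-(7)."""
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)  # f^h: tanh (assumed)
    y_t = W_y @ h_t + b_y                          # f^y: identity (assumed)
    return h_t, y_t

# Toy dimensions (assumptions): 3-dim input, 4-dim hidden, 2-dim output.
rng = np.random.default_rng(0)
W_h, U_h, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
W_y, b_y = rng.normal(size=(2, 4)), np.zeros(2)

h = np.zeros(4)                    # initial hidden state
for x in rng.normal(size=(5, 3)):  # a length-5 input sequence
    h, y = rnn_step(x, h, W_h, U_h, b_h, W_y, b_y)
```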
### *4.2.2 Encoder and decoder*

An encoder is a part of a neural network that outputs a hidden representation (or latent representation) from an input. A decoder is a part of a neural network that retrieves an output from a latent representation.

When the input and the output are the same, we talk about auto-encoders. The task in itself is useless, but the interesting part here is the latent representation. The latent space of an auto-encoder can provide interesting properties such as a lower dimensionality, meaning a compressed representation of the initial data, or meaningful distances between examples.

### *4.2.3 Sequence-to-sequence modeling and attention mechanism*

A sequence-to-sequence task is about converting sequential data from one domain to another, for example, from one language to another (translation), from speech to text (speech recognition), or from text to speech (speech synthesis). The first Deep Learning architectures for solving sequence-to-sequence tasks were based on encoder-decoders with RNNs, called RNN transducers.
Other techniques were found to outperform this approach. In particular, the use of an Attention Mechanism was found beneficial [17].

The Attention Mechanism was first developed in the field of computer vision. It was then successfully applied to Automatic Speech Recognition (ASR) and later to Text-to-Speech synthesis (TTS).

In a Deep Learning architecture, a matrix is computed and used to weight the hidden representation at a given layer. The weighted representation is then fed to the rest of the architecture. During training, this matrix thus learns to emphasize the parts of the signal that are important for reducing the loss. It is called the Attention matrix because it represents the importance of the different regions of the data.
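As a minimal sketch of this weighting, assuming a standard dot-product attention with softmax normalization (a common formulation, not a detail given in this chapter), in Python/NumPy:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hidden representation of a layer: 6 positions, 8 features (toy sizes).
rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))

# A query vector asking "which positions matter now?" (assumed learned).
q = rng.normal(size=8)

weights = softmax(H @ q)  # attention weights, one per position
context = weights @ H     # weighted sum: the emphasized representation
```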

In computer vision, a good illustration of this mechanism is an object classification task: the attention matrix has high weights in the region corresponding to the object and low weights in the region corresponding to the background of the image.

In ASR, this mechanism has been used in the so-called *Listen, Attend and Spell* (LAS) setup [18]. An important difference compared to the previous case is that ASR is a sequential problem: the architecture must include an information feedback, giving it a recursive character, and each time step must be computed based on the previous time steps.

LAS designates the three parts of the Deep Learning architecture. The first one encodes the audio features into a hidden representation. The role of the last one is to generate text from a hidden representation. Between this encoder and decoder, at each time step, an Attention Mechanism computes a vector that weights the audio encoding. This weighting vector should give importance to the part of the utterance to which the architecture should pay attention in order to generate the corresponding piece of text.

An Attention plot (**Figure 5**) of a generated sentence can be constructed by juxtaposing all the weighting vectors computed during its generation. The resulting matrix can then be visualized by mapping a color scale onto its values.
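A minimal sketch of how such a plot can be produced, assuming the weighting vectors have been collected at each output step; the variable names, the simulated alignment, and the matplotlib rendering are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assume one attention weight vector (over N input characters) was stored
# at each of the T output frames during generation; simulated here.
n_chars, n_frames = 20, 80
attention_vectors = []
for t in range(n_frames):
    center = t * n_chars / n_frames  # roughly diagonal alignment
    w = np.exp(-0.5 * ((np.arange(n_chars) - center) / 1.5) ** 2)
    attention_vectors.append(w / w.sum())

# Juxtapose the vectors into a matrix and map a color scale onto it.
A = np.stack(attention_vectors, axis=1)  # shape: (chars, frames)
plt.imshow(A, aspect="auto", origin="lower")
plt.xlabel("audio frame index")
plt.ylabel("character index")
plt.colorbar(label="attention weight")
plt.show()
```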

This attention plot shows an attention path, that is, the importance given to each character along the audio output timeline. As can be observed in **Figure 5**, this attention path should have a close-to-diagonal shape. Indeed, the two sequences have a close chronological relationship.

**Figure 5.**
*Alignment plot. The y-axis represents the character indices and the x-axis represents the audio frame feature indices. The color scale corresponds to the weight given to a given character to predict a given audio frame.*

#### **4.3 Information theory and speech probability distributions**

Information Theory is about optimizing how to send messages with as few resources as possible. To that end, the goal is to compress the information by using the right code, so that the messages do not contain redundancies and are as small as possible.


### *4.3.1 Information and probabilities*

Shannon's Information Theory quantifies information based on the probability of outcomes. If we know an event will occur, its occurrence gives no information. The less likely an event is to happen, the more information its occurrence gives.

This relationship between the information and the probability of an event is given by the Shannon information content, measured in bits. A bit is a variable that can take two different values: 0 or 1.

$$h(\mathbf{x}) = \log\_2\left(\frac{1}{p(\mathbf{x})}\right) \tag{8}$$


The number of possible messages with *L* bits is 2*<sup>L</sup>*. If all messages are equally probable, the probability of each message is *p* = 1/2*<sup>L</sup>*. We then have *L* = log<sub>2</sub>(1/*p*). A generalization of this formula, in which the messages are not equally probable, is Eq. (8). It can be interpreted as the minimal number of bits needed to communicate the message.
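As a small numerical check of Eq. (8), a sketch in Python (the example probabilities are chosen here for illustration):

```python
import numpy as np

def information_content(p):
    """Shannon information content in bits of an outcome with probability p (Eq. 8)."""
    return np.log2(1.0 / p)

print(information_content(1 / 2))  # 1.0 bit: a fair coin flip
print(information_content(1 / 8))  # 3.0 bits: one of 8 equally likely messages
print(information_content(0.99))   # ~0.014 bits: an almost certain event
```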

The probability represents the degree of belief that an event will happen [19]. For example, we can ask for the probability of obtaining a four when rolling a six-sided die, or the probability that the next letter in a text is the letter *r*.

These probabilities depend on the assumptions we make:

• Is the die perfectly balanced? If yes, the probability of a result of four is 1/6.

• What is the language of the text? Do we know the subject, etc.? Depending on this information, we can have different estimations of this probability.
We obtain a probability distribution by listing the probabilities of all the possible outcomes. For the example of rolling the perfectly balanced die, the possible outcomes are [1, 2, 3, 4, 5, 6] and their probabilities are [1/6, 1/6, 1/6, 1/6, 1/6, 1/6].

In both examples, we have a finite number of possible outcomes. The probability distribution is said to be discrete. On the contrary, when the possible outcomes are distributed on a continuous interval, then the probability distribution is said to be continuous. This is the case, for example, of amplitude values in a spectrogram.

The most famous continuous probability distribution is the Gaussian distribution:

$$p(\mathbf{x}) = \frac{1}{\sigma\sqrt{2\pi}}\mathbf{e}^{-\frac{1}{2}\left(\frac{\mathbf{x}-\boldsymbol{\mu}}{\sigma}\right)^2} \tag{9}$$

Another important distribution, especially in speech processing, is the Laplacian distribution:

$$p(\mathbf{x}) = \frac{\mathbf{1}}{2b} \mathbf{e}^{\left(-\frac{|\mathbf{x} - \boldsymbol{\mu}|}{b}\right)} \tag{10}$$

Both distributions are plotted in **Figure 6**. The blue curve corresponds to the Gaussian probability distribution (with *μ* = 0 and *σ* = 0.5) and the red curve to the Laplacian probability distribution (with *μ* = 0 and *b* = 0.5). For both distributions, the maximum is at *μ*, and the density decreases symmetrically as the distance from *μ* increases.


**Figure 6.** *In blue: Gaussian distribution with μ = 0 and σ = 0.5. In red: Laplacian distribution with μ = 0 and b = 0.5.*
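A minimal sketch to reproduce a plot like Figure 6 with Python/NumPy and matplotlib, following Eqs. (9) and (10) (the grid range is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 500)
mu, sigma, b = 0.0, 0.5, 0.5

gaussian = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))  # Eq. (9)
laplacian = np.exp(-np.abs(x - mu) / b) / (2 * b)                                 # Eq. (10)

plt.plot(x, gaussian, "b", label="Gaussian ($\\mu=0$, $\\sigma=0.5$)")
plt.plot(x, laplacian, "r", label="Laplacian ($\\mu=0$, $b=0.5$)")
plt.legend()
plt.show()
```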

### *4.3.2 Entropy and relative-entropy*


The average information content of an outcome of the probability distribution *p*, also called entropy, is:

$$H(p) = \sum\_{\mathfrak{x}} p(\mathfrak{x}) \log\_2 \left( \frac{1}{p(\mathfrak{x})} \right) \tag{11}$$

The relative-entropy between two probability distributions, also called Kullback-Leibler divergence, is defined as:

$$D\_{\rm KL}(p \| q) = \sum\_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} \tag{12}$$

It represents a dissimilarity between two probability distributions.
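A minimal sketch of Eqs. (11) and (12) for discrete distributions (the two example distributions are arbitrary choices):

```python
import numpy as np

def entropy(p):
    """Entropy in bits (Eq. 11)."""
    p = np.asarray(p)
    return np.sum(p * np.log2(1.0 / p))

def kl_divergence(p, q):
    """Relative entropy / Kullback-Leibler divergence (Eq. 12)."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

fair_die = np.ones(6) / 6
loaded_die = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.75])

print(entropy(fair_die))                     # log2(6) ~ 2.585 bits, the maximum
print(entropy(loaded_die))                   # lower: the loaded die is more predictable
print(kl_divergence(loaded_die, fair_die))   # > 0: the distributions differ
```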

### *4.3.3 Maximum likelihood and particular cases*

The concept of maximum likelihood is necessary to understand how to train a Deep Learning algorithm or, more generally, how to find the optimal parameters of a model. The role of a statistical model is to represent as accurately as possible the behavior of a probability distribution.

Maximum likelihood estimation (MLE) (Eq. 13) makes it possible to estimate the parameters *θ* of a statistical parametric model *p*(**x**|*θ*) by maximizing the probability of a dataset under the assumed statistical model, that is, the Deep Learning architecture.

$$\theta\_{\text{MLE}} = \arg\max\_{\theta} p(\mathbf{x}|\theta) \tag{13}$$

It can be demonstrated that this is equivalent to minimizing *D*<sub>KL</sub>(*q*∥*p*), with *p* the probability distribution of the model and *q* the probability distribution of the data [20]. It is a way of expressing that the probability distribution generated by the model should be as close as possible to the probability distribution of the data.

If assumptions can be made about the probability distributions, it is possible to obtain distances or errors whose minimization is equivalent to MLE. These errors are computed by comparing the estimations of the model *Ŷ<sub>i</sub>* with the values *Y<sub>i</sub>* from the dataset.

Maximizing likelihood assuming a Gaussian distribution is equivalent to minimizing Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{n} \sum\_{i=1}^{n} \left( Y\_i - \hat{Y}\_i \right)^2 \tag{14}$$

Maximizing likelihood assuming a Laplacian distribution is equivalent to minimizing Mean Absolute Error (MAE):

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |Y\_i - \hat{Y}\_i| \tag{15}$$

To choose the right criterion to optimize when working with speech data, one should pay attention to speech probability distributions. The distributions of speech waveforms and magnitude spectrograms are Laplacian [21, 22]. That is why the MAE loss should be used to optimize their predictions.
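As a sketch of why this choice matters, the following toy experiment fits a constant predictor to Laplacian-distributed samples under both losses: the MSE minimizer is the sample mean while the MAE minimizer is the sample median, mirroring the Gaussian and Laplacian MLE solutions of Eqs. (14) and (15). The sample size and scale are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.laplace(loc=0.0, scale=0.5, size=10_000)  # Laplacian "speech-like" data

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)   # Eq. (14)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))  # Eq. (15)

# Closed-form minimizers of each loss for a constant prediction:
print(np.mean(y))    # minimizes MSE <-> Gaussian MLE for the location
print(np.median(y))  # minimizes MAE <-> Laplacian MLE for the location
print(mse(y, np.mean(y)), mae(y, np.median(y)))
```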
