*Human 4.0 - From Biology to Cybernetic*

**2. Expressive speech analysis**

recently Deep Neural Networks (DNNs). The field of Machine Learning allows machines to learn to solve a given task. It borrows from an ensemble of statistical models used to represent or transform data, and it uses concepts from Information Theory to measure distances between probability distributions.

**2.1 Digital signal processing**

A signal is a variation of a physical quantity carrying information. The acoustic speech signal is converted into an electrical signal by a microphone. An acoustic signal is a variation of pressure in a fluid that humans perceive through the sense of hearing. This signal is one-dimensional because it can be represented by a mathematical function with a single variable: pressure.

The electrical signal generated by the microphone is an analog signal. In order to process it with a digital machine, it must be digitized. This is done by electronic systems called analog-to-digital converters, which sample and quantize analog signals to convert them into digital signals. After some processing of the digitized signal, a digital-to-analog converter can be used to convert the processed digital signal back into an analog signal. This analog electrical signal can then be converted into an acoustic signal through loudspeakers or earphones to make it available to human ears. These steps are represented in **Figure 1**.

Digital signal processing [1] is the set of theories and techniques for analyzing, synthesizing, quantifying, classifying, predicting, or recognizing signals using digital systems.

A digital system receives as input a sequence of samples $\{x(0), x(1), x(2), \ldots\}$, noted $x(n)$, and produces as output a sequence of samples $y(n)$ after application of a series of algebraic operations.

A digital filter is a linear and invariant digital system. Let us consider a digital system that receives the sample sequences $x_1(n)$ and $x_2(n)$ as input. This system will respectively produce the sample sequences $y_1(n)$ and $y_2(n)$ as output. This system is linear if it produces the output $\alpha y_1(n) + \beta y_2(n)$ when it receives the sequence $\alpha x_1(n) + \beta x_2(n)$ as input. A digital system is said to be invariant if shifting the input sequence by $n_0$ samples also shifts the output sequence by $n_0$ samples.

These linear and invariant digital systems can be described by equations of the type:

$$y(n) + a_1 y(n-1) + a_2 y(n-2) + \ldots + a_N y(n-N) = b_0 x(n) + b_1 x(n-1) + \ldots + b_M x(n-M) \tag{1}$$

or

$$y(n) + \sum_{i=1}^{N} a_i y(n-i) = \sum_{i=0}^{M} b_i x(n-i) \tag{2}$$

This is equivalent to saying that the output $y(n)$ is a linear combination of the last $N$ outputs, the input $x(n)$, and the $M$ previous inputs. A digital filter is therefore fully determined if the coefficients $a_i$ and $b_i$ are known. A filter is called non-recursive if only the inputs are used to compute $y(n)$. If at least one of the previous output samples is used, it is called a recursive filter.
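As an illustrative sketch (not part of the chapter), the difference equation (2) can be implemented directly; `apply_filter` is a hypothetical helper name, and libraries such as SciPy provide optimized equivalents (`scipy.signal.lfilter`):

```python
def apply_filter(b, a, x):
    """Apply the difference equation (2):
    y(n) = sum_{i=0..M} b[i]*x(n-i) - sum_{i=1..N} a[i]*y(n-i),
    where b = [b0, ..., bM] and a = [a1, ..., aN] (feedback coefficients),
    assuming zero initial conditions (terms with n - i < 0 are dropped)."""
    y = []
    for n in range(len(x)):
        # feed-forward part: linear combination of current and past inputs
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        # feedback part: linear combination of past outputs (recursive filter)
        acc -= sum(a[i - 1] * y[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        y.append(acc)
    return y

# Non-recursive example: two-point moving average
print(apply_filter([0.5, 0.5], [], [1.0, 1.0, 1.0, 1.0]))  # [0.5, 1.0, 1.0, 1.0]

# Recursive example: y(n) = x(n) + 0.5*y(n-1), impulse response
print(apply_filter([1.0], [-0.5], [1.0, 0.0, 0.0, 0.0]))   # [1.0, 0.5, 0.25, 0.125]
```

The two examples show the distinction made above: the moving average uses only inputs (non-recursive), while the second filter reuses a previous output sample (recursive).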

**2.2 Speech features**

Speech is a signal carrying a lot of information, ranging from the sequence of words used to form a sentence to the tone of voice used to utter it. Not all of this information is necessary to process, and for some systems, trying to process all of it can harm efficiency. Speech can also be corrupted by noise before reception. That is why an important step in speech analysis is to extract descriptors, or features, that are relevant to the task of interest.

There exist many different feature spaces that describe speech information. In this section, we give an intuitive explanation of the ones widely used in Deep Learning architectures.

#### *2.2.1 Power spectral density and spectrogram*

Fourier analysis demonstrates that any physical signal can be decomposed into a sum of sinusoids of different frequencies. The power spectral density of a signal describes the amount of power carried by the different frequency bands of this signal.

This range of frequencies may be a discrete value set or a continuous frequency spectrum. In the field of digital signal processing, this power spectral density can be calculated by the Fast Fourier Transform (FFT) algorithm.

The graph of the power spectral density makes it possible to visualize the frequency characteristics of a signal, such as the fundamental frequency of a periodic signal and its harmonics. A periodic signal is one whose period repeats indefinitely; the number of periods per unit of time is the fundamental frequency. Harmonics are the integer multiples of the fundamental frequency. These frequencies carry an important power density and therefore appear as extrema in the power spectral density.

An example of power spectral density is shown in the upper part of **Figure 2**. The first maximum is at the fundamental frequency which is 145.5 Hz. The other maxima are the harmonics.

When the signal's characteristics evolve over time, as with the voice for example, the spectrogram can be used to visualize this evolution. The spectrogram represents the power spectral density over time. An example of a power spectrogram is shown in the lower part of **Figure 2**. The x-axis is time and the y-axis is frequency; the colors correspond to the power density, with a color scale given on the right of the graph. The spectrogram is thus constructed by juxtaposing power spectral density functions computed on every frame, as suggested in **Figure 2**.
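The framing-and-FFT construction described above can be sketched in a few lines of NumPy; the function name and the frame length and hop size are illustrative choices, not values from the chapter:

```python
import numpy as np

def power_spectrogram(x, frame_len=512, hop=256):
    """Split x into overlapping frames, apply a Hann window, and stack the
    power spectral density of each frame column-wise (frequency x time)."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    # rfft returns the spectrum over the positive frequency bins only
    return np.stack([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=1)

# One second of a 100 Hz sinusoid sampled at 8 kHz: each column of the
# spectrogram should peak near the fundamental frequency.
sr = 8000
t = np.arange(sr) / sr
spec = power_spectrogram(np.sin(2 * np.pi * 100 * t))
peak_hz = spec[:, 0].argmax() * sr / 512  # frequency of the strongest bin
```

Because the test signal is periodic and stationary, every column shows the same maximum; for speech, the peaks move over time, which is exactly what the spectrogram visualizes.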

#### *2.2.2 Mel-spectrogram*

The Mel-Spectrogram is a reduced version of the spectrogram. The use of this feature is very widespread for machine learning-based systems in general and for Deep Learning-based text-to-speech (TTS) in particular.

The intuition behind this feature is to compress the representation of speech in the higher frequencies, based on the fact that the human ear is more sensitive to some frequencies than to others. The Mel Scale is an experimental function representing the sensitivity of the human ear as a function of frequency.

The conversion of frequency *f* in Mel-frequency *m* is:

$$m = 2595 \cdot \log\_{10} \left( 1 + \frac{f}{700} \right) \tag{3}$$
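Equation (3) is easy to check numerically; the function below is a minimal sketch (the name `hz_to_mel` is ours), and its values confirm the compression of high frequencies discussed in the text:

```python
import math

def hz_to_mel(f):
    # Equation (3): Mel value for a frequency f in Hz
    return 2595 * math.log10(1 + f / 700)

# The Mel scale is anchored so that 1000 Hz maps to roughly 1000 Mel
print(round(hz_to_mel(1000), 2))   # 999.99

# Low frequencies span many Mel, high frequencies few:
print(hz_to_mel(2000))                     # > 1500 Mel for [0, 2000] Hz
print(hz_to_mel(10000) - hz_to_mel(8000))  # < 300 Mel for [8000, 10000] Hz
```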

**Figure 3** shows the curve of the Mel Scale as a function of frequency. As one can observe, an interval of low frequencies is mapped to a larger interval of Mel values than an interval of high frequencies. As an example, the interval [0, 2000] Hz is mapped to more than 1500 Mel, while the interval [8000, 10000] Hz is mapped to less than 300 Mel.

**Figure 2.** *Spectrum (top) and spectrogram (bottom) of a speech segment.*

**Figure 3.** *Mel Scale representing the perception of frequencies.*

**3. Modeling of emotion expressiveness**

Emotion modeling is one of the main challenges in developing more natural human-machine interfaces. Among the many existing approaches, two are widely used in applications. A first representation is Ekman's six basic emotion model [2], which identifies anger, disgust, fear, happiness, sadness, and surprise as six basic categories of emotions from which other emotions may be derived. Emotions can also be represented in a multidimensional continuous space, as in Russell's circumplex model [3]. This model makes it possible to better reflect the complexity of and the variations in emotional expressions, unlike a category system. The two most commonly used dimensions in the literature are arousal, corresponding to the level of excitation, and valence, corresponding to the pleasantness or positiveness of the emotion. A third dimension is sometimes added: dominance, corresponding to the level of power of the speaker relative to the listener.

A more recent way of representing emotions is based on ranking, which prefers a relative preference method to annotate emotions rather than labeling them with absolute values [4]. The reason is that humans are not reliable at assigning absolute values to subjective concepts; they are, however, better at discriminating between elements shown to them. Therefore, the design of perception tasks, for example about emotion or style in speech, should take this into account by asking participants to solve comparison tasks rather than rating tasks.

It is important to note that many other approaches exist [5], and it is a difficult question to know which approach should be used in applications in the field of Human-Computer Interaction. Indeed, these psychological models of affect are proposed explanations of how emotions are expressed, but these propositions are difficult to assess in practice.

Humans express their emotions via various channels: face, gesture, speech, etc. Different people will express and perceive emotions differently depending on their personality, their culture, and many other aspects. For developing an application, one therefore has to make assumptions to reduce its scope and choose one approach to emotion modeling.

In this chapter, we are interested in how the synthesized expressive speech will be perceived. It is therefore reasonable to begin by choosing a language and assuming the origin of the synthesized voice.

Research has recently evolved toward systems that use the signal, or the spectrogram of the signal, as input without preprocessing: the neural network learns on its own the features that best correspond to the task it is supposed to perform. This principle has been successfully applied to the modeling of emotions, currently constituting the state of the art in speech emotion recognition [6, 7].

**4. Expressive speech synthesis**

The goal behind a speech synthesis system is to generate an audio speech signal corresponding to any input text.

**4.1 A brief history of speech synthesis techniques and how to control expressiveness**

*The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach*
*DOI: http://dx.doi.org/10.5772/intechopen.89849*
