**2. Expressive speech analysis**

#### **2.1 Digital signal processing**

A signal is a variation of a physical quantity carrying information. An acoustic signal is a variation of pressure in a fluid that humans perceive through the sense of hearing; the acoustic speech signal is converted into an electrical signal by a microphone. This signal is one-dimensional because it can be represented by a mathematical function of a single variable, time, giving the pressure at each instant.

The electrical signal generated by the microphone is an analog signal. In order to process it with a digital machine, it must be digitized. This is done by electronic systems called analog-to-digital converters, which sample and quantize analog signals to convert them into digital signals. After some processing of the digitized signal, a digital-to-analog converter can be used to convert the processed digital signal back into an analog signal. This analog electrical signal can then be converted into an acoustic signal through loudspeakers or earphones to make it available to human ears. These steps are represented in **Figure 1**.

**Figure 1.**
*Digital signal processing for acoustic signals.*

Digital signal processing [1] is the set of theories and techniques for analyzing, synthesizing, quantifying, classifying, predicting, or recognizing signals, using digital systems.

A digital system receives as input a sequence of samples $\{x(0), x(1), x(2), \dots\}$, noted $x(n)$, and produces as output a sequence of samples $y(n)$ after application of a series of algebraic operations.

A digital filter is a linear and invariant digital system. Let us consider a digital system that receives the sample sequences $x_1(n)$ and $x_2(n)$ as input and respectively produces the sample sequences $y_1(n)$ and $y_2(n)$ as output. This system is linear if it produces the output $\alpha y_1(n) + \beta y_2(n)$ when it receives the sequence $\alpha x_1(n) + \beta x_2(n)$ as input. A digital system is said to be invariant if shifting the input sequence by $n_0$ samples also shifts the output sequence by $n_0$ samples.

These linear and invariant digital systems can be described by equations of the type:

$$y(n) + a_1 y(n-1) + a_2 y(n-2) + \dots + a_N y(n-N) = b_0 x(n) + b_1 x(n-1) + \dots + b_M x(n-M) \tag{1}$$

or

$$y(n) + \sum_{i=1}^{N} a_i y(n-i) = \sum_{i=0}^{M} b_i x(n-i) \tag{2}$$

This is equivalent to saying that the output $y(n)$ is a linear combination of the $N$ previous outputs, the input $x(n)$, and the $M$ previous inputs. A digital filter is therefore fully determined once the coefficients $a_i$ and $b_i$ are known. A filter is called non-recursive if only the inputs are used to compute $y(n)$; if at least one of the previous output samples is used, it is called a recursive filter.

#### **2.2 Speech features**

Speech is a signal carrying a lot of information, extending from the sequence of words used to create a sentence to the tone of voice used to utter it. Not all of this information is necessary to process, and for some systems, trying to process all of it can harm efficiency. Moreover, the speech signal can pick up noise before reception. That is why an important step in speech analysis is to extract descriptors, or features, that are relevant to the task of interest.

There exist many different feature spaces that describe speech information. In this section, we give an intuitive explanation of the ones widely used in deep learning architectures.

*2.2.1 Power spectral density and spectrogram*

Fourier analysis demonstrates that any physical signal can be decomposed into a sum of sinusoids of different frequencies. The power spectral density of a signal describes the amount of power carried by the different frequency bands of this signal. This range of frequencies may be a discrete value set or a continuous frequency spectrum. In the field of digital signal processing, the power spectral density can be calculated by the Fast Fourier Transform (FFT) algorithm.

The graph of the power spectral density allows one to visualize the frequency characteristics of a signal, such as the fundamental frequency of a periodic signal and its harmonics. A periodic signal is a signal whose period repeats indefinitely; the number of periods per unit of time is the fundamental frequency. Harmonics are the integer multiples of the fundamental frequency. These frequencies carry an important power density and therefore appear as maxima in the power spectral density. An example of power spectral density is shown in the upper part of **Figure 2**. The first maximum is at the fundamental frequency, which is 145.5 Hz; the other maxima are the harmonics.

When the signal's characteristics evolve over time, as with the voice, the spectrogram can be used to visualize this evolution. The spectrogram represents the power spectral density over time. An example of power spectrogram is shown in the lower part of **Figure 2**. The x-axis is time and the y-axis is frequency; the colors correspond to the power density, with a color scale given on the right of the graph. The spectrogram is thus constructed by juxtaposing power spectral density functions computed on every frame, as suggested in **Figure 2**.

*2.2.2 Mel-spectrogram*

The Mel-spectrogram is a reduced version of the spectrogram. The use of this feature is very widespread in machine learning-based systems in general and in deep learning-based TTS in particular.

*The Theory behind Controllable Expressive Speech Synthesis: A Cross-Disciplinary Approach. DOI: http://dx.doi.org/10.5772/intechopen.89849*

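As a concrete illustration, the recursive difference Equation (2) can be implemented directly, sample by sample. The sketch below is ours, not from the chapter, and the coefficient values in the usage examples are arbitrary:

```python
def digital_filter(x, a, b):
    """Apply Equation (2): y(n) + sum_{i=1..N} a[i-1]*y(n-i)
                                = sum_{i=0..M} b[i]*x(n-i).
    a = [a_1, ..., a_N] (recursive part), b = [b_0, ..., b_M]."""
    y = []
    for n in range(len(x)):
        # Feed-forward part: current and M previous inputs
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        # Feedback part: N previous outputs, moved to the right-hand side
        acc -= sum(a[i - 1] * y[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        y.append(acc)
    return y

# Non-recursive (FIR) filter: 3-point moving average, no a coefficients
smoothed = digital_filter([3.0, 3.0, 3.0, 3.0], a=[], b=[1/3, 1/3, 1/3])

# Recursive (IIR) filter: y(n) - 0.5*y(n-1) = x(n), fed with an impulse
decay = digital_filter([1.0, 0.0, 0.0, 0.0], a=[-0.5], b=[1.0])
```

The moving average uses only inputs and is therefore non-recursive; the second filter feeds back the previous output, making it recursive, with an impulse response that decays by half at each step.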
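To make the power spectral density of Section 2.2.1 concrete, here is a minimal periodogram-style estimate using NumPy's FFT. The test signal (a 145.5 Hz fundamental plus two harmonics, echoing the Figure 2 example) and the sampling rate are our own assumptions, not values from the chapter:

```python
import numpy as np

fs = 16000                         # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)      # one second of signal
# Periodic test signal: fundamental at 145.5 Hz plus two harmonics
x = (np.sin(2 * np.pi * 145.5 * t)
     + 0.5 * np.sin(2 * np.pi * 2 * 145.5 * t)
     + 0.25 * np.sin(2 * np.pi * 3 * 145.5 * t))

# Power spectral density estimate via the FFT (periodogram)
X = np.fft.rfft(x)
psd = (np.abs(X) ** 2) / (fs * len(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

peak = freqs[np.argmax(psd)]       # frequency of the largest maximum
```

As the text describes, the largest maximum of the PSD falls at the fundamental frequency, with smaller maxima at its harmonics.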

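Likewise, the frame-by-frame construction of the spectrogram described in Section 2.2.1 can be sketched directly: compute a PSD on each windowed frame and juxtapose the columns. Frame length, hop size, and the Hann window below are common choices we assume, not the chapter's:

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=256):
    """Power spectrogram: one PSD column per windowed frame."""
    window = np.hanning(frame_len)
    cols = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        cols.append(np.abs(spectrum) ** 2)   # power per frequency bin
    return np.stack(cols, axis=1)            # shape: (freq_bins, n_frames)

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
sig = np.sin(2 * np.pi * 440.0 * t)          # a pure tone as test input
S = spectrogram(sig)                         # time on axis 1, frequency on axis 0
```

Each column of `S` is the power spectral density of one frame, so plotting `S` with time on the x-axis and frequency on the y-axis reproduces the spectrogram layout described for Figure 2.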
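Finally, the mel-spectrogram reduces the spectrogram by pooling frequency bins through a bank of triangular filters spaced evenly on the mel scale. The sketch below uses the common convention $m = 2595 \log_{10}(1 + f/700)$; the filter count and design details are our assumptions, not necessarily the authors' choices:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft_bins, fs):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft_bins - 1) * hz_points / (fs / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft_bins))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for j in range(left, center):           # rising slope
            fb[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):          # falling slope
            fb[i, j] = (right - j) / max(right - center, 1)
    return fb

fb = mel_filterbank(80, 257, 16000)
# Given a power spectrogram S of shape (257, n_frames), the
# mel-spectrogram is the matrix product:  mel_S = fb @ S
```

Because the mel scale is roughly logarithmic in frequency, the filterbank keeps fine resolution at low frequencies and merges high-frequency bins, which is why the mel-spectrogram is a compact representation favored by the deep learning-based TTS systems mentioned above.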
