*Human 4.0 - From Biology to Cybernetic*

Maximizing likelihood assuming a Gaussian distribution is equivalent to minimizing Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{14}$$

Maximizing likelihood assuming a Laplacian distribution is equivalent to minimizing Mean Absolute Error (MAE):

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right| \tag{15}$$

To choose the right criterion to optimize when working with speech data, one should pay attention to speech probability distributions. The distributions of speech waveforms and magnitude spectrograms are Laplacian [21, 22]. That is why MAE loss should be used to optimize their predictions.
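The difference between Eqs. (14) and (15) can be seen numerically: squaring amplifies the heavy-tailed outliers that Laplacian-distributed data produce, while the absolute value does not. A minimal NumPy sketch (the function names and toy data are illustrative, not from the chapter):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error, Eq. (14): the maximum-likelihood criterion under Gaussian noise."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean Absolute Error, Eq. (15): the maximum-likelihood criterion under Laplacian noise."""
    return np.mean(np.abs(y - y_hat))

# One large, heavy-tailed error on a single frame barely moves MAE
# but inflates MSE, since squaring amplifies outliers.
y = np.zeros(100)
y_hat = np.zeros(100)
y_hat[0] = 10.0  # one large prediction error

print(mse(y, y_hat))  # 1.0
print(mae(y, y_hat))  # 0.1
```

This is why a model trained with MSE on Laplacian-distributed targets tends to over-penalize rare large deviations and produce over-smoothed predictions.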

**5. Summary and application**

In this chapter, we first briefly introduced digital signal processing and digital filtering, described the different possibilities of emotion representation, and presented the most important speech feature spaces in this context, namely the spectrogram and Mel-spectrogram.

Available speech synthesis methods were then presented, from concatenation of speech signal segments, to parametric modeling of speech production, to statistical parametric speech synthesis.

Most recent SPSS systems use Deep Learning, which can be seen as non-linear signal processing in which filters are optimized based on data.

We focused on the tools for SPSS and explained the Deep Learning architecture blocks that are used, along with the right loss functions based on the probability distributions of speech features.

To build a controllable expressive speech synthesis system, one should keep several concepts in mind. First, it is necessary to gather data and process them into a representation suitable for a Deep Learning algorithm, that is, text, Mel-spectrograms, and information about the expressiveness of speech. Then one has to design a Deep Learning architecture. Its operations should be inspired by the features to model (1D convolutions or RNN cells for long-term context, attention mechanisms for recursive relationships). It should provide a way to control expressiveness, either with a categorical representation [23] or a continuous representation [24]. It is also important that annotations not be acquired by asking humans to give absolute values on subjective concepts, but rather by asking them to compare examples. Finally, the parametric model should be trained with a loss function adapted to the probability distribution of the acoustic features, that is, MAE and Kullback-Leibler divergence loss.
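Such a training objective can be sketched as an MAE reconstruction term on the Mel-spectrogram plus a KL-divergence term regularizing a continuous expressiveness latent toward a standard normal prior. This is a minimal NumPy illustration under assumed shapes and names (batch of frames, diagonal-Gaussian latent); it is not the implementation of any specific system from the chapter:

```python
import numpy as np

def mae_loss(mel, mel_hat):
    """MAE reconstruction term, matched to the Laplacian distribution of magnitude spectrograms."""
    return np.mean(np.abs(mel - mel_hat))

def kl_to_standard_normal(mu, log_var):
    """KL divergence between a diagonal Gaussian q(z) = N(mu, exp(log_var)) and N(0, I),
    summed over latent dimensions and averaged over the batch."""
    return np.mean(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1))

def total_loss(mel, mel_hat, mu, log_var, beta=1.0):
    """Combined objective: MAE reconstruction + beta-weighted KL regularizer."""
    return mae_loss(mel, mel_hat) + beta * kl_to_standard_normal(mu, log_var)

# Toy check: with a perfect reconstruction and q(z) = N(0, I), both terms vanish.
rng = np.random.default_rng(0)
mel = rng.standard_normal((4, 80))   # batch of 4 "Mel-spectrogram" frames (80 bins)
mel_hat = mel.copy()                 # perfect reconstruction
mu = np.zeros((4, 16))               # 16-dimensional expressiveness latent
log_var = np.zeros((4, 16))
print(total_loss(mel, mel_hat, mu, log_var))  # 0.0
```

The `beta` weight trades off reconstruction fidelity against how tightly the expressiveness latent is pulled toward the prior.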

**Acknowledgements**

Noé Tits is funded through a PhD grant from the Fonds pour la Formation à la Recherche dans l'Industrie et l'Agriculture (FRIA), Belgium.



**Author details**

Noé Tits\*, Kevin El Haddad and Thierry Dutoit

Numediart Institute, University of Mons, Mons, Belgium

\*Address all correspondence to: noe.tits@umons.ac.be

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
