which make natural speech synthesis an extremely complex problem, with some of the most important ones being listed below:

**the written language** is a discrete, compressed representation of the spoken language aimed at transferring a message, irrespective of other factors pertaining to the speaker's identity, emotional state, etc. In addition, in almost any language the written symbols are not truly informative of their pronunciation, English being the most notable example. The individual sound produced by pronouncing a letter or sequence of letters is called a *phone*. One exception here is the Korean alphabet, whose symbols approximate the position of the articulatory organs; it was introduced in 1443 by King Sejong the Great to increase literacy among the Korean population. But for most languages, the so-called orthographic transparency is rather low (a small grapheme-to-phoneme lookup illustrating this opacity is sketched after this list);

**the human ear** is highly adapted to the frequency region in which the relevant information of speech resides (i.e. 50–8000 Hz). Any slight deviation from what is considered to be natural speech, and any artefact or unnatural sequence present in a waveform deemed to contain spoken content, will be immediately detected by the listener;

**speaker and speech variability** is a result of the uniqueness of each individual. No two persons have the same voice timbre or pronounce the same word in an identical manner. Moreover, a single person will never utter a word or a fixed message in exactly the same manner twice, even when the repetitions are consecutive;

**co-articulation effects** derive from the inertia of the articulatory organs. There are no abrupt transitions between sounds and, with very few exceptions, it is very hard to determine the exact boundary of each sound. Another result of co-articulation is the reduction or modification of the spoken form of a word or sequence of words, caused by the difficulty or impossibility of producing a smooth transition between particular phone pairs;

**prosody** is defined as the rhythm and melody, or intonation, of an utterance. Prosody is again related to the speaker's individuality, cultural heritage, education and emotional state. There is no clear system for describing the prosody of a spoken message, and one person's understanding of, for example, how to portray an angry state of mind is completely different from another's (a sketch extracting the F0 contour, the melodic component of prosody, follows this list);

**no fixed set of measurable factors** defines a speaker's identity and speaking characteristics. Therefore, the only way to reproduce a person's voice, for now, is to record that person and extract statistical information from the acoustic signal (a crude example of such statistics is sketched after this list);

**no objective measure** reliably correlates the physical representation of a speech signal with the perceptual evaluation of the synthesised speech's quality and/or appropriateness (mel-cepstral distortion, a common yet imperfect objective measure, is sketched after this list).

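To make the orthographic opacity above concrete, the sketch below implements the dictionary-based grapheme-to-phoneme lookup that TTS front-ends typically rely on. It is a minimal sketch, assuming ARPAbet phone notation as used in the freely available CMUdict (stress markers dropped); the four-entry lexicon is illustrative, not a real resource.

```python
# A minimal sketch of dictionary-based grapheme-to-phoneme (G2P) lookup, the
# usual workaround for opaque orthographies. Phones are written in ARPAbet
# notation, as in CMUdict (stress markers dropped); this four-word lexicon
# is purely illustrative.
LEXICON = {
    # the same letter sequence "ough" yields four different phone sequences:
    "though":  ["DH", "OW"],
    "through": ["TH", "R", "UW"],
    "tough":   ["T", "AH", "F"],
    "cough":   ["K", "AO", "F"],
}

def g2p(word: str) -> list[str]:
    """Return the phone sequence of a word; real front-ends back off to a
    trained statistical model for out-of-vocabulary words."""
    return LEXICON[word.lower()]

if __name__ == "__main__":
    for w in ("though", "through", "tough", "cough"):
        print(f"{w:8s} -> {' '.join(g2p(w))}")
```
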
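The melodic component of prosody is usually approximated by the fundamental-frequency (F0) contour. The following is a minimal sketch, assuming the librosa library and a hypothetical recording `utterance.wav`; it reduces the contour to a few summary statistics, while rhythm and phrasing would require further modelling.

```python
# A minimal sketch of extracting the melodic component of prosody: the
# fundamental-frequency (F0) contour of an utterance, here via the pYIN
# algorithm as implemented in librosa. The input file name is hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
voiced = f0[voiced_flag]  # keep only frames where F0 is defined

# Summary statistics of the contour -- a crude description of intonation
print(f"mean F0: {np.nanmean(voiced):.1f} Hz, "
      f"range: {np.nanmin(voiced):.1f}-{np.nanmax(voiced):.1f} Hz")
```
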
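Regarding the statistical description of a speaker, here is a minimal sketch, assuming librosa and hypothetical file names, of a crude fixed-length descriptor built from MFCC statistics; modern systems learn speaker embeddings from such acoustic features instead.

```python
# A minimal sketch of the "record and extract statistics" approach: a crude
# fixed-length speaker descriptor built from per-coefficient MFCC means and
# standard deviations. The file names below are hypothetical.
import numpy as np
import librosa

def speaker_stats(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a vector of 2 * n_mfcc MFCC means and standard deviations."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Distances between such vectors should, on average, be smaller for two
# recordings of the same speaker than for recordings of different speakers:
# d_same = np.linalg.norm(speaker_stats("spk1_a.wav") - speaker_stats("spk1_b.wav"))
# d_diff = np.linalg.norm(speaker_stats("spk1_a.wav") - speaker_stats("spk2_a.wav"))
```
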
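Finally, on the lack of an objective measure: mel-cepstral distortion (MCD) is among the most widely used objective scores, yet it correlates only weakly with listeners' judgements. Below is a minimal sketch, assuming librosa, using MFCCs as a stand-in for true mel-cepstra and naive length truncation in place of the dynamic time warping applied in real evaluations; the file names are hypothetical.

```python
# A minimal sketch of mel-cepstral distortion (MCD), a widely used objective
# score whose correlation with perceived quality is known to be weak. MFCCs
# stand in for true mel-cepstra, and the two signals are assumed to be
# time-aligned.
import numpy as np
import librosa

def mcd(ref_path: str, syn_path: str, n_mfcc: int = 13) -> float:
    """Frame-averaged mel-cepstral distortion in dB; the energy term c0 is
    excluded, as is conventional."""
    ref, sr = librosa.load(ref_path, sr=None)
    syn, _ = librosa.load(syn_path, sr=sr)
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(c_ref.shape[1], c_syn.shape[1])  # naive length alignment
    diff = c_ref[:, :n] - c_syn[:, :n]
    return float(10.0 / np.log(10.0) * np.sqrt(2.0)
                 * np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))

# print(f"MCD: {mcd('natural.wav', 'synthesised.wav'):.2f} dB")
```
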
The problems listed above have been solved, to some extent, in TTS systems by employing high-level machine learning algorithms, developing large expert resources, or limiting the applicability and use-case scenarios of the synthesised speech. In the following sections we describe each of the main components of a TTS system, with an emphasis on the acoustic modelling part, which still poses the greatest problems. We also index, in a dedicated section of the chapter, some of the freely available resources and tools which can aid the rapid development of a synthesis system for commercial IVAs, and we conclude with a discussion of some open problems in the final section.
