**5. Open resources and tools**

Deploying any research result into a commercial environment requires at least a baseline functional proof-of-concept from which to start optimising and adapting the system. The same holds for TTS, where the speech resources, text-processing tools, and system architectures in particular can first be tested and only then developed and migrated to the live solution. To aid this development, the following list indexes some of the most important resources and tools available for text-to-speech synthesis systems. It is by no means exhaustive, but rather a starting point. The official implementations pertaining to the published studies are marked as such. Where no official implementation was found, we relied on our experience and prior work to link an open tool that comes as close as possible to the original publication.

**Speech and text datasets and resources**

**Linguistic Data Consortium (LDC)** is a repository and distribution point for various language resources. Link: www.ldc.upenn.edu

**The European Language Resources Association (ELRA)** is a non-profit organisation whose main mission is to make Language Resources for Human Language Technologies available to the community at large. Link: www.elra.info/en/

**META-SHARE [101]** is an open and secure network of repositories for sharing and exchanging language data, tools and related web services. Link: www.meta-share.org

**OpenSLR** is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related mainly to speech recognition. Link: www.openslr.org

**LibriVox** is a group of worldwide volunteers who read and record public domain texts, creating free public domain audiobooks for download. Link: www.librivox.org

**Mozilla Common Voice** is part of Mozilla's initiative to help teach machines how real people speak. Link: www.commonvoice.mozilla.org/en/datasets

**Project Gutenberg** is an online library of free eBooks. Link: www.gutenberg.org

**LibriTTS [102]** is a multi-speaker English corpus of approximately 585 hours of read English speech designed for TTS research. Link: www.openslr.org/60/
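
As an illustration, LibriTTS can be fetched and iterated in Python via the `torchaudio` package, which ships a loader for this corpus; the subset name and returned tuple below follow torchaudio's documented interface.

```python
import torchaudio

# Download the small "dev-clean" subset and index it like a list.
dataset = torchaudio.datasets.LIBRITTS(root="./data", url="dev-clean", download=True)

# Each item bundles the audio with its original and normalised transcripts.
waveform, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id = dataset[0]
print(speaker_id, sample_rate, normalized_text)
```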

**The Centre for Speech Technology Voice Cloning Toolkit (VCTK) Corpus** includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences. Link: www.datashare.is.ed.ac.uk/handle/10283/2950

**CMU Wilderness Multilingual Speech Dataset [103]** is a speech dataset of aligned sentences and audio for some 700 different languages. It is based on readings of the New Testament. Link: www.github.com/festvox/datasets-CMU_Wilderness

**Text processing tools**

**Festival** is a complete TTS system, but its front-end tools can also be used independently. It supports several languages and dialects. Link: www.cstr.ed.ac.uk/projects/festival/

**CMUSphinx G2P** is a grapheme-to-phoneme conversion tool based on the Transformer sequence-to-sequence architecture. Link: www.github.com/cmusphinx/g2p-seq2seq

**Multilingual G2P** uses the eSpeak tool to generate phonetic transcriptions in multiple languages. Link: www.github.com/jcsilva/multilingual-g2p
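
As an illustration of eSpeak-driven phonetisation, the sketch below simply shells out to the `espeak` binary (assumed to be installed and on the PATH; `espeak-ng` accepts the same flags) and uses its `-q` and `--ipa` options to print an IPA transcription instead of producing audio.

```python
import subprocess

def espeak_phonemes(text: str, voice: str = "en") -> str:
    # -q: suppress audio output; --ipa: print the IPA transcription; -v: voice/language
    result = subprocess.run(
        ["espeak", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(espeak_phonemes("hello world"))           # e.g. həlˈəʊ wˈɜːld
print(espeak_phonemes("bonjour", voice="fr"))   # French transcription
```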

**Stanford NLP** tools include various text-processing and knowledge extraction tools for English and other languages. Link: www.nlp.stanford.edu/software/

**RecoAPy [104]** includes an easy-to-use interface for recording prompted speech, as well as a set of models able to perform high-accuracy phonetic transcription in 8 languages. Link: www.gitlab.utcluj.ro/sadriana/recoapy

**word2vec [15]** is a word embedding model that learns vector representations of words from large amounts of text data, capturing their semantic and other properties. Link: code.google.com/archive/p/word2vec/
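
A minimal training sketch, using the `gensim` reimplementation of word2vec rather than the original C tool; the parameters mirror the original `-size`, `-window` and `-min-count` flags.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["speech", "synthesis", "is", "fun"],
    ["neural", "speech", "synthesis"],
    ["text", "to", "speech"],
]

# sg=1 selects the skip-gram variant; epochs is raised for such a tiny corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["speech"].shape)          # (50,)
print(model.wv.most_similar("speech"))   # nearest neighbours in the toy space
```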

**GloVe [16]** is a word embedding method that learns from the co-occurrences of words in a text corpus, obtaining similar vector representations for words that occur in the same contexts. Link: www.nlp.stanford.edu/projects/glove/
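
The published GloVe vectors are distributed as plain-text files with one `word v1 ... vd` entry per line, so loading them needs no special library; a minimal loader (the file name below is one of those offered on the project page):

```python
import numpy as np

def load_glove(path: str) -> dict:
    """Parse a GloVe .txt file: one 'word v1 v2 ... vd' entry per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.50d.txt")
print(glove["speech"][:5])  # first five dimensions of the 'speech' vector
```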

**ELMo [17]** obtains contextualised word embeddings that model the semantics and syntax of a word, and can learn different representations of the same word in different contexts. Link: www.allennlp.org/elmo

**BERT [18]** is a Transformer-based model that obtains context-dependent word embeddings and can process sentences in parallel. Link: www.github.com/google-research/bert
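
A common way to obtain BERT embeddings is through the Hugging Face `transformers` library rather than the original repository; a minimal sketch:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bass swam under the bass drum.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)token; the two occurrences of "bass"
# receive different embeddings, unlike with word2vec or GloVe.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```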

**Speech synthesis systems**

**eSpeak** is a compact, open-source, formant-based software speech synthesiser. Link: www.espeak.sourceforge.net/ [Official]
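
Being a command-line tool, eSpeak is easy to script; a minimal sketch using its documented `-v` (voice), `-s` (speaking rate) and `-w` (output file) flags:

```python
import subprocess

# Synthesise a short utterance to a WAV file with the espeak binary,
# assumed to be installed and on the PATH.
subprocess.run(
    ["espeak", "-v", "en", "-s", "150", "-w", "hello.wav", "Hello from eSpeak."],
    check=True,
)
```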

**Festival** is a framework for building concatenative and HMM-based TTS systems, free for unrestricted commercial and non-commercial use. Link: www.cstr.ed.ac.uk/projects/festival/ [Official]

**MaryTTS [105]** is an open-source, multilingual TTS platform written in Java supporting diphone and unit selection synthesis. Link: http://mary.dfki.de/ [Official]
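
MaryTTS exposes an HTTP interface once its server is running (by default on port 59125); a minimal client sketch, with parameter names as documented for the `/process` endpoint:

```python
import requests

# Ask a locally running MaryTTS server to synthesise a WAV file.
params = {
    "INPUT_TEXT": "Hello from MaryTTS.",
    "INPUT_TYPE": "TEXT",
    "OUTPUT_TYPE": "AUDIO",
    "AUDIO": "WAVE_FILE",
    "LOCALE": "en_US",
}
response = requests.get("http://localhost:59125/process", params=params)
response.raise_for_status()
with open("mary.wav", "wb") as f:
    f.write(response.content)
```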

**HTS [36]** is the most commonly used implementation of HMM-based speech synthesis. Link: http://hts.sp.nitech.ac.jp/ [Official]

**Merlin [48]** is a Python implementation of DNN models for statistical parametric speech synthesis. Link: www.github.com/CSTR-Edinburgh/merlin [Official]

**IDLAK [106]** is a project to build an end-to-end neural parametric TTS system within the Kaldi ASR framework. Link: www.idlak.readthedocs.io/en/latest/ [Official]

**DeepVoice [50]** follows the structure of HMM-based TTS systems, but replaces all its components with neural networks. Link: www.github.com/israelg99/deepvoice

**Char2Wav [52]** is an end-to-end neural model trained on characters that can synthesise speech with the SampleRNN vocoder. Link: https://github.com/sotelo/parrot [Official]

**Tacotron [53]** is one of the most frequently used end-to-end neural synthesis systems, based on recurrent neural networks and an attention mechanism. Link: www.github.com/keithito/tacotron

**VoiceLoop [107]** is one of the first neural synthesisers which uses a buffer memory instead of recurrent layers and does not require an audio-to-phone alignment. Link: www.github.com/facebookarchive/loop [Official]

**Tacotron 2 [59]** is an enhanced version of Tacotron which modifies the attention mechanism and also uses the WaveNet vocoder to generate the output speech. Link: www.github.com/NVIDIA/tacotron2
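
As an example of running this system end to end, NVIDIA publishes Tacotron 2 and WaveGlow entry points on PyTorch Hub; the sketch below follows their published usage (entry-point names may change over time, and a CUDA GPU is assumed).

```python
import torch

hub_repo = "NVIDIA/DeepLearningExamples:torchhub"
tacotron2 = torch.hub.load(hub_repo, "nvidia_tacotron2", model_math="fp16").to("cuda").eval()
waveglow = torch.hub.load(hub_repo, "nvidia_waveglow", model_math="fp16").to("cuda").eval()
utils = torch.hub.load(hub_repo, "nvidia_tts_utils")

# Text -> padded character sequences -> Mel spectrogram -> waveform.
sequences, lengths = utils.prepare_input_sequence(["Hello world, this is Tacotron 2."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)  # raw audio at 22.05 kHz
```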

**DeepVoice 3 [61]** is a fully convolutional synthesis system that can synthesise speech in a multi-speaker scenario. Link: www.github.com/r9y9/deepvoice3_pytorch

**DCTTS [60]** (Deep Convolutional TTS) is a synthesis system that implements a two-step synthesis, first learning a coarse and then a fine-grained representation of the spectrogram. Link: www.github.com/tugstugi/pytorch-dc-tts

**ClariNet [62]** is the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. Link: www.github.com/ksw0306/ClariNet

**Transformer TTS [67]** replaces the recurrent structures of Tacotron 2 with attention mechanisms. Link: www.github.com/soobinseo/Transformer-TTS

**GAN-TTS [79]** is a GAN-based synthesis system that uses a generator to produce speech and multiple discriminators that evaluate the naturalness and text-adequacy of the output. Link: www.github.com/yanggeng1995/GAN-TTS

**FastSpeech [68]** is a feed-forward network based on the Transformer which generates the Mel-spectrogram in parallel, and uses a teacher-based length predictor to achieve this parallel generation. Link: www.github.com/xcmyz/FastSpeech
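
The heart of this parallel generation is the length regulator, which expands each phoneme-level encoder state to its predicted number of spectrogram frames so that all frames can be decoded at once; a minimal illustrative sketch in PyTorch (shapes and values are hypothetical):

```python
import torch

def length_regulate(encoder_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each text-side hidden state for its predicted number of frames.

    encoder_out: (num_phones, hidden_dim)
    durations:   (num_phones,) integer frame counts from the duration predictor
    """
    return torch.repeat_interleave(encoder_out, durations, dim=0)

h = torch.randn(4, 256)          # 4 phones, 256-dim hidden states
d = torch.tensor([3, 5, 2, 6])   # predicted durations in frames
frames = length_regulate(h, d)
print(frames.shape)              # torch.Size([16, 256])
```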

**FastSpeech 2 [69]** is an enhanced version of FastSpeech in which the teacher length predictor network is replaced by conditioning the output on duration, pitch, and energy extracted from the speech waveform at training time, and on their predicted values at inference. Link: www.github.com/ming024/FastSpeech2

**AlignTTS [70]** is a feed-forward Transformer-based network with a duration predictor which aligns the text and audio. Link: www.github.com/Deepest-Project/AlignTTS

**Mellotron [108]** is a multi-speaker TTS system able to express emotions by explicitly conditioning on rhythm and continuous pitch contours from an audio signal. Link: www.github.com/NVIDIA/mellotron [Official]

**Flowtron [82]** is an autoregressive normalising flow-based generative network for TTS, also capable of transferring style from one speaker to another. Link: www.github.com/NVIDIA/flowtron [Official]

**Glow-TTS [83]** is a flow-based generative model for parallel TTS using a dynamic programming method to achieve the alignment between text and speech. Link: www.github.com/jaywalnut310/glow-tts [Official]

**Speech synthesis system libraries**

**Mozilla TTS** is a deep learning library for TTS that includes implementations for Tacotron, Tacotron 2, Glow-TTS and vocoders such as MelGAN, WaveRNN and others. Link: www.github.com/mozilla/TTS [Official]

**NeMo** is a toolkit that includes solutions for TTS, as well as speech recognition and natural language processing tools. Link: www.github.com/NVIDIA/NeMo [Official]

**ESPnet-TTS [109]** is a toolkit that contains implementations of TTS systems such as Tacotron, Transformer TTS, FastSpeech and others. Link: www.github.com/espnet/espnet [Official]

**Parakeet** is a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by Baidu Research and other research groups. Link: www.github.com/PaddlePaddle/Parakeet [Official]

**Neural vocoders**

**WaveNet [55]** is an autoregressive and probabilistic model used to generate raw audio. It can also be conditioned on text to produce very natural output speech, but its complexity makes it very resource demanding. Link: www.github.com/r9y9/wavenet_vocoder
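
Most of the vocoders below are conditioned on (log-)Mel spectrograms rather than on text; a typical extraction sketch with `librosa` (the exact `n_fft`, `hop_length` and `n_mels` settings vary between implementations):

```python
import librosa
import numpy as np

# Load audio at the common 22.05 kHz TTS sampling rate.
y, sr = librosa.load("speech.wav", sr=22050)

# 80-band Mel spectrogram; these settings are typical but not universal.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, num_frames)
```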

**WaveRNN [88]** is a recurrent neural network-based vocoder that is able to generate audio faster than real time as a result of its compact architecture. Link: www.github.com/fatchord/WaveRNN

**FFTNet [86]**, inspired by WaveNet, also generates the waveform samples sequentially, with the current sample conditioned on the previous ones, but simplifies the architecture and allows real-time synthesis. Link: www.github.com/syang1993/FFTNet

**nv-WaveNet** is an open-source implementation of several different single-kernel approaches to the WaveNet variant described by [50]. Link: www.github.com/NVIDIA/nv-wavenet [Official]

**LPCNet [89]** is a variant of WaveRNN that improves the waveform generation by combining the recurrent neural architecture with linear prediction coefficients. Link: www.github.com/mozilla/LPCNet [Official]
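
To illustrate the linear-prediction half of LPCNet: each next sample is first estimated as a fixed weighted sum of the previous samples, so the neural network only has to model the much simpler residual; a toy sketch with hypothetical coefficients:

```python
import numpy as np

def lpc_predict(history: np.ndarray, coeffs: np.ndarray) -> float:
    """Linear prediction: estimate the next sample as a weighted sum of the
    previous len(coeffs) samples (history is ordered oldest to newest)."""
    return float(np.dot(coeffs, history[::-1]))

# Hypothetical 4th-order coefficients and the last four samples:
a = np.array([1.8, -1.2, 0.5, -0.1])
recent = np.array([0.10, 0.12, 0.15, 0.17])
print(lpc_predict(recent, a))  # prediction for the next sample
```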

**FloWaveNet [94]** is a flow-based generative model that can sample audio in real time. Compared to Parallel WaveNet and ClariNet, it requires only a single-stage training process. Link: www.github.com/ksw0306/FloWaveNet [Official]

**Parallel WaveGAN [95]** is a vocoder that uses adversarial training and provides fast and lightweight waveform generation. Link: www.github.com/kan-bayashi/ParallelWaveGAN

**WaveGlow [95]** vocoder borrows from Glow and WaveNet to generate raw audio from Mel spectrograms. It is a flow-based model implemented with a single network. Link: www.github.com/NVIDIA/waveglow [Official]

**MelGAN [90]** is a GAN-based vocoder that is able to generate coherent waveforms; the model is non-autoregressive and based on convolutional layers. Link: www.github.com/descriptinc/melgan-neurips [Official]

**GELP [91]** is a parallel neural vocoder utilising generative adversarial networks, and integrating a linear predictive synthesis filter into the model. Link: www.github.com/ljuvela/GELP

**SqueezeWave [100]** is a lightweight version of WaveGlow that can generate on-device speech output. Link: https://github.com/tianrengao/SqueezeWave [Official]

**WaveFlow [96]** is a flow-based model that includes WaveNet and WaveGlow as special cases and can synthesise audio faster than real-time. Link: www.github.com/L0SG/WaveFlow

**VocGAN [93]** is a GAN-based vocoder that can synthesise speech in real time even on a CPU. Link: www.github.com/rishikksh20/VocGAN

**WG-WaveNet [97]** combines a WaveGlow-like flow-based model with a WaveNet-based post-filter, and can synthesise speech without the need for a GPU. Link: www.github.com/BogiHsu/WG-WaveNet

**Speech synthesis challenges**

**Blizzard Challenge** is a yearly challenge in which teams develop TTS systems starting from more or less the same resources, and are jointly evaluated in a large-scale listening test. Link: http://www.festvox.org/blizzard/

**Voice Conversion Challenge** is a biennial challenge in which teams are asked to provide a high-quality solution for cloning the voice of a target speaker, either within the same language or cross-lingually. The results are also evaluated in a large-scale listening test. Link: http://www.vc-challenge.org/
