**3.2 The HELIOSPHERE Machine Learning Engine**

The HELIOSPHERE Machine Learning Engine is responsible for providing the AI models used by the various components. The deployment of algorithms/models relies on three main parts: (1) the data queried from the Data engine, which are needed for the training and testing phases; (2) the neural and ML models that are candidates for each component; and (3) the code required to tie everything together. The engine operates iteratively: a neural model is proposed and trained on the available data, all suitable candidate models are compared and evaluated, the most suitable one for the task at hand is selected, and the selected model is then deployed for the next debate or innovation cycle.
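To make the cycle concrete, the following is a minimal sketch of such a candidate-selection loop in PyTorch. The candidate architectures, the accuracy criterion, and all names (`make_mlp`, `select_best`, the placeholder data) are illustrative assumptions, not the actual HELIOSPHERE implementation.

```python
import torch
import torch.nn as nn

def make_mlp(hidden: int) -> nn.Module:
    # Two hypothetical candidate architectures, differing only in width.
    return nn.Sequential(nn.Linear(16, hidden), nn.ReLU(), nn.Linear(hidden, 2))

CANDIDATES = {"mlp_small": make_mlp(32), "mlp_large": make_mlp(128)}

def evaluate(model: nn.Module, x, y) -> float:
    # Held-out accuracy as the (illustrative) selection criterion.
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def select_best(x_train, y_train, x_test, y_test):
    loss_fn = nn.CrossEntropyLoss()
    scores = {}
    for name, model in CANDIDATES.items():
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(100):                              # training phase
            opt.zero_grad()
            loss_fn(model(x_train), y_train).backward()
            opt.step()
        scores[name] = evaluate(model, x_test, y_test)    # testing phase
    best = max(scores, key=scores.get)                    # compare candidates
    return best, CANDIDATES[best]                         # deploy for next cycle

# Placeholder data standing in for a query to the Data engine.
x = torch.randn(256, 16)
y = (x.sum(dim=1) > 0).long()
best_name, best_model = select_best(x[:200], y[:200], x[200:], y[200:])
```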

The engine utilises both TensorFlow<sup>2</sup> and PyTorch<sup>3</sup>, allowing further enrichment during model building (the code phase). The models can be accessed through internal API calls or through the APIs of the partners. Based on the models and structure, several main components will be available, for instance:

The Speech-to-text component operates in real time and is used during the debate as an automatic tool for closed captioning and for correcting errors that occur during the live transcription. It separates the audio stream into segments of a predefined length, with a buffer option for uninterrupted service. For each segment, denoising and feature extraction are performed (together comprising the pre-processing phase), and the results feed the generation of the acoustic model and the language model. A speaker diarisation tool discovers the different speakers and segments the incoming audio stream into individual speaker profiles, which allows the predictive models to be normalised for each speaker. Since debates often take place in noisy environments, the voice frequencies are separated from background sounds before the audio is submitted to the speech-to-text engine. The speech-to-text conversion distinguishes between different speakers and currently disregards background music, fast or garbled speech, and interruptions (such as applause, crowd cheering, or other speakers butting in). The final output is a textual format saved into the Data engine module with the required annotations for each debate and each participant. The output is also available for visualisation on the dashboard.
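As a rough illustration of the segmentation and pre-processing steps just described, the sketch below cuts a mono audio buffer into fixed-length segments with a small overlap buffer and computes toy log-energy features. The segment length, buffer size, and the trivial "denoising" step are invented placeholders, not the engine's actual parameters.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed 16 kHz mono PCM input
SEGMENT_S = 5.0        # predefined segment length (illustrative)
BUFFER_S = 0.5         # overlap buffer for uninterrupted service

def segment_stream(samples: np.ndarray):
    """Yield fixed-length segments that overlap by the buffer size,
    so no speech is lost at segment boundaries."""
    seg = int(SEGMENT_S * SAMPLE_RATE)
    buf = int(BUFFER_S * SAMPLE_RATE)
    step = seg - buf
    for start in range(0, max(len(samples) - buf, 1), step):
        yield samples[start:start + seg]

def preprocess(segment: np.ndarray) -> np.ndarray:
    """Toy pre-processing: mean removal standing in for denoising,
    then log-energy features per 25 ms frame standing in for real
    feature extraction."""
    segment = segment - segment.mean()
    frame = int(0.025 * SAMPLE_RATE)
    n = len(segment) // frame
    frames = segment[: n * frame].reshape(n, frame)
    return np.log((frames ** 2).sum(axis=1) + 1e-10)

audio = np.random.randn(SAMPLE_RATE * 12)   # 12 s of placeholder audio
features = [preprocess(s) for s in segment_stream(audio)]
```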

<sup>1</sup> https://hadoop.apache.org/

<sup>2</sup> https://www.tensorflow.org/

<sup>3</sup> https://pytorch.org/


The Language Model specifies which word combinations form semantically meaningful phrases and their probability of occurrence. The Dictionary is required to integrate the phonemes and the transcriptions of the different pronunciations of a word; it is characterised by the level of granularity of its phoneme transcriptions.
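A toy illustration of these two resources, under the assumption of a bigram language model and an ARPAbet-style pronunciation dictionary (both invented here for clarity):

```python
from collections import Counter

# Bigram language model: P(w2 | w1) estimated from a (toy) corpus.
corpus = "the debate begins the debate ends".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1: str, w2: str) -> float:
    # Maximum-likelihood estimate: count(w1 w2) / count(w1).
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# Dictionary: one or more phoneme-level transcriptions per word.
DICTIONARY = {
    "the": [["DH", "AH"], ["DH", "IY"]],       # two common pronunciations
    "debate": [["D", "IH", "B", "EY", "T"]],
}

print(bigram_prob("the", "debate"))  # 1.0 in this toy corpus
```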

Speech-to-speech translation is implemented as a hybrid system for three main reasons:


The ASR and Synthesis components are shared with other subsystems of the HELIOSPHERE ecosystem. As such, we will focus on developing the MT system and on the communication protocols between the different modules, so as to ensure a coherent speech-to-speech MT component. To allow an avatar to be synchronised with the text, intermediate post-processing is conducted to generate a set of visemes and timecodes from the phonemes of the translated text. We also consider this post-processing part of the MT component.
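A minimal sketch of this post-processing step, assuming the phoneme durations are already known (in practice they would come from forced alignment); the phoneme-to-viseme table below is a small invented excerpt:

```python
# Hypothetical phoneme-to-viseme mapping (ARPAbet-style phonemes).
PHONEME_TO_VISEME = {
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    "AA": "open_jaw", "IY": "wide_smile", "F": "lip_tuck", "V": "lip_tuck",
}

def visemes_with_timecodes(phonemes):
    """phonemes: list of (phoneme, duration_seconds) pairs.
    Returns a timecoded viseme track for avatar lip-sync."""
    t = 0.0
    track = []
    for ph, dur in phonemes:
        track.append((round(t, 3), PHONEME_TO_VISEME.get(ph, "neutral")))
        t += dur
    return track

# e.g. the phoneme sequence of a short translated utterance
print(visemes_with_timecodes([("M", 0.08), ("AA", 0.15), ("P", 0.07)]))
# [(0.0, 'closed_lips'), (0.08, 'open_jaw'), (0.23, 'closed_lips')]
```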

The MT component has two objectives: (i) to provide inclusiveness via translation for users who conduct the debates in different languages, and (ii) to provide inclusiveness via translation of the debates into English, generating content in the correct language and format for the analytics component. We apply three different MT systems to handle speech (in the form of audio input) and text: (i) a text-to-text bilingual MT system that translates from and to English; (ii) a text-to-text multilingual MT system that encapsulates multiple languages, including English, and aims to provide translation between language pairs for which bilingual parallel data is not available; and (iii) a multimodal, speech-text-to-text translation system that exploits both speech and text to improve the text translation.
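One plausible way to route a request between these three systems is sketched below; the language pairs and names are hypothetical and only illustrate the decision logic implied by (i)-(iii):

```python
# Hypothetical set of pairs covered by dedicated bilingual models (i).
BILINGUAL_PAIRS = {("de", "en"), ("en", "de"), ("fr", "en"), ("en", "fr")}

def pick_mt_system(src: str, tgt: str, has_audio: bool) -> str:
    if has_audio:
        # (iii) speech-text-to-text: exploit the audio to improve the translation.
        return "multimodal"
    if (src, tgt) in BILINGUAL_PAIRS:
        # (i) dedicated bilingual model from/to English.
        return "bilingual"
    # (ii) the multilingual model covers pairs without bilingual parallel data.
    return "multilingual"

assert pick_mt_system("de", "en", has_audio=False) == "bilingual"
assert pick_mt_system("el", "pt", has_audio=False) == "multilingual"
```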

HELIOSPHERE exploits neural MT approaches using open and free software, such as OpenNMT<sup>4</sup> and Marian<sup>5</sup>, which provide speech-to-text and multi-source translation. The goal is to improve the efficiency of our models and the architecture of our system to make it suitable for an HPC ecosystem. The third type of MT system mentioned above conducts a second-stage translation similar to automatic post-editing systems. It uses two types of input -- speech (user-generated audio) and text (the result of ASR or of the first-stage translation) -- and produces an improved version of the initial translation. Following positive examples from domain-adapted MT, gender-aware MT, and others, we will develop a
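The sketch below illustrates one possible shape of such a multi-source second-stage model in PyTorch: two encoders (speech features and first-stage text) feed a single decoder that emits the improved translation. The architecture and all dimensions are assumptions for illustration and do not reflect the OpenNMT- or Marian-based systems used in the project.

```python
import torch
import torch.nn as nn

class MultiSourcePostEditor(nn.Module):
    def __init__(self, vocab=1000, d=128, speech_dim=80):
        super().__init__()
        self.speech_enc = nn.GRU(speech_dim, d, batch_first=True)
        self.text_emb = nn.Embedding(vocab, d)
        self.text_enc = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, speech_feats, draft_tokens, prev_tokens):
        # Encode both sources, then sum their final states to initialise
        # the decoder (a simple stand-in for multi-source attention).
        _, h_speech = self.speech_enc(speech_feats)
        _, h_text = self.text_enc(self.text_emb(draft_tokens))
        dec_out, _ = self.decoder(self.text_emb(prev_tokens), h_speech + h_text)
        return self.out(dec_out)  # logits over the improved translation

model = MultiSourcePostEditor()
logits = model(torch.randn(2, 50, 80),           # speech features (e.g. fbank)
               torch.randint(0, 1000, (2, 12)),  # first-stage translation tokens
               torch.randint(0, 1000, (2, 12)))  # decoder input tokens
```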

<sup>4</sup> https://opennmt.net/

<sup>5</sup> https://marian-nmt.github.io/
