**5. Results**

The PERSIST platform was deployed on two physical servers at the University of Maribor, FERI. The functional scheme of the system is shown in **Figure 7**. The system is used mainly by clinicians, who define and schedule activities as part of a patient's care workflow (phase 1). Patients, in turn, execute those activities (phase 3). MSN and OHC are the main services within the system. The MSN service implements activities and makes their execution more natural by delivering a symmetric model of interaction, while the OHC service stores data and automates the execution of the clinical workflow.

**Figure 7.** *Functional flow of the integration phases: allocation of an activity (1), request for execution of the activity (2), implementation of the activity (3), creation of resource (4), and completion of the activity (5).*

*Multilingual Chatbots to Collect Patient-Reported Outcomes DOI: http://dx.doi.org/10.5772/intechopen.111865*

**Figure 8.** *Multimodal conversational response with ECAs.*

Questionnaires are available in six languages: Slovenian, English, Russian, Latvian, French, and Spanish. On the output side, the system presents the information generated by the chatbot through the female ECA Eva and the male ECA Adam (**Figure 8**), so that non-verbal elements accompany the synthesized speech. Raw text is thus presented to the user as multimodal output that combines a spoken communication channel with a synchronized visual communication channel. On the input side, the system accepts speech or text. Additionally, word-to-concept mapping is performed as part of spoken language understanding, which is needed to map user responses properly onto the answers expected by the PROs.
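To make the idea of word-to-concept mapping concrete, the following is a minimal sketch and not the PERSIST implementation. The lexicon `ANSWER_CONCEPTS`, the answer codes, and the function `map_to_answer` are all hypothetical names chosen for illustration; a real spoken language understanding component would be considerably richer.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical lexicon (an assumption, not taken from the paper):
# each questionnaire answer code is linked to words that signal it.
ANSWER_CONCEPTS: Dict[str, Set[str]] = {
    "not_at_all": {"no", "none", "never"},
    "a_little": {"little", "slightly", "somewhat"},
    "quite_a_bit": {"quite", "often", "frequently"},
    "very_much": {"very", "extremely", "always"},
}


def map_to_answer(
    utterance: str, concepts: Dict[str, Set[str]] = ANSWER_CONCEPTS
) -> Optional[str]:
    """Return the answer code whose keywords best match the utterance.

    Returns None when no keyword matches, so the caller can re-ask the question.
    On a tie, the option listed first in the lexicon wins.
    """
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    best_option: Optional[str] = None
    best_hits = 0
    for option, keywords in concepts.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_option, best_hits = option, hits
    return best_option
```

In this sketch, a free-form reply such as "Just a little, really" resolves to the `a_little` code, while a reply with no known keyword yields `None`, which a dialogue manager could treat as a cue to rephrase the question.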

We deployed the system on a server hosting five virtual machines on Proxmox VE 6.3-2; the server runs the Xubuntu 20.04 LTS operating system. The second platform, named PERSIST\_INFERENCE, runs Ubuntu Server 20.04 LTS and hosts the microservices for ASR, TTS, and ECA. The microservices are integrated through predefined topics and Kafka producers and consumers. To evaluate the hardware performance of the system, we simulated load on it and measured CPU usage, memory usage, and average response time for both Camel and the Rasa chatbot. The results are outlined in **Figures 9**-**11**.
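The source does not describe the load-testing tool that was used, so the following is only a sketch of how an average response time per number of simulated users could be measured. The function `stub_chatbot_request` is a hypothetical stand-in for a real request to a chatbot endpoint, and only response time is covered; CPU and memory usage would have to be sampled with operating-system tools.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable


def stub_chatbot_request(delay_s: float = 0.01) -> float:
    """Hypothetical stand-in for one chatbot request; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(delay_s)  # placeholder for the real network call
    return time.perf_counter() - start


def average_response_time(
    active_users: int, request: Callable[[], float] = stub_chatbot_request
) -> float:
    """Issue one request per simulated user concurrently and average the latencies."""
    with ThreadPoolExecutor(max_workers=active_users) as pool:
        futures = [pool.submit(request) for _ in range(active_users)]
        latencies = [future.result() for future in futures]
    return mean(latencies)


if __name__ == "__main__":
    # Small user counts keep the demonstration quick.
    for users in (5, 10, 20):
        print(f"{users} users: {average_response_time(users):.4f} s")
```

Passing the request function as a parameter keeps the harness independent of any particular chatbot, so the same loop could be pointed at different services under the same conditions.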

**Figure 9.** *CPU use (%) per active users.*

**Figure 10.** *Memory consumption (GB) per active users.*

**Figure 11.** *Graphical results of average response time per active user.*

As seen in **Figure 9**, CPU usage rises linearly as the number of active users is increased across the tests, from 11.65% with 25 active users to 56.04% with 1000 active users in the case of Camel, and mostly linearly from 5.86% with 25 active users to 30.44% with 1000 active users for the Rasa chatbot. Volatile memory usage stayed flat for both Camel and the Rasa chatbot and proved independent of the number of users (**Figure 10**): it was near 50% for Camel and near 25% for the Rasa chatbot. Further, **Figure 11** presents the MSN's internal average response time for requests with between 25 and 1000 active users. The response time is 0.1982 s with 25 active users and increases linearly with the number of users, reaching 1.74 s with 200 active users. Beyond that point it rises much more steeply, to a delay of 197.033 s with 1000 active users.
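As a quick reading aid, the reported endpoints can be turned into approximate per-user rates. The sketch below assumes a straight line between the two reported points in each case, which is a simplification; the helper name `slope` is introduced here for illustration only.

```python
def slope(x0: float, y0: float, x1: float, y1: float) -> float:
    """Rate of change between two measured points (rise over run)."""
    return (y1 - y0) / (x1 - x0)


# CPU usage endpoints reported for 25 and 1000 active users (percent).
camel_cpu_per_user = slope(25, 11.65, 1000, 56.04)
rasa_cpu_per_user = slope(25, 5.86, 1000, 30.44)

# Response time in the range reported as linear, 25 to 200 active users (seconds).
latency_per_user = slope(25, 0.1982, 200, 1.74)

print(f"Camel CPU: {camel_cpu_per_user:.4f} percentage points per user")
print(f"Rasa CPU:  {rasa_cpu_per_user:.4f} percentage points per user")
print(f"Latency:   {latency_per_user * 1000:.2f} ms per user (25 to 200 users)")
```

Under this assumption, each additional user costs roughly 0.046 percentage points of CPU for Camel and 0.025 for the Rasa chatbot, and adds about 8.8 ms of average response time in the range up to 200 users.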


The models for the end-to-end ASR system SPREAD were trained for six languages on a DGX-1 with 8 × V100 GPUs (8 × 32 GB of GPU memory), while the inference engine had two RTX 8000 GPUs (2 × 48 GB of GPU memory). The audio datasets used contained at least 1700 h of speech. The best model reached a WER of 2.6%, and all other models stayed below 9% WER. The quality of the end-to-end TTS system PLATTOS was assessed with MUSHRA listening tests [63], performed by PERSIST consortium partners. In total, 21 consortium members participated, all generally with background knowledge in this field. Different TTS architectures were evaluated, and the architecture based on Tacotron and WaveGlow was rated best. PLATTOS scored around 82 on the 100-point scale for all six languages, which shows that the generated speech is highly intelligible and understandable. Further, the evaluation of the multimodal conversational response was reported in [61], where 30 individuals assigned an average score of 3.45 on the five-level Likert scale. The results indicate that the system produces a viable and believable natural user interface.
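For readers less familiar with the WER metric quoted above, the sketch below computes it in the standard way, as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The scoring tool actually used for SPREAD is not described in the source, and the function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, one wrong word in a four-word reference gives a WER of 0.25, so a WER of 2.6% corresponds to roughly one error in every 38 reference words.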
